TEES on new dataset that have the same format as DDI corpus

jbjorne / TEES

Turku Event Extraction System

TEES on new dataset that have the same format as DDI corpus #24

Closed by arwhirang 7 years ago

arwhirang commented 7 years ago

Hello, I found that the NLM corpus is pretty much the same as the DDI corpus.

I tried to apply the TEES code to the data, but the conversion process requires some pre-processed data that, in the case of DDI, is available only as a download.

Is there any way to apply TEES preprocessing to a new dataset?

jbjorne commented 7 years ago

The TEES preprocessor can be used to prepare any data in TXT or Interaction XML format for use with the TEES training or classification programs.

arwhirang commented 7 years ago

Thank you for the fast response. Although this resolves the issue, I will post the results of running the preprocessor.

arwhirang commented 7 years ago

I have tried the preprocessor, but it does not seem to be fully adapted to DDI-format XML. As a test, I tried to preprocess an XML file from the original DDI'11 data, "Abciximab_ddi.xml", but I ran into errors that a few small fixes did not resolve. For example, after fixing some errors I reached this line in the file GeniaSentenceSplitter.py:

sentenceOffset = Range.charOffsetToSingleTuple(sentence.get("charOffset"))

The sentence object does not have the attribute "charOffset".

I would really like to try TEES, but the preprocessing is not very welcoming. What do you suggest I do in this case?
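
Before patching the preprocessor, it may help to confirm which sentences in the input actually lack the attribute that the quoted line reads. The following is a minimal diagnostic sketch, not part of TEES: it uses only the Python standard library, the function name `sentences_missing_attr` is hypothetical, and the sample XML is invented for illustration rather than copied from "Abciximab_ddi.xml".

```python
import xml.etree.ElementTree as ET


def sentences_missing_attr(xml_text, attr="charOffset"):
    """Return the ids of <sentence> elements that lack the given attribute."""
    root = ET.fromstring(xml_text)
    missing = []
    for sentence in root.iter("sentence"):
        if sentence.get(attr) is None:
            missing.append(sentence.get("id"))
    return missing


# Hypothetical DDI-style snippet, invented for illustration only.
SAMPLE = """<document id="d0">
  <sentence id="d0.s0" text="Drug A may interact with drug B.">
    <entity id="d0.s0.e0" charOffset="0-5" type="drug" text="Drug A"/>
  </sentence>
  <sentence id="d0.s1" charOffset="33-50" text="Use with caution."/>
</document>"""

if __name__ == "__main__":
    # Lists the sentences that would hand None to the offset parser.
    print(sentences_missing_attr(SAMPLE))
```

If every sentence in a file is reported, the file is probably not in the format the preprocessor expects, and converting it is likely to be more productive than fixing individual errors.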

jbjorne commented 7 years ago

The preprocessor input must be either TXT or Interaction XML files. The DDIExtraction Shared Task format (such as the file "Abciximab_ddi.xml") is related to Interaction XML, but is not exactly the same file format.

In order to use the preprocessor, you must convert your data into either TXT files or into Interaction XML files. For more documentation on Interaction XML please see https://github.com/jbjorne/TEES/wiki/Interaction-XML. You can also look at the corpus files installed by TEES (by default these can be found at ~/.tees/corpora) for more examples of Interaction XML files.
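
One way to study the installed corpus files as examples is to list which attributes appear on each element, which shows the shape a converted file would need to match. The sketch below is a standard-library helper and not a TEES tool: the function names are hypothetical, the in-memory demo XML is invented, and the file extensions it looks for under ~/.tees/corpora are an assumption.

```python
import gzip
import io
import os
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_stream(handle):
    """Map each element tag to the sorted attribute names seen on it."""
    seen = defaultdict(set)
    for _, elem in ET.iterparse(handle, events=("end",)):
        seen[elem.tag].update(elem.attrib.keys())
        elem.clear()  # free memory; the attributes are already recorded
    return {tag: sorted(names) for tag, names in seen.items()}


def summarize_file(path):
    """Summarize an XML file, opening it with gzip if it ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        return summarize_stream(handle)


if __name__ == "__main__":
    # Tiny in-memory demo; this XML is invented, not real TEES output.
    demo = io.BytesIO(b'<a x="1"><b y="2" z="3"/></a>')
    print(summarize_stream(demo))

    # Default corpus location named in the thread. The extensions
    # checked here (.xml and .xml.gz) are an assumption.
    corpus_dir = os.path.expanduser("~/.tees/corpora")
    if os.path.isdir(corpus_dir):
        for name in sorted(os.listdir(corpus_dir)):
            if name.endswith((".xml", ".xml.gz")):
                print(name)
                summary = summarize_file(os.path.join(corpus_dir, name))
                for tag, names in sorted(summary.items()):
                    print("  %s: %s" % (tag, ", ".join(names)))
```

Running the same summary on a DDI shared task file and comparing the two listings should show which elements and attributes a conversion needs to add or rename. The wiki page remains the authoritative description of the format.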

arwhirang commented 7 years ago

Oh, thank you. It was my mistake not to notice the Interaction XML wiki page.