tarsqi / ttk

Tarsqi Toolkit
Apache License 2.0

Redefine what primary source means #18

Open marcverhagen opened 8 years ago

marcverhagen commented 8 years ago

With XML input, TTK currently treats the text with the XML markup removed as the primary source, and it actually generates that text from the XML. This was an arbitrary decision made a couple of years ago.

This does not play nicely with integrating TTK into a UIMA pipeline at the VA end of things, where UIMA considers the XML to be part of the primary document. As a result, the offsets on the two sides are out of sync.
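To make the mismatch concrete, here is a minimal sketch. It is not TTK code; the sample document and the regex-based tag stripping are assumptions made purely for illustration.

```python
import re

# Hypothetical input document, invented for this example.
xml = "<doc><p>Kim left on <date>Monday</date>.</p></doc>"

# Primary source as TTK currently defines it: the text with the tags removed.
stripped = re.sub(r"<[^>]+>", "", xml)

# Primary source as the UIMA side sees it: the XML itself.
offset_in_stripped = stripped.index("Monday")  # 12
offset_in_xml = xml.index("Monday")            # 26

print(stripped)                           # Kim left on Monday.
print(offset_in_stripped, offset_in_xml)  # 12 26
```

An annotation that TTK places at offset 12 would point into the middle of `Kim left on` when it is read against the XML document.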

Rather than changing the source documents, we will update TTK to treat the XML as part of the primary source as well (or perhaps make this an option). At a minimum, this involves changes to the source parser and the tokenizer.
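As a rough idea of what the tokenizer change could look like, the sketch below skips markup but reports token offsets relative to the raw XML. The function name and the regex are hypothetical and not taken from TTK. It also ignores entities such as `&amp;`, comments, and CDATA sections, which a real implementation would have to handle.

```python
import re

# A tag (skipped), or a token: a run of word characters or one punctuation mark.
PIECE = re.compile(r"(?P<tag><[^>]+>)|(?P<tok>\w+|[^\w\s])")

def tokenize_with_source_offsets(source):
    """Return (text, begin, end) triples with offsets into the raw XML."""
    tokens = []
    for m in PIECE.finditer(source):
        if m.lastgroup == "tag":
            continue  # markup is not a token, but it still occupies offsets
        tokens.append((m.group(0), m.start(), m.end()))
    return tokens

source = "<doc><p>Kim left on <date>Monday</date>.</p></doc>"
for text, begin, end in tokenize_with_source_offsets(source):
    print(text, begin, end)
```

Because the offsets index into the unmodified XML, `source[begin:end]` always gives back the token text.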

Also, TTK should keep the source as a separate document and create a new document with the annotations (including any tags that may be in the source).
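One possible shape for that separation is sketched below, assuming a simple stand-off representation: the source string is never modified, and the tags found in it become annotations that point back into it by offset. The `Annotation` class and `source_tags` function are hypothetical names, not part of TTK, and the regex only handles simple, well-nested elements.

```python
import re
from dataclasses import dataclass

TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")

@dataclass
class Annotation:
    name: str
    begin: int  # offset into the unmodified source, markup included
    end: int

def source_tags(source):
    """Collect element content spans as stand-off annotations over the raw source."""
    open_tags = []  # stack of (name, offset just past the opening tag)
    found = []
    for m in TAG.finditer(source):
        closing, name = m.group(1), m.group(2)
        if closing:
            open_name, begin = open_tags.pop()
            found.append(Annotation(open_name, begin, m.start()))
        elif not m.group(0).endswith("/>"):
            open_tags.append((name, m.end()))
    return found

source = "<doc><p>Kim left on <date>Monday</date>.</p></doc>"
annotations = source_tags(source)  # the separate annotation document
for a in annotations:
    print(a.name, a.begin, a.end, repr(source[a.begin:a.end]))
```

Annotations that TTK adds itself could go into the same list, so that everything refers to a single, unchanged source document.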