Closed marcverhagen closed 7 years ago
Here is a minimal input example:
<DOC id='1'>Simple Test</DOC>
If you run Tarsqi with:
python tarsqi.py --pipeline=PREPROCESSOR m.xml m2.ttk
then the output is:
<ttk>
<text>Simple Test
</text>
<metadata>
<dct value="20170330"/>
</metadata>
<source_tags>
<DOC id="1" begin="0" end="11" id="1" />
</source_tags>
<tarsqi_tags>
<docelement id="d1" begin="0" end="12" origin="DOCSTRUCTURE" type="paragraph" />
<ng id="c1" begin="0" end="11" origin="PREPROCESSOR" />
<s id="s1" begin="0" end="11" origin="PREPROCESSOR" />
<lex id="l1" begin="0" end="6" lemma="simple" origin="PREPROCESSOR" pos="JJ" text="Simple" />
<lex id="l2" begin="7" end="11" lemma="test" origin="PREPROCESSOR" pos="NN" text="Test" />
</tarsqi_tags>
</ttk>
This was also a problem when 'begin' or 'end' was an existing attribute; that has been fixed as well.
If the XML input to the pipeline has a tag with an id attribute, the output will have that attribute repeated, for example:
<DOC id="1" begin="2" end="596" id="34" />
This is because the toolkit adds its own identifiers; it should stop doing that.
The problem is that an element with a repeated attribute is not well-formed XML, so this will break downstream XML processing.
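As a quick illustration of the downstream breakage, here is a minimal sketch using Python's standard xml.etree.ElementTree parser on the tag shown above. The helper name is_well_formed is made up for this example and is not part of TTK.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise.

    Hypothetical helper for this illustration, not part of TTK.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Tag as reported in this issue: the id attribute appears twice.
duplicated = '<DOC id="1" begin="2" end="596" id="34" />'
# The same tag with the second id dropped, for comparison.
unique = '<DOC id="1" begin="2" end="596" />'

print(is_well_formed(duplicated))
print(is_well_formed(unique))
```

The parser rejects the first tag because of the repeated attribute and accepts the second, so any consumer that parses the ttk output with a conforming XML parser would fail on the whole file.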
In general, TTK's handling of identifiers needs to be revisited.
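One possible direction, sketched below under stated assumptions, is to build the attribute list from a dictionary so that each name can appear only once. The function source_tag_string and the "source_" prefix for clashing names are hypothetical choices made for this sketch; they do not describe how TTK actually writes source tags or how the fix was implemented.

```python
from xml.sax.saxutils import quoteattr

# Attribute names the toolkit needs for its own offsets.
RESERVED = ("begin", "end")


def source_tag_string(name, source_attrs, begin, end):
    """Render a source tag with offsets, emitting each attribute name once.

    Hypothetical helper, not part of TTK. No toolkit identifier is added,
    so an id in the source is written exactly once.
    """
    attrs = {"begin": str(begin), "end": str(end)}
    for key, value in source_attrs.items():
        # Assumption for this sketch: a source attribute that clashes with
        # a reserved name is kept under a prefixed name instead of being
        # written a second time.
        target = "source_" + key if key in RESERVED else key
        attrs[target] = value
    rendered = " ".join("%s=%s" % (k, quoteattr(v)) for k, v in attrs.items())
    return "<%s %s />" % (name, rendered)


print(source_tag_string("DOC", {"id": "1"}, 0, 11))
```

The sketch does not handle a source that already contains both begin and source_begin; a real fix would need to decide how to treat that case.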