Duplicate attribute in TTK format

marcverhagen commented 7 years ago

If the XML input to the pipeline has a tag with an id attribute, the output will have that attribute repeated, for example:

<DOC id="1" begin="2" end="596" id="34" />

This happens because the toolkit adds its own identifiers to source tags; it should stop doing that.

This matters because it breaks downstream XML processing: an element with a repeated attribute name is not well-formed XML, so parsers will reject the file.
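
To illustrate the breakage, here is a minimal sketch using Python's standard library on the tag shown above. The helper name `is_well_formed` is made up for this example and is not part of TTK.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_string):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        # Raised by the underlying parser for a duplicated attribute, among other errors
        return False

# The tag from the report, with the id attribute repeated
duplicated = '<DOC id="1" begin="2" end="596" id="34" />'
# The same tag with the toolkit-added id left out
clean = '<DOC id="1" begin="2" end="596" />'

print(is_well_formed(duplicated))
print(is_well_formed(clean))
```

Any downstream tool built on a conforming XML parser should fail in the same way on the first tag.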

In general, TTK's handling of identifiers needs to be revisited.

marcverhagen commented 7 years ago

Here is a minimal input example:

<DOC id='1'>Simple Test</DOC>

If you run Tarsqi with

python tarsqi.py --pipeline=PREPROCESSOR m.xml m2.ttk

then your output is

<ttk>
<text>Simple Test
</text>
<metadata>
  <dct value="20170330"/>
</metadata>
<source_tags>
  <DOC id="1" begin="0" end="11" id="1" />
</source_tags>
<tarsqi_tags>
  <docelement id="d1" begin="0" end="12" origin="DOCSTRUCTURE" type="paragraph" />
  <ng id="c1" begin="0" end="11" origin="PREPROCESSOR" />
  <s id="s1" begin="0" end="11" origin="PREPROCESSOR" />
  <lex id="l1" begin="0" end="6" lemma="simple" origin="PREPROCESSOR" pos="JJ" text="Simple" />
  <lex id="l2" begin="7" end="11" lemma="test" origin="PREPROCESSOR" pos="NN" text="Test" />
</tarsqi_tags>
</ttk>

marcverhagen commented 7 years ago

This was also a problem when 'begin' or 'end' was an existing attribute; that has been solved too.
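
For readers who want a picture of what guarding against such collisions could look like, here is a rough sketch. It is not the code that was committed to TTK: the function `source_tag_as_xml`, the `RESERVED` tuple, and the `source_` renaming policy are all assumptions made for illustration. The idea is to leave the source tag's own id untouched (no toolkit identifier is added) and to rename source attributes that clash with the offsets the toolkit writes.

```python
from xml.sax.saxutils import quoteattr

# Attribute names the toolkit writes itself on source tags (assumption for this sketch)
RESERVED = ("begin", "end")

def source_tag_as_xml(name, attrs, begin, end):
    """Serialize a source tag with toolkit offsets and no repeated attribute names."""
    merged = {}
    for key, value in attrs.items():
        # Hypothetical policy: prefix source attributes that clash with the offsets
        new_key = "source_" + key if key in RESERVED else key
        merged[new_key] = value
    # Add the toolkit offsets last; no extra id is added, so the source id survives once
    merged["begin"] = begin
    merged["end"] = end
    pairs = " ".join("%s=%s" % (k, quoteattr(str(v))) for k, v in merged.items())
    return "<%s %s />" % (name, pairs)

print(source_tag_as_xml("DOC", {"id": "1"}, 0, 11))
```

With the minimal input above this would print a DOC tag carrying a single id, begin, and end. A real fix would also need to decide what to do if the source already uses a name like `source_begin`.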

tarsqi / ttk

Duplicate attribute in TTK format #32