Wrong insertion of closing </g> tags for <g> tag pairs that span zero tokens

GoogleCodeExporter commented 9 years ago

With reinsert.pm r116:

Tokenizer ir <g id="0"> programma , kas <g id="1"> </g> <g id="2"> sadala
<g id="3"> </g> </g> ievadīto & $ tekstu teikumos , un teikumus vārdos14 . </g>

Tokenizer |0-0| programma |2-2| have |1-1| to be |4-4| , the |3-3| sadala
|5-5| ievadīto |6-6| & |7-7| $ |8-8| tekstu |9-9| teikumos |10-10| ,
|5-5| |11-11|
and the |12-12| teikumus |13-13| vārdos14 |14-14| . |15-15|

Result is:
Tokenizer <g id="0"> programma have to be , the <g id="1"> <g id="2"> sadala 
</g> <g id="3"> ievadīto & $ tekstu teikumos , and the teikumus vārdos14 .
</g> </g> </g>

But it should be:
Tokenizer <g id="0"> programma have to be , the <g id="1"> </g>  <g id="2"> 
sadala <g id="3"> </g> </g> ievadīto & $ tekstu teikumos , and the teikumus
vārdos14 </g>

Original issue reported on code.google.com by Achi...@gmail.com on 31 Jan 2012 at 11:35

GoogleCodeExporter commented 9 years ago

Fixed in reinsert.pm r116 with result:
Tokenizer <g id="0"> programma have to be , the <g id="1"> <g id="2"> </g> 
sadala </g> <g id="3"> </g> ievadīto & $ tekstu teikumos , and the teikumus 
vārdos14 . </g>

Not exactly what was expected, but the current algorithm cannot:
1. Output opening and closing tags in a specific order before a phrase, 
i.e. it cannot output "<g id="1"> </g> <g id="2">". It first outputs all opening 
tags, then all closing tags before the phrase, then the phrase, then all 
closing tags after the phrase. Note that an order cannot necessarily be 
determined: the combination of tag pairs around target phrases can differ from 
the source. If you need strict tag order, you can use the alternative mechanism 
in wrap_markup.pm (this also prevents any phrase reordering across markup).
2. Close <g id="2"> after <g id="3">. The former is associated with the phrase 
"sadala" only, so it needs to be closed after that phrase. <g id="3"> is 
associated with the phrase starting with "ievadīto".
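
The fixed emission order from point 1 can be illustrated with a small Python sketch. This is not the reinsert.pm code; the function name emit_phrase and its parameters are hypothetical, and it only assumes that each phrase already knows which tags open before it and how many close before and after it.

```python
def emit_phrase(phrase, opening_ids, closing_before, closing_after):
    """Emit one target phrase with its tags in the fixed order:
    opening tags, closing tags before the phrase, the phrase itself,
    closing tags after the phrase. (Illustrative only, not m4loc code.)"""
    parts = []
    parts += ['<g id="%d">' % i for i in opening_ids]  # 1. all opening tags
    parts += ["</g>"] * closing_before                  # 2. closings of zero-token pairs
    parts.append(phrase)                                # 3. the phrase
    parts += ["</g>"] * closing_after                   # 4. closings after the phrase
    return " ".join(parts)


# The two phrases from the result above:
print(emit_phrase("sadala", [1, 2], 1, 1))
print(emit_phrase("ievadīto", [3], 1, 0))
```

Because step 1 always runs to completion before step 2, a sequence such as '<g id="1"> </g> <g id="2">' can never be produced, which is why the result differs from the expected output.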

Fixing this further would be a feature request, but it runs into the problem 
already described in 1. More fundamentally, tag combinations such as 
'<g id="1"> </g>' with no token between them should not really occur. They 
should instead be isolated tags such as '<x id="1"/>'.
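
As a possible workaround on the input side, zero-token pairs could be rewritten to isolated tags before the text reaches the pipeline. The following Python sketch is an assumption for illustration, not part of m4loc; the name isolate_empty_pairs is hypothetical, and it assumes tags are written exactly as in the example above (numeric id, double quotes, only whitespace between the opening and closing tag).

```python
import re

# Matches a <g> pair with nothing but whitespace between its tags.
EMPTY_G_PAIR = re.compile(r'<g id="(\d+)">\s*</g>')


def isolate_empty_pairs(text):
    """Replace every zero-token <g id="N"> </g> pair with <x id="N"/>."""
    return EMPTY_G_PAIR.sub(r'<x id="\1"/>', text)


source = 'kas <g id="1"> </g> <g id="2"> sadala <g id="3"> </g> </g> ievadīto'
print(isolate_empty_pairs(source))
```

Pairs that enclose at least one token, such as <g id="2"> around "sadala", are left untouched.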

Original comment by Achi...@gmail.com on 31 Jan 2012 at 11:40

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Correction: fixed with reinsert.pm r119

Original comment by Achi...@gmail.com on 31 Jan 2012 at 11:42

lvapeab / m4loc

Wrong insertion of closing </g> tags for <g> tag pairs that span zero tokens #33