ufal / media-irozhlas


Wrong tokenization causes errors in IDs #13

Closed matyaskopp closed 2 years ago

matyaskopp commented 2 years ago

This will probably be fixed by #12.

$ sed -n "/invalid XML/{N;p}" releases/20210927163532/annotate.log 
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-25.xml
:1406: validity error : ID doc-8504166.p2.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-68.xml
:3412: validity error : ID doc-8366498.p1.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-121.xml
:1654: validity error : ID doc-8296059.p1.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-205.xml
:1040: validity error : ID doc-8220534.p1.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-367.xml
:907: validity error : ID doc-8048593.p13.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-472.xml
:2434: validity error : ID doc-8461269.p1.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-504.xml
:336: validity error : ID doc-7947359.p13.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-540.xml
:2497: validity error : ID doc-7902654.p1.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-572.xml
:2317: validity error : ID doc-7804251.p4.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-603.xml
:1472: validity error : ID doc-7775871.p27.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-645.xml
:2447: validity error : ID doc-7586783.p3.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-681.xml
:2145: validity error : ID doc-8426969.p2.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-738.xml
:1456: validity error : ID doc-7407565.p1.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-934.xml
:2619: validity error : ID doc-6638832.p5.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-980.xml
:2549: validity error : ID doc-6309951.p6.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-1008.xml
:4834: validity error : ID doc-5995242.p11.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-1087.xml
:2539: validity error : ID doc-8404268.p6.s1 already defined
 -- invalid XML in /opt/irozhlas/data/data-out/udpipe/corpus-1107.xml
:2314: validity error : ID doc-5956135.p4.s1 already defined

matyaskopp commented 2 years ago

not fixed !!!

grep -o '[^<]*doc-8504166.p2.s1"[^>]*><!--[^>]*>' /opt/irozhlas/data/data-out/udpipe/corpus-28.xml
s xml:id="doc-8504166.p2.s1"><!-- Čeští horolezci Marek Holeček a Radoslav Groh vylezli alpským stylem v Nepálu na horu Baruntse, která je vysoká 7129 metrů. -->
s xml:id="doc-8504166.p2.s1"><!-- Horolezci se spojili z Káthmándú s Lucií Výbornou a byli hosty středečního vysílání Radiožurnálu. -->
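
The grep above checks one known ID at a time. To list every repeated `xml:id` in a file in one pass, something like the following might help. This is only a rough sketch, not part of the pipeline: the function name `duplicate_ids` and the sample document (including its `doc-1...` IDs) are made up for illustration, and it assumes the file is well-formed XML that fits in memory.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes the predeclared xml: prefix under this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_ids(xml_text):
    """Return a sorted list of xml:id values that occur more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample: two sentences share the same ID, as in the log above.
sample = """<text>
  <p xml:id="doc-1.p2">
    <s xml:id="doc-1.p2.s1">First sentence.</s>
    <s xml:id="doc-1.p2.s1">Second sentence.</s>
    <s xml:id="doc-1.p2.s2">Third sentence.</s>
  </p>
</text>"""

print(duplicate_ids(sample))  # ['doc-1.p2.s1']
```

For a real corpus file, the text could be read with `open(path, encoding="utf-8").read()` and passed to `duplicate_ids`; an empty list would mean no repeated IDs were found.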

matyaskopp commented 2 years ago

see: /opt/irozhlas/data/data-out/udpipe/corpus-71.xml

UDPipe does not preserve paragraphs (https://github.com/ufal/ParCzech/issues/151): it concatenates these two articles:

matyaskopp commented 2 years ago

fixed: