ufal / ParCzech

ParCzech is a project on compiling Czech parliamentary data into annotated corpora.
https://ufal.mff.cuni.cz/parczech

Validation error - missing speaker and u/@ana #189

Closed matyaskopp closed 2 years ago

matyaskopp commented 2 years ago
ERROR ParlaMint-CZ_2022-06-15-ps2021-000-01-000-999: Element u with empty @who ParlaMint-CZ_2022-06-15-ps2021-000-01-000-999.u1
/home/parczech/ParlaMint/DataCZ.raw/ParlaMint-CZ/2022/ParlaMint-CZ_2022-06-15-ps2021-000-01-000-999.xml:109:81: error: element "u" missing required attribute "ana"
matyaskopp commented 2 years ago
   <text ana="#covid">
      <body>
         <div type="debateSection">
            <pb source="https://www.psp.cz/eknih/2021ps/stenprot/220615/s901001.htm" n="1" xml:id="ParlaMint-CZ_2022-06-15-ps2021-000-01-000-999.pb1" corresp="#ps2021-000-01-000-999.audio1"/>
            <u who="" xml:id="ParlaMint-CZ_2022-06-15-ps2021-000-01-000-999.u1">
               <seg xml:id="ParlaMint-CZ_2022-06-15-ps2021-000-01-000-999.u1.p1">Vystoupení prezidenta Ukrajiny J. E. pana Volodymyra Zelenského před oběma komorami Parlamentu České republiky</seg>
            </u>
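
The two validator messages above both point at this first `u` element: its `who` is empty and it has no `ana`. A minimal Python sketch of the same check follows, assuming the files use the TEI namespace `http://www.tei-c.org/ns/1.0`. The function name `find_bad_utterances`, the sample document, and the values `#SomeSpeaker` and `#regular` are illustrative placeholders, not part of the ParCzech or ParlaMint tooling.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_bad_utterances(xml_text):
    """Return (xml:id, problem) pairs for u elements that have an
    empty or missing @who, or no @ana attribute at all."""
    root = ET.fromstring(xml_text)
    problems = []
    for u in root.iter(f"{{{TEI_NS}}}u"):
        uid = u.get(XML_ID, "?")
        if not (u.get("who") or "").strip():
            problems.append((uid, "empty or missing @who"))
        if u.get("ana") is None:
            problems.append((uid, "missing @ana"))
    return problems


# Hypothetical sample: the first u mirrors the defect in this issue,
# the second u is well-formed (attribute values are placeholders).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text ana="#covid"><body><div type="debateSection">
<u who="" xml:id="demo.u1"><seg xml:id="demo.u1.p1">Title</seg></u>
<u who="#SomeSpeaker" ana="#regular" xml:id="demo.u2"><seg xml:id="demo.u2.p1">Speech</seg></u>
</div></body></text></TEI>"""

for uid, problem in find_bad_utterances(SAMPLE):
    print(uid, problem)
```

Running such a check before schema validation would flag the title-like first utterance early; how the element should then be repaired is a separate decision.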
matyaskopp commented 2 years ago

TODO:

Current values:

raw 2022/ParlaMint-CZ_2022-06-15-ps2021-000-01-000-999.xml

         <extent>
            <measure unit="speeches" quantity="7" xml:lang="cs">7 promluv</measure>
            <measure unit="speeches" quantity="7" xml:lang="en">7 speeches</measure>
            <measure unit="words" quantity="3052" xml:lang="cs">3052 slov</measure>
            <measure unit="words" quantity="3052" xml:lang="en">3052 words</measure>
         </extent>
            <namespace name="http://www.tei-c.org/ns/1.0">
               <tagUsage gi="text" occurs="1"/>
               <tagUsage gi="body" occurs="1"/>
               <tagUsage gi="div" occurs="1"/>
               <tagUsage gi="note" occurs="12"/>
               <tagUsage gi="pb" occurs="4"/>
               <tagUsage gi="u" occurs="7"/>
               <tagUsage gi="seg" occurs="38"/>
               <tagUsage gi="kinesic" occurs="2"/>
               <tagUsage gi="vocal" occurs="0"/>
               <tagUsage gi="incident" occurs="0"/>
               <tagUsage gi="gap" occurs="0"/>
               <tagUsage gi="desc" occurs="2"/>
               <tagUsage gi="time" occurs="5"/>
            </namespace>

ana 2022/ParlaMint-CZ_2022-06-15-ps2021-000-01-000-999.ana.xml

         <extent>
            <measure unit="speeches" quantity="7" xml:lang="cs">7 promluv</measure>
            <measure unit="speeches" quantity="7" xml:lang="en">7 speeches</measure>
            <measure unit="words" quantity="3052" xml:lang="cs">3052 slov</measure>
            <measure unit="words" quantity="3052" xml:lang="en">3052 words</measure>
         </extent>
            <namespace name="http://www.tei-c.org/ns/1.0">
               <tagUsage gi="text" occurs="1"/>
               <tagUsage gi="body" occurs="1"/>
               <tagUsage gi="div" occurs="1"/>
               <tagUsage gi="note" occurs="12"/>
               <tagUsage gi="pb" occurs="4"/>
               <tagUsage gi="u" occurs="7"/>
               <tagUsage gi="seg" occurs="38"/>
               <tagUsage gi="kinesic" occurs="2"/>
               <tagUsage gi="vocal" occurs="0"/>
               <tagUsage gi="incident" occurs="0"/>
               <tagUsage gi="gap" occurs="0"/>
               <tagUsage gi="desc" occurs="2"/>
               <tagUsage gi="s" occurs="143"/>
               <tagUsage gi="name" occurs="186"/>
               <tagUsage gi="time" occurs="5"/>
               <tagUsage gi="date" occurs="14"/>
               <tagUsage gi="unit" occurs="0"/>
               <tagUsage gi="num" occurs="6"/>
               <tagUsage gi="email" occurs="0"/>
               <tagUsage gi="ref" occurs="0"/>
               <tagUsage gi="w" occurs="3096"/>
               <tagUsage gi="pc" occurs="508"/>
               <tagUsage gi="linkGrp" occurs="143"/>
               <tagUsage gi="link" occurs="3582"/>
            </namespace>
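
The `tagUsage` values listed above are declared counts. One way to compare them with what the `text` element actually contains is sketched below in Python. This is an assumption about how such a check could look, not the project's pipeline code; the function name `tag_usage_mismatches` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"


def tag_usage_mismatches(xml_text):
    """Compare each tagUsage/@occurs with the number of matching TEI
    elements found under <text> (including <text> itself).
    Returns {gi: (declared, actual)} for every count that differs."""
    root = ET.fromstring(xml_text)
    text = root.find(f"{TEI}text")
    actual = Counter()
    if text is not None:
        for el in text.iter():
            # Skip anything that is not a TEI-namespaced element.
            if isinstance(el.tag, str) and el.tag.startswith(TEI):
                actual[el.tag[len(TEI):]] += 1
    mismatches = {}
    for tu in root.iter(f"{TEI}tagUsage"):
        gi = tu.get("gi")
        declared = int(tu.get("occurs", "0"))
        if declared != actual[gi]:
            mismatches[gi] = (declared, actual[gi])
    return mismatches


# Hypothetical sample: the header declares 7 u elements, the body has 1.
USAGE_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader><encodingDesc><tagsDecl><namespace name="http://www.tei-c.org/ns/1.0">
<tagUsage gi="text" occurs="1"/>
<tagUsage gi="u" occurs="7"/>
<tagUsage gi="seg" occurs="1"/>
</namespace></tagsDecl></encodingDesc></teiHeader>
<text><body><div><u><seg>a</seg></u></div></body></text></TEI>"""

print(tag_usage_mismatches(USAGE_SAMPLE))
```

The same comparison could be run on both the raw and the `.ana` file, since changing or removing a `u` element affects the declared counts in each.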
matyaskopp commented 2 years ago

Done. Fixed in all stages of the data pipeline.