ufal / ParCzech

ParCzech is a project on compiling Czech parliamentary data into annotated corpora.
https://ufal.mff.cuni.cz/parczech

audio alignment anchors #208

Open matyaskopp opened 1 month ago

matyaskopp commented 1 month ago

problem

Currently, the audio alignment follows this structure:

<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w1.ab"/>
<w xml:id="ps2013-001-01-000-999.u1.p1.s1.w1" lemma="vážený" pos="ADJ" msd="UPosTag=ADJ|Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass" ana="pdt:AAFS5----1A----">Vážení</w>
<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w1.ae"/> 

<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w2.ab"/>
<w xml:id="ps2013-001-01-000-999.u1.p1.s1.w2" lemma="paní" pos="NOUN" msd="UPosTag=NOUN|Case=Voc|Gender=Fem|Number=Sing|Polarity=Pos" ana="pdt:NNFS5-----A----">paní</w>
<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w2.ae"/> 

<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w3.ab"/>
<w xml:id="ps2013-001-01-000-999.u1.p1.s1.w3" lemma="poslankyně" pos="NOUN" msd="UPosTag=NOUN|Case=Voc|Gender=Fem|Number=Sing|Polarity=Pos" ana="pdt:NNFS5-----A----" join="right">poslankyně</w>
<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w3.ae"/>

<pc xml:id="ps2013-001-01-000-999.u1.p1.s1.w4" lemma="," pos="PUNCT" msd="UPosTag=PUNCT" ana="pdt:Z:-------------">,</pc> 

<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w5.ab"/>
<w xml:id="ps2013-001-01-000-999.u1.p1.s1.w5" lemma="vážený" pos="ADJ" msd="UPosTag=ADJ|Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass" ana="pdt:AAMP5----1A----">vážení</w>
<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w5.ae"/> 

<!-- ... -->

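Every aligned token is wrapped in two anchors: one immediately before it, whose @synch ends in .ab, and one immediately after it, whose @synch ends in .ae.

This is fragile, because linking an anchor to its token relies on those specific suffixes in @synch and on the anchors being placed directly adjacent to the token.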
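To illustrate the coupling, here is a minimal Python sketch of how a consumer might have to recover the token id from an anchor today. It is not ParCzech code: the helper name `token_id_from_synch`, the trimmed sample (most attributes and the TEI namespace are omitted), and the decision to raise on unknown suffixes are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, namespace-free sample based on the snippet above (assumption:
# real files carry the TEI namespace and the full set of attributes).
SAMPLE = """<s>
<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w1.ab"/>
<w xml:id="ps2013-001-01-000-999.u1.p1.s1.w1">Vážení</w>
<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w1.ae"/>
</s>"""


def token_id_from_synch(synch):
    """Hypothetical helper: derive the token id from @synch by naming convention only."""
    target = synch.lstrip("#")
    for suffix in (".ab", ".ae"):
        if target.endswith(suffix):
            return target[: -len(suffix)]
    # Any other naming scheme silently breaks the anchor-to-token link.
    raise ValueError(f"unexpected @synch format: {synch!r}")


root = ET.fromstring(SAMPLE)
links = [(a.get("synch"), token_id_from_synch(a.get("synch"))) for a in root.iter("anchor")]
```

The link exists only as a string convention, so renaming the targets of @synch or moving the anchors would require changing every consumer.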

solution

The proposal is to add a @corresp attribute to each anchor, pointing to the corresponding token:

<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w1.ab" corresp="ps2013-001-01-000-999.u1.p1.s1.w1"/>
<w xml:id="ps2013-001-01-000-999.u1.p1.s1.w1" lemma="vážený" pos="ADJ" msd="UPosTag=ADJ|Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass" ana="pdt:AAFS5----1A----">Vážení</w>
<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w1.ae" corresp="ps2013-001-01-000-999.u1.p1.s1.w1"/> 
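With @corresp in place, a consumer could group anchors by token without parsing @synch or looking at neighbouring elements. The following Python sketch shows one way this might look; the function name `anchors_by_token`, the trimmed namespace-free sample, and the tolerance for an optional leading `#` in @corresp are assumptions of this example, not part of the proposal.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed, namespace-free sample based on the proposed markup above.
SAMPLE = """<s>
<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w1.ab" corresp="ps2013-001-01-000-999.u1.p1.s1.w1"/>
<w xml:id="ps2013-001-01-000-999.u1.p1.s1.w1">Vážení</w>
<anchor synch="#ps2013-001-01-000-999.u1.p1.s1.w1.ae" corresp="ps2013-001-01-000-999.u1.p1.s1.w1"/>
</s>"""


def anchors_by_token(root):
    """Hypothetical helper: map token id -> @synch values in document order, via @corresp only."""
    grouped = defaultdict(list)
    for anchor in root.iter("anchor"):
        corresp = anchor.get("corresp")
        if corresp:  # anchors without @corresp are skipped
            grouped[corresp.lstrip("#")].append(anchor.get("synch"))
    return dict(grouped)


root = ET.fromstring(SAMPLE)
alignment = anchors_by_token(root)
```

Here the begin and end anchors are told apart only by document order, which is itself an assumption; the proposal as stated does not say how the two should be distinguished once the suffix convention is no longer relied on.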

Notes: