iljackb / Mixtepec_Mixtec

Mostly XML (TEI) markup of Mixtepec-Mixtec Language resources

What form to point at in standoff annotation (IPA or orthographic)? #88

Open iljackb opened 4 years ago

iljackb commented 4 years ago

Praat transcriptions have both an orthographic and a phonetic output in the TEI files, but the question of which one to point to in annotation is difficult.

            <u n="1" xml:id="d23e0" start="2.04" end="3.77">
               <seg xml:lang="mix" notation="orth" xml:id="T-seg-orth-2.04">
                  <w synch="#T2.56" xml:id="T-orth2.56">naá</w>
               </seg>
               <seg xml:lang="mix" notation="ipa" xml:id="T-seg-pron-2.04" sameAs="#T-seg-orth-2.04">
                  <w synch="#T2.56" xml:id="T-pron2.56" sameAs="#T-orth2.56">na˩a↗</w>
               </seg>
            </u>

When annotating (using pointers in <span>), I originally pointed to the ids of both the phonetic and the orthographic forms. For the first example, the grammatical annotation is:

            <spanGrp type="gram">
               <span type="pos" target="#T-orth2.56 #T-pron2.56" ana="#N"/>
            </spanGrp>

However, when the annotated unit is a phrase, compound, or multi-word expression, the span has to point to several words. Including the pointers for both the IPA and the orthographic forms then makes the annotation messy and hard to parse when harvesting the information. See, for example, the following annotation of the compound noun xini nta'a "finger" ("head" + "hand"):

            <u n="1" xml:id="d23e0" start="2.04" end="3.77">
               <seg xml:lang="mix" notation="orth" xml:id="T-seg-orth-2.04">
                  <w synch="#T2.56" xml:id="T-orth2.56">xini</w>
                  <w synch="#T2.75" xml:id="T-orth2.75">nta'a</w>
               </seg>
               <seg xml:lang="mix" notation="ipa" xml:id="T-seg-pron-2.04" sameAs="#T-seg-orth-2.04">
                  <w synch="#T2.56" xml:id="T-pron2.56" sameAs="#T-orth2.56">ʃi˧ni↗</w>
                  <w synch="#T2.75" xml:id="T-pron2.75" sameAs="#T-orth2.75">nda˧a˩</w>
               </seg>
            </u>

The annotation for this is incredibly cumbersome:

       <spanGrp type="gram">
          <span type="pos" target="#T-orth2.56 #T-orth2.75 #T-pron2.56 #T-pron2.75" ana="#N"/>
       </spanGrp>

So the obvious solution is to annotate only one of the two and use @sameAs to link the different representations of the same information (as I have done on the IPA forms in the example).
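To check that nothing is lost by pointing only at the orthographic forms, here is a rough Python sketch of how a harvester could recover the IPA forms by following @sameAs backwards. It is only an illustration under several assumptions: the elements carry no TEI namespace, the `<div>` wrapper and the function names `ipa_for_orth` and `expand_span` are hypothetical, and the span shown is the simplified orth-only version of the compound annotation.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified sample: no TEI namespace, hypothetical <div> wrapper.
SAMPLE = """<div>
  <u n="1" xml:id="d23e0" start="2.04" end="3.77">
    <seg xml:lang="mix" notation="orth" xml:id="T-seg-orth-2.04">
      <w synch="#T2.56" xml:id="T-orth2.56">xini</w>
      <w synch="#T2.75" xml:id="T-orth2.75">nta'a</w>
    </seg>
    <seg xml:lang="mix" notation="ipa" xml:id="T-seg-pron-2.04">
      <w synch="#T2.56" xml:id="T-pron2.56" sameAs="#T-orth2.56">ʃi˧ni↗</w>
      <w synch="#T2.75" xml:id="T-pron2.75" sameAs="#T-orth2.75">nda˧a˩</w>
    </seg>
  </u>
  <spanGrp type="gram">
    <span type="pos" target="#T-orth2.56 #T-orth2.75" ana="#N"/>
  </spanGrp>
</div>"""


def ipa_for_orth(root):
    """Map each orthographic xml:id to the <w> whose @sameAs points at it."""
    mapping = {}
    for w in root.iter("w"):
        same_as = w.get("sameAs")
        if same_as:
            mapping[same_as.lstrip("#")] = w
    return mapping


def expand_span(root, span):
    """Return (orthographic, IPA) text pairs for every pointer in @target."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    ipa = ipa_for_orth(root)
    pairs = []
    for pointer in span.get("target", "").split():
        orth_id = pointer.lstrip("#")
        pron = ipa.get(orth_id)
        pairs.append((by_id[orth_id].text,
                      pron.text if pron is not None else None))
    return pairs


root = ET.fromstring(SAMPLE)
pairs = expand_span(root, root.find(".//span"))
print(pairs)
```

With this kind of lookup, the span only needs the two orthographic pointers, and the IPA side comes along for free as long as every IPA <w> has a correct @sameAs.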

In cases like the following, this works perfectly (note that the segment "yu" marks 1st.sg), and the grammatical annotation can simply point to the orthographic form:

            <u n="4" xml:id="d62e0" start="6.89" end="11.59" who="#TS">
               <seg xml:lang="mix" function="utterance" notation="orth" xml:id="T-seg-orth-6.89">
                  <w synch="#T7.33" xml:id="T-orth7.33">chikaa</w>
                  <w synch="#T7.68" xml:id="T-orth7.68">yu</w>
                  <w synch="#T7.87" xml:id="T-orth7.87">tutu</w>
                  <w synch="#T8.45" xml:id="T-orth8.45">nuu</w>
                  <w synch="#T9.15" xml:id="T-orth9.15">ñuꞌu</w>
               </seg>
               <seg xml:lang="mix" function="utterance" notation="ipa" xml:id="T-seg-pron-6.89" sameAs="#T-seg-orth-6.89">
                  <w synch="#T7.33" xml:id="T-pron7.33" sameAs="#T-orth7.33">tʃika</w>
                  <w synch="#T7.68" xml:id="T-pron7.68" sameAs="#T-orth7.68">ju˩</w>
                  <w synch="#T7.87" xml:id="T-pron7.87" sameAs="#T-orth7.87">tu̪t̪u˩</w>
                  <w synch="#T8.45" xml:id="T-pron8.45" sameAs="#T-orth8.45">nũũ↗</w>
                  <w synch="#T9.15" xml:id="T-pron9.15" sameAs="#T-orth9.15">ɲũ˥ʔõ˩</w>
               </seg>
            </u>
               ....
            <spanGrp type="gram">
               <span type="pos" target="#T-orth7.33" ana="#V"/>
               <span type="transitivity" target="#T-orth7.33" ana="#TRANS"/>
               <span type="macrorole" target="#T-orth7.68" ana="#A"/>
               <span type="macrorole" target="#T-orth7.87" ana="#U"/>
               <span type="pos" target="#T-orth7.68" ana="#ENCLT"/>
               <span type="person" target="#T-orth7.68" ana="#1PERS"/>
               <span type="number" target="#T-orth7.68" ana="#SG"/>
               <span type="pos" target="#T-orth7.87" ana="#N"/>
               <span type="pos" target="#T-orth8.45" ana="#ADPOS"/>
               <span type="pos" target="#T-orth9.15" ana="#N"/>
            </spanGrp>
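For the harvesting side, a minimal Python sketch of how spans that each point at a single orthographic <w> could be collected into one feature set per word. This assumes un-namespaced elements and a hypothetical `<div>` wrapper around a shortened copy of the example; `harvest` is an illustrative name, not an existing tool.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened copy of the example: first three words only, no TEI namespace.
SAMPLE = """<div>
  <seg xml:lang="mix" notation="orth" xml:id="T-seg-orth-6.89">
    <w synch="#T7.33" xml:id="T-orth7.33">chikaa</w>
    <w synch="#T7.68" xml:id="T-orth7.68">yu</w>
    <w synch="#T7.87" xml:id="T-orth7.87">tutu</w>
  </seg>
  <spanGrp type="gram">
    <span type="pos" target="#T-orth7.33" ana="#V"/>
    <span type="transitivity" target="#T-orth7.33" ana="#TRANS"/>
    <span type="macrorole" target="#T-orth7.68" ana="#A"/>
    <span type="macrorole" target="#T-orth7.87" ana="#U"/>
    <span type="pos" target="#T-orth7.68" ana="#ENCLT"/>
    <span type="person" target="#T-orth7.68" ana="#1PERS"/>
    <span type="number" target="#T-orth7.68" ana="#SG"/>
    <span type="pos" target="#T-orth7.87" ana="#N"/>
  </spanGrp>
</div>"""


def harvest(root):
    """Return [(word text, {span type: ana value})] in document order."""
    features = defaultdict(dict)
    for span in root.iter("span"):
        for pointer in span.get("target", "").split():
            target_id = pointer.lstrip("#")
            features[target_id][span.get("type")] = span.get("ana", "").lstrip("#")
    # Words without any span simply get an empty feature dict.
    return [(w.text, features.get(w.get(XML_ID), {})) for w in root.iter("w")]


table = harvest(ET.fromstring(SAMPLE))
for word, feats in table:
    print(word, feats)
```

Because every @target holds exactly one pointer here, the grouping is a plain dictionary lookup; the four-pointer compound annotation above would need extra logic to decide which pointers belong together.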

However, the problem is that in many words a tone change marks the person (or tense, or aspect), and these changes are usually not marked in the orthographic form. For example, the verb sketa is spelled the same in the 1st.sg present tense as in the gloss/lemma form; the only difference is on the phonetic level, in the tone on the final vowel, e.g.:

               <seg xml:lang="mix" xml:id="d1e113" notation="orth">
                  <w xml:id="d1e114" synch="#T14">sketa</w>
                  <w xml:id="d1e116" synch="#T19">ntikii</w>
               </seg>
               <seg xml:lang="mix" xml:id="d1e118" notation="ipa">
                  <w xml:id="d1e119" synch="#T14">skɛ˥t̪a↘</w>
                  <w xml:id="d1e132" synch="#T19">nd̪i↘kiː↘↗ꜛ</w>
               </seg>
            </u>

So in a case like this, I could point to the final vowel of the verb and tag it as 1sg, because in the present/incompletive tense/aspect paradigm the 1sg is the only form to have the -a, but this would miss annotating the tone, which is very significant linguistically.

An important consideration is that content from written sources is obviously only orthographic, so only one form can be annotated there. There is thus already a precedent: more than half of the materials will have their annotations on the orthographic content anyway.

My proposal: by default I annotate the orthographic form (with @sameAs on the corresponding IPA forms, as shown above), but in cases like the one above, where the tone makes a minimal distinction, I point to the tone, which I'll mark as <m> (together with the vowel it occurs on, depending on the specific case), for only the given feature, as follows:

               <seg xml:lang="mix" xml:id="d1e113" notation="orth">
                  <w xml:id="T-orth-d1e114" synch="#T14">sketa</w>
                  <w xml:id="T-orth-d1e116" synch="#T19">ntikii</w>
               </seg>
               <seg xml:lang="mix" xml:id="d1e118" notation="ipa">
                  <w xml:id="T-pron-d1e114" synch="#T14" sameAs="#T-orth-d1e114">skɛ˥t̪<m xml:id="d1e120">a↘</m></w>
                  <w xml:id="T-pron-d1e116" synch="#T19" sameAs="#T-orth-d1e116">nd̪i↘kiː↘↗ꜛ</w>
               </seg>
            </u>

            <spanGrp type="gram">
               <span type="pos" target="#T-orth-d1e114" ana="#V"/>
               <span type="transitivity" target="#T-orth-d1e114" ana="#INTRANS"/>
               <span type="macrorole" target="#d1e120" ana="#A"/>
               <span type="person" target="#d1e120" ana="#1PERS"/>
               <span type="number" target="#d1e120" ana="#SG"/>
               <span type="pos" target="#T-orth-d1e116" ana="#ADV-TEMP"/>
            </spanGrp>
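One consequence of this proposal is that a span target can now be either an orthographic <w> or an <m> inside an IPA <w>, so a harvester has to handle both. A rough Python sketch of that resolution step, assuming un-namespaced elements, a hypothetical `<div>` wrapper, a well-formed <m> with its closing tag, and @sameAs values written with a leading "#"; `resolve` is an illustrative name only.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened sample: no TEI namespace, hypothetical <div> wrapper.
SAMPLE = """<div>
  <seg xml:lang="mix" xml:id="d1e113" notation="orth">
    <w xml:id="T-orth-d1e114" synch="#T14">sketa</w>
    <w xml:id="T-orth-d1e116" synch="#T19">ntikii</w>
  </seg>
  <seg xml:lang="mix" xml:id="d1e118" notation="ipa">
    <w xml:id="T-pron-d1e114" synch="#T14" sameAs="#T-orth-d1e114">skɛ˥t̪<m xml:id="d1e120">a↘</m></w>
    <w xml:id="T-pron-d1e116" synch="#T19" sameAs="#T-orth-d1e116">nd̪i↘kiː↘↗ꜛ</w>
  </seg>
  <spanGrp type="gram">
    <span type="pos" target="#T-orth-d1e114" ana="#V"/>
    <span type="person" target="#d1e120" ana="#1PERS"/>
  </spanGrp>
</div>"""


def resolve(root, pointer):
    """Resolve a pointer to the marked text, its enclosing <w>, and the orth form."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    # ElementTree has no parent links, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}

    target = by_id[pointer.lstrip("#")]
    word = target
    while word is not None and word.tag != "w":
        word = parent.get(word)
    if word is None:
        raise ValueError(f"{pointer} is not inside a <w>")

    # Follow @sameAs from an IPA word to its orthographic counterpart, if any.
    same_as = word.get("sameAs")
    orth = by_id.get(same_as.lstrip("#"), word) if same_as else word
    return {
        "marked": "".join(target.itertext()),
        "word_id": word.get(XML_ID),
        "orth": "".join(orth.itertext()),
    }


root = ET.fromstring(SAMPLE)
tone = resolve(root, "#d1e120")
verb = resolve(root, "#T-orth-d1e114")
print(tone)
print(verb)
```

Both kinds of pointer end up at the same orthographic word, so the word-level features (pos, transitivity) and the tone-level features (person, number, macrorole) can be merged under one entry when harvesting.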

What do you think?