iljackb / Mixtepec_Mixtec

Mostly XML (TEI) markup of Mixtepec-Mixtec Language resources

corpus2TeiDict-link-span-test.xsl need to limit match of $target xml:id to avoid false matches #68

Open iljackb opened 5 years ago

iljackb commented 5 years ago

In this stylesheet, I take the <w>'s in the document (testing on /Aves.xml) and create TEI dictionary entries. The test document contains bird names. The annotations being transferred to the dictionary are:

                 <spanGrp type="semantics">
                     <span type="sense"
                           target="#d1e135 #d1e141 #d1e145"
                           corresp="https://www.wikidata.org/wiki/Q651545"/>
                     <span type="sense"
                           target="#d1e135 #d1e141 #d1e145"
                           corresp="http://dbpedia.org/resource/American_kestrel"/>
                              ....
                  </spanGrp>

Some birds have multiple names in Mixtec, so the @target of a <span> or <link> may need to contain more than one pointer, e.g.:

                     <span type="sense"
                           target="#d1e135 #d1e141 #d1e145"
                           corresp="http://dbpedia.org/resource/American_kestrel"/>

I have to use contains() when defining the key data categories, e.g.:

  <xsl:variable name="wSense1"
    select="$readDoc/descendant::spanGrp[@type = 'semantics']/span[@type='sense'] 
   [contains(@target,$target)]/@corresp"/>

I define the key variable of $target as follows:

<xsl:variable name="target" as="xs:string" select="concat('#',$wID)"/>

However, this leads to the problem of false matches:

Many items such as:

                 <w xml:id="d1e135" xml:lang="mix">
                     <w xml:id="d1e136">tasu</w>
                     <w xml:id="d1e138">lunchi</w>
                  </w>
                  <w xml:id="d1e141" xml:lang="mix">
                     <w xml:id="d1e142">litu</w>
                  </w>
                  <w xml:id="d1e145" xml:lang="mix">
                     <w xml:id="d1e146">litsi</w>
                  </w>
                    ........
                  <linkGrp type="translation">
                     <link target="#d1e135 #d1e141 #d1e145 #d1e149"/>
                  </linkGrp>
                  <spanGrp type="translation">
                     <span target="#d1e135 #d1e141 #d1e145" xml:lang="en">American Kestral</span>
                  </spanGrp>
                  <spanGrp type="semantics">
                     <span type="sense"
                           target="#d1e135 #d1e141 #d1e145"
                           corresp="https://www.wikidata.org/wiki/Q651545"/>
                     <span type="sense"
                           target="#d1e135 #d1e141 #d1e145"
                           corresp="http://dbpedia.org/resource/American_kestrel"/>
.......
                  </spanGrp>

These end up as incorrectly merged entries because the @xml:id ("d1e145") of the target <w> (i.e. "litsi"):

                  <w xml:id="d1e145" xml:lang="mix">
                     <w xml:id="d1e146">litsi</w>
                  </w>

is also incorrectly matched against pointers later in the document whose ids merely begin with "d1e145" (e.g. "d1e1452" and "d1e1459"):

                  <w xml:id="d1e1452" xml:lang="es">
                     <w xml:id="d1e1453">carpintero</w>
                     <w xml:id="d1e1455">mexicano</w>
                  </w>
                  <linkGrp type="translation">
                     <link target="#d1e1440 #d1e1452"/>
                     <link target="#d1e1445 #d1e1452"/>
                  </linkGrp>

and

                  <w xml:id="d1e1459" xml:lang="mix" norm="taka pintu ncha'i">
                     <w xml:id="d1e1460">taka</w>
                     <w xml:id="d1e1462">pintu</w>
                     <w xml:id="d1e1464" orig="nchaꞌi">ncha'i</w>
                  </w>

This produces incorrect TEI dictionary entries such as the following (note that the only correct bird name for this entry should be "American Kestral"):

         <entry xml:id="American_kestrel Acorn_woodpecker">
            <form type="lemma">
               <orth xml:lang="mix">litsi</orth>
               ....
           </form>
            <gramGrp>
               <pos>noun</pos>
            </gramGrp>
            <sense corresp="https://www.wikidata.org/wiki/Q651545 http://dbpedia.org/resource/American_kestrel http://dbpedia.org/resource/Acorn_woodpecker">
....
               <cit type="translation">
                  <form>
                     <orth xml:lang="en">American Kestral Acorn woodpecker</orth>
                  </form>
               </cit>
               <cit type="translation">
                  <form>
                     <orth xml:lang="es">halcón cernícalo</orth>
                  </form>
               </cit>
               <cit type="translation">
                  <form>
                     <orth xml:lang="es">carpintero mexicano</orth>
                  </form>
               </cit>
               <cit type="translation">
                  <form>
                     <orth xml:lang="es">carpintero mexicano</orth>
                  </form>
               </cit>
               <cit type="translation">
                  <form>
                     <orth xml:lang="es">carpintero arlequín</orth>
                  </form>
               </cit>
               <cit type="translation">
                  <form>
                     <orth xml:lang="en">American Kestral Acorn woodpecker</orth>
                  </form>
               </cit>
               <cit type="translation">
                  <form>
                     <orth xml:lang="es">halcón cernícalo</orth>
                  </form>
               </cit>
               <cit type="translation">
                  <form>
                     <orth xml:lang="es">carpintero mexicano</orth>
                  </form>
               </cit>
               <cit type="translation">
                  <form>
                     <orth xml:lang="es">carpintero mexicano</orth>
                  </form>
               </cit>
               <cit type="translation">
                  <form>
                     <orth xml:lang="es">carpintero arlequín</orth>
                  </form>
               </cit>
            </sense>
         </entry>

So I understand the problem, but I can't find the right way to write the rule so that the script treats $target only as a complete string (I couldn't figure out how to specify that the end of $target must also be the end of the matched id).

I of course know regex '$' is probably what I need but I don't know where to put it and how to combine it with what I have...
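
One possible way to express "match $target only as a whole pointer" is sketched below. It assumes an XSLT 2.0 processor (which the `as="xs:string"` declaration above already implies) and reuses `$readDoc` and `$wID` from the stylesheet as they are used in this issue. The variable name `wSense1Regex` is hypothetical and only there so that both variants can be shown side by side; only one of the two would be needed.

```xml
<!-- Unchanged from above: the pointer to look for, e.g. '#d1e145' -->
<xsl:variable name="target" as="xs:string" select="concat('#', $wID)"/>

<!-- Variant 1: split @target on whitespace and compare whole tokens.
     The '=' comparison is true if ANY token equals $target, so
     '#d1e145' no longer matches inside '#d1e1452' or '#d1e1459'. -->
<xsl:variable name="wSense1"
  select="$readDoc/descendant::spanGrp[@type = 'semantics']/span[@type = 'sense']
          [tokenize(normalize-space(@target), ' ') = $target]/@corresp"/>

<!-- Variant 2 (hypothetical name): the regex form. $target must be
     preceded by the start of the string or whitespace, and followed by
     whitespace or the end of the string. -->
<xsl:variable name="wSense1Regex"
  select="$readDoc/descendant::spanGrp[@type = 'semantics']/span[@type = 'sense']
          [matches(@target, concat('(^|\s)', $target, '(\s|$)'))]/@corresp"/>
```

The tokenize() variant is probably the safer of the two, because $target is compared as a plain string. In the matches() variant $target becomes part of the regular expression, so an id containing a character such as "." would be interpreted as regex syntax. The same predicate would presumably have to replace every other contains(@target, $target) test in the stylesheet (e.g. for the <link> and translation <span> lookups).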

iljackb commented 5 years ago

Another erroneous byproduct of this is that the merged TEI dictionary contains 10 entries for "kañuu", but it should only have 2, as per the source document. However, unlike the problem above, the $target values for the two occurrences of "kañuu" ("d1e545" and "d1e555") do not occur inside any longer id string, so I'm not sure why this is happening.

               <item>
                  <graphic url="Aves-35.png"/>
                  <w xml:id="d1e545" xml:lang="mix">
                     <w xml:id="d1e546">kañuu</w>
                  </w>
                  <w xml:id="d1e548" xml:lang="es">
                     <w xml:id="d1e549">codorniz</w>
                     <w xml:id="d1e551">arlequín</w>
                  </w>
                  <linkGrp type="translation">
                     <link target="#1e545 #d1e548"/>
                  </linkGrp>
                  <spanGrp type="translation">
                     <span target="#1e545" xml:lang="en">Montezuma Quail</span>
                  </spanGrp>
                  <spanGrp type="semantics">
                     <span type="sense"
                           target="#1e545"
                           corresp="https://www.wikidata.org/wiki/Q1093509"/>
                     <span type="sense"
                           target="#1e545"
                           corresp="http://dbpedia.org/resource/Montezuma_quail"/>
.....
                  </spanGrp>
               </item>
               <item>
                  <graphic url="Aves-36.png"/>
                  <w xml:id="d1e555" xml:lang="mix">
                     <w xml:id="d1e556">kañuu</w>
                  </w>
                  <w xml:id="d1e558" xml:lang="es">
                     <w xml:id="d1e559">codorniz</w>
                     <w xml:id="d1e561">cotuí</w>
                  </w>
                  <linkGrp type="translation">
                     <link target="#d1e555 #d1e558"/>
                  </linkGrp>
                  <spanGrp type="translation">
                     <span target="#d1e555" xml:lang="en">Northern Bobwhite</span>
                  </spanGrp>
                  <spanGrp type="semantics">
                     <span type="sense"
                           target="#d1e555"
                           corresp="https://www.wikidata.org/wiki/Q142651"/>
                     <span type="sense"
                           target="#d1e555"
                           corresp="http://dbpedia.org/resource/Northern_bobwhite"/>
.....
                  </spanGrp>
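
One thing that may be worth checking here: in the first <item> as pasted above, the pointers read "#1e545" while the <w> has xml:id="d1e545". That may just be a copy error in this comment, but a small diagnostic stylesheet like the sketch below would list every pointer in a <span> or <link> @target that does not resolve to an xml:id in the same document. It assumes XSLT 2.0 and uses the namespace wildcard `*:` so that it does not depend on whether the file is in the TEI namespace.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Every xml:id in the document, prefixed with '#' so it can be
         compared directly with the pointers in @target -->
    <xsl:variable name="ids"
        select="for $i in //@xml:id return concat('#', $i)"/>
    <xsl:for-each select="//*:span/@target | //*:link/@target">
      <xsl:variable name="attr" select="."/>
      <!-- Keep only the whole tokens that match no xml:id -->
      <xsl:for-each
          select="tokenize(normalize-space($attr), ' ')[not(. = $ids)]">
        <xsl:value-of select="concat('unresolved pointer ', .,
            ' in @target=&quot;', $attr, '&quot;&#10;')"/>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Run against the source document, this prints one line per dangling pointer and nothing if all pointers resolve. It does not by itself explain the 10 entries, but it would rule broken pointers in or out as a contributing factor.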

What is working correctly is that in 2 of the 10 entries the distinction between the different birds is maintained (though I would eventually merge them into a single entry), e.g.:

         <entry xml:id="Northern_bobwhite">
            <form type="lemma">
               <orth xml:lang="mix">kañuu</orth>
               <pron xml:lang="mix" notation="ipa"/>
            </form>
            <gramGrp>
               <pos>noun</pos>
            </gramGrp>
            <sense corresp="https://www.wikidata.org/wiki/Q142651 http://dbpedia.org/resource/Northern_bobwhite">
               <usg type="domain" corresp="http://dbpedia.org/resource/Animal">Animal</usg>
               <usg type="domain" corresp="http://dbpedia.org/resource/Bird">Bird</usg>
               <xr type="hyponymOf">
                  <ref corresp="#bird" xml:lang="mix">saa</ref>
                  <ref type="sense" corresp="http://dbpedia.org/resource/Bird"/>
               </xr>
               <cit type="translation">
                  <form>
                     <orth xml:lang="en">Northern Bobwhite</orth>
                  </form>
               </cit>
               <cit type="translation">
                  <form>
                     <orth xml:lang="es">codorniz cotuí</orth>
                  </form>
               </cit>
            </sense>
         </entry>

iljackb commented 5 years ago

Another problem is a discrepancy between:

  1. the number of entries (extracted as separate dictionary files) generated by the extraction script (132);
  2. the number of entries in the dictionary generated by merging the separate files (695).

This is strange because the merge script should just merge the single files...