hipster-philology / nlp-pie-taggers

Extension for pie to include taggers with their models and pre/postprocessors
Mozilla Public License 2.0

Normalize input for the lasla model #29

Closed (PonteIneptique closed this issue 3 years ago)

PonteIneptique commented 4 years ago

I have seen diacritics on letters in the input, for example to mark stress. This is not good for our model, so I would basically go the unidecode route?
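
To make the idea concrete, here is a minimal sketch of diacritic stripping. It uses the standard library's `unicodedata` as a stand-in for `unidecode`, so it is not what the library actually does; the function name `strip_diacritics` is hypothetical and only for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (accents, macrons, breves) and keep the base letters.

    Hypothetical helper: a stdlib approximation of the "unidecode route".
    """
    # NFD splits a precomposed letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks, which we drop.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into its canonical composed form.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("amāre régere"))
```

One difference worth keeping in mind: this approach only removes marks, so text in another script stays in that script, whereas `unidecode` transliterates non-Latin scripts to ASCII.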

PonteIneptique commented 3 years ago

The current implementation is buggy: normalization happens before the Greek is thrown out, so Greek tokens get transliterated and then tagged as if they were Latin, resulting in:

<div type="fragment" corresp="adams:107" ana="#anus #culus">
  <bibl><author>Macrobius</author>, <title>Saturnales</title>, <biblScope>1.18-1.18</biblScope></bibl>
  <quote xml:lang="lat" source="urn:cts:latinLit:stoa0186.stoa001.thayer-lat1:1.18-1.18" type="chapter">
    <w ref="1.18" lemma="," pos="PUNC" msd="MORPH=empty">,</w>
    <w ref="1.18" lemma="augeo" pos="VER" msd="Numb=Sing|Mood=Imp|Tense=Pres|Voice=Act|Person=2">αὐγὴ</w>
    <w ref="1.18" lemma="des" pos="NOMcom" msd="Case=Acc|Numb=Sing">δ</w>
    <w ref="1.18" lemma="’" pos="PUNC" msd="MORPH=empty">’</w>
    <w ref="1.18" lemma="aspet" pos="ADJqua" msd="Case=Acc|Numb=Plur|Gend=Masc">ἄσπετος</w>
    <w ref="1.18" lemma="ex" pos="PRE" msd="MORPH=empty">ᾖ</w>
    <w ref="1.18" lemma="," pos="PUNC" msd="MORPH=empty">,</w>
    <w ref="1.18" lemma="ana" pos="NOMcom" msd="Case=Abl|Numb=Sing">ἀνὰ</w>
    <w ref="1.18" lemma="de" pos="PRE" msd="MORPH=empty">δὲ</w>
    <w ref="1.18" lemma="drosus" pos="NOMcom" msd="Case=Abl|Numb=Sing">δρόσῳ</w>
    <w ref="1.18" lemma="amphimigo" pos="ADJqua" msd="Case=Abl|Numb=Plur|Gend=Fem">ἀμφιμιγεῖσα</w>
    <w ref="1.18" lemma="marmo" pos="VER" msd="Mood=Imp|Tense=Pres|Voice=Dep">μαρμαίρῆ</w>
    <w ref="1.18" lemma="dinesinus" pos="VER" msd="Numb=Sing">δίνῆσιν</w>
    <w ref="1.18" lemma="elissomen" pos="NOMcom" msd="Case=Abl|Numb=Sing">ἑλισσομένη</w>
    <w ref="1.18" lemma="catus" pos="ADJqua" msd="Case=Nom|Numb=Sing|Gend=Fem">κατὰ</w>
    <w ref="1.18" lemma="puclo" pos="NOMcom" msd="Case=Acc|Numb=Sing">κύκλον</w>
    <w ref="1.18" lemma="," pos="PUNC" msd="MORPH=empty">,</w>
    <w ref="1.18" lemma="prosthe" pos="ADV" msd="Deg=Pos">πρόσθε</w>
    <w ref="1.18" lemma="theuus" pos="ADJqua" msd="Case=Abl|Numb=Sing|Gend=MascNeut">θεοῦ</w>
    <w ref="1.18" lemma="·" pos="PUNC" msd="MORPH=empty">·</w>
    <w ref="1.18" lemma="Soster" pos="NOMcom" msd="Case=Nom|Numb=Sing">ζωστὴρ</w>
    <w ref="1.18" lemma="des" pos="NOMcom" msd="Case=Nom|Numb=Sing">δ</w>
    <w ref="1.18" lemma="’" pos="PUNC" msd="MORPH=empty">’</w>
    <w ref="1.18" lemma="ar" pos="VER" msd="Numb=Sing">ἄρ</w>
    <w ref="1.18" lemma="’" pos="PUNC" msd="MORPH=empty">’</w>
    <w ref="1.18" lemma="upor" pos="NOMcom" msd="Case=Abl|Numb=Sing">ὑπὸ</w>
    <w ref="1.18" lemma="sternon" pos="NOMcom" msd="Case=Voc|Numb=Sing">στέρνων</w>
    <w ref="1.18" lemma="ametretus" pos="ADJqua" msd="Case=Acc|Numb=Sing|Gend=Masc|Deg=Pos">ἀμετρήτων</w>
    <w ref="1.18" lemma="phainetae" pos="NOMcom" msd="Case=Gen|Numb=Sing">φαίνεται</w>
    <w ref="1.18" lemma="oceanus1" pos="NOMcom" msd="Case=Acc|Numb=Sing">ὠκεανοῦ</w>
    <w ana="#anus #culus" ref="1.18" lemma="culus" pos="NOMcom" msd="Case=Acc|Numb=Plur">κύκλος</w>
    <w ref="1.18" lemma="," pos="PUNC" msd="MORPH=empty">,</w>
    <w ref="1.18" lemma="mega" pos="NOMcom" msd="Case=Nom|Numb=Sing">μέγα</w>
    <w ref="1.18" lemma="thauma" pos="NOMcom" msd="Case=Nom|Numb=Sing">θαῦμα</w>
    <w ref="1.18" lemma="idestelus" pos="NOMcom" msd="Case=Gen|Numb=Sing">ἰδέσθαι</w>
    <w ref="1.18" lemma="." pos="PUNC" msd="MORPH=empty">.</w>
  </quote>
</div>
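
The ordering problem could be avoided along these lines. This is only a sketch under assumptions, not the fix that went into the library: `is_greek`, `strip_diacritics` and `prepare_tokens` are hypothetical names, Greek is detected through Unicode character names, and "throwing out" is modelled as simply dropping the token.

```python
import unicodedata


def is_greek(token: str) -> bool:
    """True if any character's Unicode name starts with GREEK (hypothetical check)."""
    return any(unicodedata.name(ch, "").startswith("GREEK") for ch in token)


def strip_diacritics(text: str) -> str:
    """Drop combining marks; a stdlib stand-in for the normalization step."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def prepare_tokens(tokens):
    """Filter Greek tokens first, then normalize only what remains."""
    return [strip_diacritics(tok) for tok in tokens if not is_greek(tok)]


sample = ["αὐγὴ", "régere", ",", "κύκλος"]
print(prepare_tokens(sample))
```

Because the script check runs on the raw token, nothing Greek reaches the normalizer, so it can no longer be turned into Latin-looking input for the tagger.
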
PonteIneptique commented 3 years ago

Should have been fixed in 0.0.27