giellalt / shared-mul

Shared multilingual linguistic resources
GNU General Public License v3.0

Phone number analyser for all languages #2

Open ilm024 opened 2 months ago

ilm024 commented 2 months ago

We lack a phone number analyser for all languages, either in shared-smi or in shared-mul.

This is how it currently looks in Lule Sámi, where Swedish phone numbers are especially challenging: because they begin with 0, they end up flagged as "typos":

"<tel.>"
        "tel" N <smj> <smj> Sem/Obj-el ABBR Gram/TNumAbbr Sg Gen <W:0.0> SELECT:3805 SUBSTITUTE:4355 SUBSTITUTE:4354
;       "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0> SELECT:3805
;       "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Acc <W:0.0> SELECT:3805
;       "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0> REMOVE:3661
;       "." CLB <W:0.0> "<.>"
;               "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0> "<tel>" REMOVE:2110:longest-match
;       "." CLB <W:0.0> "<.>"
;               "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Acc <W:0.0> "<tel>" REMOVE:2110:longest-match
;       "." CLB <W:0.0> "<.>"
;               "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Gen <W:0.0> "<tel>" REMOVE:2110:longest-match
;       "." CLB <W:0.0> "<.>"
;               "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0> "<tel>" REMOVE:2110:longest-match
;       "." CLB <W:0.0> "<.>"
;               "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Pl Nom <W:0.0> "<tel>" REMOVE:2110:longest-match
: 
"<073-786>"             073-786 →  -73-786      →  73-786
        "-73-786" Num Arab Sg Nom <W:32.5909> <WA:22.5909> <spelled> "-73-786"S PROTECT:1251 SELECT:1301 ADD:6:spelled SELECT:1455 &SUGGESTWF &typo
        "73-786" Num Arab Sg Nom <W:32.5909> <WA:22.5909> <spelled> "73-786"S PROTECT:1251 SELECT:1301 ADD:6:spelled SELECT:1455 &SUGGESTWF &typo
;       "-73-786" Num Arab Sg Ela Attr <W:32.5909> <WA:22.5909> <spelled> "-73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "-73-786" Num Arab Sg Gen <W:32.5909> <WA:22.5909> <spelled> "-73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "-73-786" Num Arab Sg Ine Attr <W:32.5909> <WA:22.5909> <spelled> "-73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "-73-786" Num Arab Sg Ill Attr <W:32.5909> <WA:22.5909> <spelled> "-73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "73-786" Num Arab Sg Ela Attr <W:32.5909> <WA:22.5909> <spelled> "73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "73-786" Num Arab Sg Gen <W:32.5909> <WA:22.5909> <spelled> "73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "73-786" Num Arab Sg Ine Attr <W:32.5909> <WA:22.5909> <spelled> "73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "73-786" Num Arab Sg Ill Attr <W:32.5909> <WA:22.5909> <spelled> "73-786"S PROTECT:1251 SELECT:1301 &SUGGESTWF &typo ADD:6:spelled SELECT:1455
;       "073-786" ? SELECT:1301
: 
"<58>"
        "58" Num Arab Sg Nom <W:0.0> SELECT:1454:Arab SELECT:1456
;       "58" Num Arab Sg Ela Attr <W:0.0> SELECT:1454:Arab SELECT:1456
;       "58" Num Arab Sg Gen <W:0.0> SELECT:1454:Arab SELECT:1456
;       "58" Num Arab Sg Ill Attr <W:0.0> SELECT:1454:Arab SELECT:1456
;       "58" Num Arab Sg Ine Attr <W:0.0> SELECT:1454:Arab SELECT:1456
;       "58" Num Sem/ID <W:0.0> SELECT:1454:Arab
;       "58" A Arab Ord Attr CLBfinal <W:0.0> REMOVE:2067:spurious-adj-reading
: 
"<10.>"
        "10" A <smj> <smj> Arab Ord Attr <W:0.0> SUBSTITUTE:4354 SUBSTITUTE:4353

snomos commented 2 months ago

The first part of the phone number is simply not recognised by the analyser, so the spell checker is used to generate "correction suggestions"; cf. <spelled>.

flammie commented 2 months ago

Technically it is fairly easy to build a lexicon or regular expressions from the phone number formats. The biggest problem has been that putting this in shared becomes problematic for one use or another. For example, there is already a commented-out phone number lexicon in shared-smi: https://github.com/giellalt/shared-smi/blob/main/src/fst/stems/arabic_roman_digits.lexc#L354-L368 (it is too old for me to find out who commented it out, but maybe someone knows the background?)
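
To make the "regular expression" option concrete, here is a minimal sketch in Python. It is not the lexicon linked above and not what ended up in shared-mul; the pattern `PHONE_RE` and the helper `find_phone_numbers` are hypothetical, and the format they accept (a prefix starting with 0, a hyphen or space, then space-separated digit groups) is an assumption based only on the example number in this issue.

```python
import re

# Hypothetical pattern for illustration only.
# Assumed format: a prefix of 2-4 digits starting with 0, then a hyphen or
# a space, then one group of 2-3 digits followed by 1-3 further groups of
# 2-3 digits, each separated by a single space.
PHONE_RE = re.compile(r"(?<!\d)0\d{1,3}[- ]\d{2,3}(?: \d{2,3}){1,3}(?!\d)")


def find_phone_numbers(text: str) -> list[str]:
    """Return every substring of `text` that matches the assumed format."""
    return [match.group(0) for match in PHONE_RE.finditer(text)]


if __name__ == "__main__":
    print(find_phone_numbers("tel. 073-786 58 10"))
```

The point of the sketch is that the whole number, spaces included, is matched as one unit, so the leading 0 never reaches the spell checker as a separate token. Real phone number formats vary far more than this pattern allows.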

snomos commented 2 months ago

Technically it is fairly easy to build a lexicon or regular expressions from the phone number formats. The biggest problem has been that putting this in shared becomes problematic for one use or another

Just ignore old, commented-out things. We need a shared phone number parser, so it would be great if you could add one to shared-mul.

And the phone numbers must of course be tagged so that they are easy to disambiguate, or to remove from the FST entirely.

flammie commented 2 months ago

It is in shared-mul and lang-smj now:

$ echo tel. 073-786 58 10 | hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst 
"<tel.>"
    "." CLB <W:0.0> "<.>"
        "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0> "<tel>"
    "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0>
    "." CLB <W:0.0> "<.>"
        "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Acc <W:0.0> "<tel>"
    "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Acc <W:0.0>
    "." CLB <W:0.0> "<.>"
        "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Gen <W:0.0> "<tel>"
    "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Gen <W:0.0>
    "." CLB <W:0.0> "<.>"
        "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0> "<tel>"
    "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0>
    "." CLB <W:0.0> "<.>"
        "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0> "<tel>"
    "." CLB <W:0.0> "<.>"
        "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Gen <W:0.0> "<tel>"
    "." CLB <W:0.0> "<.>"
        "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Pl Nom <W:0.0> "<tel>"
    "." CLB <W:0.0> "<.>"
        "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Attr <W:0.0> "<tel>"
: 
"<073-786 58 10>"
    "073-786 58 10" Num Arab TEL <W:0.0>
:\n
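
Since the number now comes out as a single cohort with a TEL tag, downstream tools can pick it out easily. A rough Python sketch of that follows, assuming the output format shown above; the function `tel_wordforms` and the regex `READING_RE` are hypothetical helpers, not part of any GiellaLT tool.

```python
import re

# A reading line is indented and starts with a quoted lemma followed by tags.
# Lines starting with ";" (removed readings in trace output) do not match.
READING_RE = re.compile(r'^\s+"(.*?)"\s+(.*)$')


def tel_wordforms(cg_output: str, tag: str = "TEL") -> list[str]:
    """Return the wordform of each cohort that has a reading carrying `tag`."""
    found = []
    wordform = None
    for line in cg_output.splitlines():
        stripped = line.rstrip()
        # Cohort header, e.g. "<073-786 58 10>"
        if stripped.startswith('"<') and stripped.endswith('>"'):
            wordform = stripped[2:-2]
            continue
        match = READING_RE.match(line)
        if match and wordform is not None and tag in match.group(2).split():
            found.append(wordform)
            wordform = None  # report each cohort only once
    return found


sample = '''"<tel.>"
    "tel" N Sem/Obj-el ABBR Gram/TNumAbbr Sg Nom <W:0.0>
: 
"<073-786 58 10>"
    "073-786 58 10" Num Arab TEL <W:0.0>
'''

if __name__ == "__main__":
    print(tel_wordforms(sample))
```

In a real pipeline this selection would more naturally be a constraint grammar rule on the TEL tag; the Python version only shows that the tag makes the cohort trivial to identify.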