giellalt / lang-sme

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Northern Sami language
https://giellalt.uit.no
GNU General Public License v3.0

wrong tokenization of date ranges #32

Closed by lynnda-hill 2 years ago

lynnda-hill commented 3 years ago

In the following example we want to tokenize "7.-11.5." as a date (range).

Ságastallan lea rabas neahtas 7.-11.5., muhto dárbbu mielde ságastallanáiggi sáhttá guhkidit.

Instead it is tokenized in the following way:

"<7.-11.5>"
    "7.-11.5" Num Sem/ID #5->5
"<.>"
    "." CLB &no-space-after-punct-mark #6->6 ID:6 R:RIGHT:8 ADD:9780:no-space-after-punct ADD:9780:no-space-after-punct no-space-after-punct-mark
    "." CLB ". ,"S &no-space-after-punct-mark &SUGGESTWF #6->6 ID:6 R:RIGHT:8 ADD:9780:no-space-after-punct COPY:9797:no-space-after-punct-sugg no-space-after-punct-mark
"<,>"
    "," CLB &no-space-after-punct-mark #1->1 ID:8 ADD:9790:no-space-after-punct-link ADD:9790:no-space-after-punct-link no-space-after-punct-mark
    "," CLB &LINK #1->1 ID:8 ADD:9790:no-space-after-punct-link ADDRELATION(RIGHT):9795:no-space-after-punct-rel ADD:9790:no-space-after-punct-link
:

The problem is that the trailing "." is not treated as part of the date, so it is tokenized separately as a clause boundary (CLB) and flagged as a punctuation mark with no space after it.

flammie commented 2 years ago
echo 'Ságastallan lea rabas neahtas 7.-11.5., muhto dárbbu mielde ságastallanáiggi sáhttá guhkidit.' | tools/grammarcheckers/modes/smegramrelease-dev.mode 
"<Ságastallan>"
    "ságastallan" N <NomGenSg> Sem/Act Sg Nom <W:0.0> <firstCohort> @SUBJ> #1->1
    "ságastallat" Ex/V TV Der/NomAct N <NomGenSg> Sg Nom <W:0.0> <firstCohort> @SUBJ> #1->1
    "ságastit" Ex/V TV Der/alla Ex/V Der/NomAct N <NomGenSg> Sg Nom <W:0.0> <firstCohort> @SUBJ> #1->1
: 
"<lea>"
    "leat" <mv> V <copula> <TH-Nom-Any> <mielde> <OR-Loc-HumGroup> <OR-eret-Plc> <dušše><TH-Inf> <árvvus> <LO-Loc-johtu><DE-Ill-Plc> <AT-Loc-Mat> <AT-Abe-Any> <AT-Nom-Any> <AT-Nom-Adj><EX-Ill-Ani> <PO-Loc-Hum> <PO-Gen-Hum> <MA-mielde-Any> <MA-Adv-Manner> <XT-Gen-Measr> <LO-maŋŋil-Time> <LO-Acc-Time> <LO-Loc-Time> <CO-Com-Ani> <ID-Nom-Any> <TH-Nom-Any><RO-Ess-Any><EX-Ill-Any> <EX-Ill-Ani><TH-Nom-Adj> <EX-Ill-Ani> <TH-Nom-Obj><RE-Ill-Ani> <LO-Loc-Any> <AktioEss> <BE-Ill-Ani><PU-Ess-Any> <RO-Ess-Any><PU-Ill-Act> <RO-Ess-Any> <Inf> IV Ind Prs Sg3 <W:0.0> @+FMAINV #2->2
: 
"<rabas>"
    "rabas" A Sem/Hum Attr <W:0.0> @>N #3->3
    "rabas" Adv <W:0.0> @<ADVL #3->3
: 
"<neahtas>"
    "neahtta" N Sem/Dummytag Sg Loc <W:0.0> @<ADVL #4->4
: 
"<7.-11.5.>"
    "7.-11.5" Num Sem/Date Sg Gen <W:0.0> <NoSpaceAfterPunctMark> @>N #5->5
"<,>"
    "," CLB <W:0.0> #6->6
: 
"<muhto>"
    "muhto" CC <W:0.0> @CVP #7->7
: 
"<dárbbu>"
    "dárbu" N <TH-Inf> <TH-Ill-Any> Sem/Perc-phys Sg Gen <W:0.0> @>P #8->9
: 
"<mielde>"
    "mielde" Po <W:0.0> @ADVL> #9->9
: 
"<ságastallanáiggi>"
    "ságastallanáigi" N Sem/Time Sg Acc <W:0.0> <cohort-with-dynamic-compound> @ADVL> #10->10
    "ságastallanáigi" N Sem/Time Sg Gen <W:0.0> <cohort-with-dynamic-compound> @ADVL> #10->10
: 
"<sáhttá>"
    "sáhttit" <aux> V <TH-Acc-Obj><XT-Acc-Measure> <DE-Ill-Plc> <Inf> IV Ind Prs Sg3 <W:0.0> @+FAUXV #11->11
: 
"<guhkidit>"
    "guhkidit" <mv> V <TH-Acc-Any><SO-Loc-Any><DE-Ill-Any> <TH-Acc-Any><DE-Ill-*Ani> <PA-Acc-Any><XT-Com-Measure> <PA-Acc-Any> TV Inf <W:0.0> @-FMAINV #12->12
"<.>"
    "." CLB <W:0.0> <LastCohort> #13->13
:

The issue was that the date regexp required a leading 0 for day and month numbers below 10. Making the leading 0 optional fixes this example, but it may cause some extra ambiguity.
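To illustrate the difference, here is a minimal Python sketch. It is not the actual lang-sme date rule (which is not shown in this thread); the patterns and the names `STRICT`, `LOOSE` and `matches` are hypothetical and only mimic a "D.-D.M." date range with and without an obligatory leading 0.

```python
import re

# Hypothetical patterns for illustration only; NOT the real lang-sme regexp.
# Strict variant: single-digit days and months must be written 01..09.
DAY_STRICT = r"(?:0[1-9]|[12][0-9]|3[01])"
MONTH_STRICT = r"(?:0[1-9]|1[0-2])"

# Loose variant: the leading 0 is optional, so "7" and "07" are both accepted.
DAY_LOOSE = r"(?:0?[1-9]|[12][0-9]|3[01])"
MONTH_LOOSE = r"(?:0?[1-9]|1[0-2])"


def date_range(day: str, month: str) -> re.Pattern:
    """Build a pattern for 'D.-D.M.': start day, end day, month, each followed by a dot."""
    return re.compile(rf"{day}\.-{day}\.{month}\.")


STRICT = date_range(DAY_STRICT, MONTH_STRICT)
LOOSE = date_range(DAY_LOOSE, MONTH_LOOSE)


def matches(pattern: re.Pattern, token: str) -> bool:
    """True if the whole token, including the final dot, is a date range."""
    return pattern.fullmatch(token) is not None


if __name__ == "__main__":
    for token in ["7.-11.5.", "07.-11.05.", "1.-2.3."]:
        print(token, "strict:", matches(STRICT, token), "loose:", matches(LOOSE, token))
```

With the strict variant, "7.-11.5." is rejected as a whole, which corresponds to the original report where the final "." was split off. The loose variant accepts it, but it also accepts short strings such as "1.-2.3.", which need not be dates in every context; that is the kind of extra ambiguity mentioned above.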