lynnda-hill closed this issue 2 years ago
echo 'Ságastallan lea rabas neahtas 7.-11.5., muhto dárbbu mielde ságastallanáiggi sáhttá guhkidit.' | tools/grammarcheckers/modes/smegramrelease-dev.mode
"<Ságastallan>"
"ságastallan" N <NomGenSg> Sem/Act Sg Nom <W:0.0> <firstCohort> @SUBJ> #1->1
"ságastallat" Ex/V TV Der/NomAct N <NomGenSg> Sg Nom <W:0.0> <firstCohort> @SUBJ> #1->1
"ságastit" Ex/V TV Der/alla Ex/V Der/NomAct N <NomGenSg> Sg Nom <W:0.0> <firstCohort> @SUBJ> #1->1
:
"<lea>"
"leat" <mv> V <copula> <TH-Nom-Any> <mielde> <OR-Loc-HumGroup> <OR-eret-Plc> <dušše><TH-Inf> <árvvus> <LO-Loc-johtu><DE-Ill-Plc> <AT-Loc-Mat> <AT-Abe-Any> <AT-Nom-Any> <AT-Nom-Adj><EX-Ill-Ani> <PO-Loc-Hum> <PO-Gen-Hum> <MA-mielde-Any> <MA-Adv-Manner> <XT-Gen-Measr> <LO-maŋŋil-Time> <LO-Acc-Time> <LO-Loc-Time> <CO-Com-Ani> <ID-Nom-Any> <TH-Nom-Any><RO-Ess-Any><EX-Ill-Any> <EX-Ill-Ani><TH-Nom-Adj> <EX-Ill-Ani> <TH-Nom-Obj><RE-Ill-Ani> <LO-Loc-Any> <AktioEss> <BE-Ill-Ani><PU-Ess-Any> <RO-Ess-Any><PU-Ill-Act> <RO-Ess-Any> <Inf> IV Ind Prs Sg3 <W:0.0> @+FMAINV #2->2
:
"<rabas>"
"rabas" A Sem/Hum Attr <W:0.0> @>N #3->3
"rabas" Adv <W:0.0> @<ADVL #3->3
:
"<neahtas>"
"neahtta" N Sem/Dummytag Sg Loc <W:0.0> @<ADVL #4->4
:
"<7.-11.5.>"
"7.-11.5" Num Sem/Date Sg Gen <W:0.0> <NoSpaceAfterPunctMark> @>N #5->5
"<,>"
"," CLB <W:0.0> #6->6
:
"<muhto>"
"muhto" CC <W:0.0> @CVP #7->7
:
"<dárbbu>"
"dárbu" N <TH-Inf> <TH-Ill-Any> Sem/Perc-phys Sg Gen <W:0.0> @>P #8->9
:
"<mielde>"
"mielde" Po <W:0.0> @ADVL> #9->9
:
"<ságastallanáiggi>"
"ságastallanáigi" N Sem/Time Sg Acc <W:0.0> <cohort-with-dynamic-compound> @ADVL> #10->10
"ságastallanáigi" N Sem/Time Sg Gen <W:0.0> <cohort-with-dynamic-compound> @ADVL> #10->10
:
"<sáhttá>"
"sáhttit" <aux> V <TH-Acc-Obj><XT-Acc-Measure> <DE-Ill-Plc> <Inf> IV Ind Prs Sg3 <W:0.0> @+FAUXV #11->11
:
"<guhkidit>"
"guhkidit" <mv> V <TH-Acc-Any><SO-Loc-Any><DE-Ill-Any> <TH-Acc-Any><DE-Ill-*Ani> <PA-Acc-Any><XT-Com-Measure> <PA-Acc-Any> TV Inf <W:0.0> @-FMAINV #12->12
"<.>"
"." CLB <W:0.0> <LastCohort> #13->13
:
The issue was that the date regexp required a leading 0 for day and month numbers below 10. Relaxing this may cause some extra ambiguity.
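To illustrate the difference, here is a minimal Python sketch. It does not reproduce the project's actual date regexp, which is not shown in this thread. The pattern names, the assumed range format `D.-D.M.`, and the helper `date_range_re` are all hypothetical.

```python
import re

# Hypothetical building blocks; the project's real rule is not shown here.
STRICT_DAY = r"(?:0[1-9]|[12][0-9]|3[01])"     # leading 0 obligatory below 10
STRICT_MONTH = r"(?:0[1-9]|1[0-2])"
LENIENT_DAY = r"(?:0?[1-9]|[12][0-9]|3[01])"   # leading 0 optional
LENIENT_MONTH = r"(?:0?[1-9]|1[0-2])"


def date_range_re(day: str, month: str) -> re.Pattern:
    """Build a pattern for ranges written as 'D.-D.M.' (assumed format)."""
    return re.compile(rf"{day}\.-{day}\.{month}\.")


strict = date_range_re(STRICT_DAY, STRICT_MONTH)
lenient = date_range_re(LENIENT_DAY, LENIENT_MONTH)

# The strict pattern rejects the range from the example sentence.
print(bool(strict.fullmatch("7.-11.5.")))    # False
print(bool(strict.fullmatch("07.-11.05.")))  # True
# The lenient pattern accepts both spellings.
print(bool(lenient.fullmatch("7.-11.5.")))   # True
```

One plausible source of the extra ambiguity: once the leading 0 is optional, short strings such as "1.2." can also match a single-date pattern even when they are meant as something else, for example a section number.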
In the following example we want to tokenize "7.-11.5." as a date (range).
Ságastallan lea rabas neahtas 7.-11.5., muhto dárbbu mielde ságastallanáiggi sáhttá guhkidit.
Instead it is tokenized in the following way:
"<7.-11.5>"
"7.-11.5" Num Sem/ID #5->5
"<.>"
"." CLB &no-space-after-punct-mark #6->6 ID:6 R:RIGHT:8 ADD:9780:no-space-after-punct ADD:9780:no-space-after-punct
no-space-after-punct-mark
"." CLB ". ,"S &no-space-after-punct-mark &SUGGESTWF #6->6 ID:6 R:RIGHT:8 ADD:9780:no-space-after-punct COPY:9797:no-space-after-punct-sugg
no-space-after-punct-mark
"<,>"
"," CLB &no-space-after-punct-mark #1->1 ID:8 ADD:9790:no-space-after-punct-link ADD:9790:no-space-after-punct-link
no-space-after-punct-mark
"," CLB &LINK #1->1 ID:8 ADD:9790:no-space-after-punct-link ADDRELATION(RIGHT):9795:no-space-after-punct-rel ADD:9790:no-space-after-punct-link
:
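The problem is that the final "." is not recognized as part of the date, so it is split off and tokenized as a sentence boundary (CLB). This in turn triggers the no-space-after-punct-mark error before the ",".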
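The effect can be reproduced with a toy tokenizer. This is only a rough Python analogy and not the tokenizer the grammar checker actually uses. The function `tokenize`, both date patterns, and the fallback rules are assumptions made for illustration.

```python
import re

# Hypothetical date-range patterns (format 'D.-D.M.'), for illustration only.
STRICT_DATE = (r"(?:0[1-9]|[12]\d|3[01])\.-(?:0[1-9]|[12]\d|3[01])\."
               r"(?:0[1-9]|1[0-2])\.")
LENIENT_DATE = (r"(?:0?[1-9]|[12]\d|3[01])\.-(?:0?[1-9]|[12]\d|3[01])\."
                r"(?:0?[1-9]|1[0-2])\.")


def tokenize(text: str, date_pattern: str) -> list[str]:
    """Split text into tokens, trying the date pattern before the fallbacks."""
    token_re = re.compile(
        rf"{date_pattern}"       # 1. date range, including its final '.'
        r"|\d[\d.\-]*\d|\d"      # 2. other digit strings, never ending in '.'
        r"|\w+"                  # 3. words
        r"|[^\w\s]"              # 4. single punctuation marks
    )
    return token_re.findall(text)


fragment = "neahtas 7.-11.5., muhto"

# With the strict pattern the date rule fails, so the final '.' is split off.
print(tokenize(fragment, STRICT_DATE))
# ['neahtas', '7.-11.5', '.', ',', 'muhto']

# With the lenient pattern the final '.' stays inside the date token.
print(tokenize(fragment, LENIENT_DATE))
# ['neahtas', '7.-11.5.', ',', 'muhto']
```

In the first result the stray "." sits directly in front of the ",", which is the configuration flagged in the trace above.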