giellalt / lang-sme

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Northern Sami language
GNU General Public License v3.0

tokenization: space before second elements of compounds causes problems for grammar checking (Bugzilla Bug 2672) #449

Open albbas opened 4 years ago

albbas commented 4 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 2672

Date: 2020-09-01T13:05:19+02:00
From: Linda Wiechetek <>
To: Kevin Brubeck Unhammer <<unhammer+apertium>>
CC: lene.antonsen, sjur.n.moshagen, thomas.omma, tommi.pirinen, trond.trosterud

Last updated: 2020-09-24T14:40:48+02:00

albbas commented 4 years ago

Comment 13969

Date: 2020-09-01 13:05:19 +0200 From: Linda Wiechetek <>

This regards the tokenization of elements of potential compounds. When split, the second part is tokenized with a space in front of it, as in "< lávlui>". This causes problems when evaluating error detection/correction, because the error markup (which starts at "lávlui", not " lávlui") and the error detected by the grammarchecker (which starts at the space) do not match.

I think this also caused problems when using the grammarchecker in Word: the space is deleted when the error is corrected, so the word merges with the previous one even though the two are not a compound.
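The offset mismatch and the merge-on-correction effect can be illustrated with a small Python sketch. This is not the evaluation or Word add-in code; the helper names and the placeholder replacement string "XXX" are made up for illustration, and the only thing taken from the report is the example sentence and the two wordforms.

```python
def span_of(text, token):
    """Return (start, end) character offsets of the first occurrence of token."""
    start = text.index(token)
    return start, start + len(token)


def apply_correction(text, span, replacement):
    """Replace the characters covered by span with replacement."""
    start, end = span
    return text[:start] + replacement + text[end:]


sentence = "Koarra lávlui ja vihaheapmi sáhtii álggahuvvot."

# Span covered by the manual error markup: the word only.
markup_span = span_of(sentence, "lávlui")
# Span reported for the cohort "< lávlui>": the preceding space is included.
checker_span = span_of(sentence, " lávlui")

# The two spans differ in their start offset, so a strict comparison fails.
spans_match = markup_span == checker_span

# Replacing the checker's span swallows the space and glues the words together.
merged = apply_correction(sentence, checker_span, "XXX")
# Replacing the markup span keeps the space in place.
separate = apply_correction(sentence, markup_span, "XXX")
```

With the space inside the wordform, `merged` begins with "KoarraXXX", while `separate` begins with "Koarra XXX".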

echo "Koarra lávlui ja vihaheapmi sáhtii álggahuvvot." | tools/grammarcheckers/modes/trace-smegramrelease-dev.mode | less

"<Koarra>"
	"koarra" N Sem/Group_Hum_Org Sg Nom @SUBJ> MAP:23567
"< lávlui>"
	"lávlut" V <RE-Ill-Ani> TV Ind Prt Sg3 Err/Confused-NomAgIll SUBSTITUTE:2460 SUBSTITUTE:2507 SUBSTITUTE:2675 SUBSTITUTE:3505 SUBSTITUTE:4134 SUBSTITUTE:4137 @+FMAINV MAP:16050:r406 SUBSTITUTE:24637:muitalit &real-NomAgIll-PrtSg3 SUBSTITUTE:3957:SubV=mv ADD:6085:real-NomAgIll-PrtSg3 ADD:6091:real-NomAgIll-PrtSg3 ADD:6085:real-NomAgIll-PrtSg3 ADD:6091:real-NomAgIll-PrtSg3 real-NomAgIll-PrtSg3

albbas commented 4 years ago

Comment 13970

Date: 2020-09-02 08:19:24 +0200 From: Sjur Nørstebø Moshagen <>

This is done by mwe-split, a tool written by Kevin. I believe that he is the best one to solve this.

Ideally I think the output should be as follows after mwe-dis and mwe-split:

input: "Koarra lávlui ..."


"<Koarra>"
	"koarra" N Sem/Group_Hum_Org Sg Nom @SUBJ>
: 
"<lávlui>"
	"lávlut" V

with the space after : like in the regular tokenisation.

Kevin, wdyt?

albbas commented 4 years ago

Comment 13971

Date: 2020-09-02 11:50:06 +0200 From: Kevin Brubeck Unhammer <<unhammer+apertium>>

Previously, the split point has always been after the space ("" "").

mwe-split had code to put that into the between-blanks.

I'm now adding code to do the same if the split point is before the space as well.
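What "putting the space into the between-blanks" amounts to can be sketched in Python. This is only an illustration of the idea, not mwe-split's actual code (which lives in vislcg3); the function name is hypothetical, and the sketch assumes the stream format seen in this thread, where a cohort starts with a `"<wordform>"` line and a blank between cohorts is a line starting with `:`.

```python
import re

# A cohort line consists solely of the quoted wordform, e.g. "< lávlui>".
WORDFORM_LINE = re.compile(r'^"<(.*)>"$')


def move_leading_space_to_blank(lines):
    """Move leading spaces of a wordform into a ':' blank line before the cohort.

    Hypothetical helper for illustration; reading lines pass through unchanged.
    """
    out = []
    for line in lines:
        match = WORDFORM_LINE.match(line)
        if match:
            form = match.group(1)
            stripped = form.lstrip(" ")
            # Only rewrite when there is a leading space and a word remains.
            if stripped and stripped != form:
                blank = form[: len(form) - len(stripped)]
                out.append(":" + blank)
                out.append('"<' + stripped + '>"')
                continue
        out.append(line)
    return out


before = [
    '"<Koarra>"',
    '\t"koarra" N',
    '"< lávlui>"',
    '\t"lávlut" V',
]
after = move_leading_space_to_blank(before)
```

Here `after` contains a `: ` blank line followed by the cohort `"<lávlui>"`, so the wordform itself no longer begins with a space.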

albbas commented 4 years ago

Comment 13972

Date: 2020-09-02 11:56:52 +0200 From: Kevin Brubeck Unhammer <<unhammer+apertium>>

(committed r14806 to vislcg3 svn; should be in new package tomorrow)

albbas commented 3 years ago

Comment 14021

Date: 2020-09-24 14:36:09 +0200 From: Linda Wiechetek <>

So it is fixed? Cool! Will try it later today.

I found a new example today (haven't updated vislcg3 yet) if anyone wants to test:

Raporta buktá ovdan daid guovddášgáibádusaid, maid sámi searvvuš ja sámi konfereansat leat buktán ovdan julggaštusaineaset ja sámedikkiid iežaset barggus.

The problem is between sámi and konfereansat.

"<sámi>"
	"sápmi" N Sem/Hum_Lang Sg Gen @>N MAP:22549:r227 #11->11
	"sápmi" N Err/Orth Sem/Hum_Lang Sg Gen @>N MAP:22549:r227 #11->11
;	"sápmi" N Sem/Hum_Lang Sg Acc REMOVE:17613:r2255
;	"sápmi" N Err/Orth Sem/Hum_Lang Sg Acc REMOVE:17613:r2255
"< konfereansat>"
	"konferánsa" v2 N Sem/Event Sg Nom PxSg2 @SUBJ> MAP:23149:r3314 &real-PlNomPxSg2-PlNom #12->12 ADD:6286:real-PlNomPxSg2-PlNom ADD:6310:real-PlNomPxSg2-PlNom real-PlNomPxSg2-PlNom
	"konferánsa" v2 N Pl Sem/Event Nom @SUBJ> MAP:23149:r3314 &SUGGEST #12->12 ADD:6286:real-PlNomPxSg2-PlNom ADD:6310:real-PlNomPxSg2-PlNom COPY:6319:real-PlNomPxSg2-PlNom konferánsa+v2+N+Pl+Nom konfereanssat
;	"sámekonferánsa" N Err/Orth Sem/Event Sg Nom PxSg2 Err/SpaceCmp REMOVE:2264:GenFirst

albbas commented 3 years ago

Comment 14022

Date: 2020-09-24 14:40:48 +0200 From: Linda Wiechetek <>

Just tested it after getting a new vislcg3 and the previous sentence works!! thanks Kevin!

"<sámi>"
	"sápmi" N Sem/Hum_Lang Sg Gen @>N MAP:22549:r227 #11->11
	"sápmi" N Err/Orth Sem/Hum_Lang Sg Gen @>N MAP:22549:r227 #11->11
;	"sápmi" N Sem/Hum_Lang Sg Acc REMOVE:17613:r2255
;	"sápmi" N Err/Orth Sem/Hum_Lang Sg Acc REMOVE:17613:r2255
: 
"<konfereansat>"
	"konferánsa" v2 N Sem/Event Sg Nom PxSg2 @SUBJ> MAP:23149:r3314 &real-PlNomPxSg2-PlNom #12->12 ADD:6286:real-PlNomPxSg2-PlNom ADD:6310:real-PlNomPxSg2-PlNom real-PlNomPxSg2-PlNom
	"konferánsa" v2 N Pl Sem/Event Nom @SUBJ> MAP:23149:r3314 &SUGGEST #12->12 ADD:6286:real-PlNomPxSg2-PlNom ADD:6310:real-PlNomPxSg2-PlNom COPY:6319:real-PlNomPxSg2-PlNom konferánsa+v2+N+Pl+Nom konfereanssat
;	"sámekonferánsa" N Err/Orth Sem/Event Sg Nom PxSg2 Err/SpaceCmp REMOVE:2264:GenFirst
: 
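For anyone who wants to re-test other sentences, one rough way to check for leftover cases is to scan the trace output for cohorts whose wordform still begins with a space. The following Python sketch assumes cohort lines start with `"<...>"` at the beginning of a line, as in the traces in this thread; the function name is made up, and the two sample strings are shortened versions of the traces above.

```python
import re

# Cohort lines begin with the quoted wordform; reading lines are indented.
COHORT = re.compile(r'^"<(.*)>"')


def wordforms_with_leading_space(cg_output):
    """Return all wordforms in the CG output that begin with a space."""
    found = []
    for line in cg_output.splitlines():
        match = COHORT.match(line)
        if match and match.group(1).startswith(" "):
            found.append(match.group(1))
    return found


old_output = '"<sámi>"\n\t"sápmi" N\n"< konfereansat>"\n\t"konferánsa" N\n'
new_output = '"<sámi>"\n\t"sápmi" N\n: \n"<konfereansat>"\n\t"konferánsa" N\n'

old_hits = wordforms_with_leading_space(old_output)
new_hits = wordforms_with_leading_space(new_output)
```

An empty result for the new output means no cohort carries the space any more; it says nothing about whether the compound analysis itself is right.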

We just gotta see how the compounding works now.