Open albbas opened 4 years ago
Date: 2020-09-01 13:05:19 +0200
From: Linda Wiechetek <
This regards the tokenization of elements of potential compounds. When split, the second part is tokenized with a space in front of it, as in "< lávlui>". This causes problems when evaluating error detection/correction, because the error markup (which starts at "lávlui", not " lávlui") and the error detected by the grammarchecker (which starts at the space) do not match.
I think this also caused problems when using the grammarchecker in Word, since the space is deleted when the error is corrected, causing the word to merge with the previous one even though the two are not a compound.
echo "Koarra lávlui ja vihaheapmi sáhtii álggahuvvot." | tools/grammarcheckers/modes/trace-smegramrelease-dev.mode | less
"
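A minimal sketch of the mismatch described in this comment, assuming plain character offsets into the sentence. The helper `span_of` and the placeholder `SUGGESTION` are hypothetical and only for illustration; they are not part of the grammarchecker or the evaluation scripts:

```python
text = "Koarra lávlui ja vihaheapmi sáhtii álggahuvvot."

def span_of(sentence, form):
    """Return (begin, end) character offsets of the first occurrence of form."""
    begin = sentence.index(form)
    return begin, begin + len(form)

# Error markup covers the bare word: starts at offset 7.
markup_span = span_of(text, "lávlui")
# The checker reports the token with its leading space: starts at offset 6.
checker_span = span_of(text, " lávlui")

# The two spans end at the same place but start one character apart,
# so an exact-match comparison fails.
spans_match = markup_span == checker_span

# Replacing the checker's span swallows the space, so the (hypothetical)
# suggestion is glued onto the previous word.
begin, end = checker_span
merged = text[:begin] + "SUGGESTION" + text[end:]

print(markup_span, checker_span, spans_match)
print(merged)
```

The second print shows the Word symptom: the replacement text follows "Koarra" with no space between them.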
Date: 2020-09-02 08:19:24 +0200
From: Sjur Nørstebø Moshagen <
This is done by mwe-split, a tool written by Kevin. I believe that he is the best one to solve this.
Ideally I think the output should be as follows after mwe-dis and mwe-split:
input: "Koarra lávlui ..."
output:
"
with the space after : like in the regular tokenisation.
Kevin, wdyt?
Date: 2020-09-02 11:50:06 +0200
From: Kevin Brubeck Unhammer <<unhammer+apertium>>
Previously, the split point has always been after the space ("
mwe-split had code to put that into the between-blanks.
I'm now adding code to do the same if the split point is before the space as well.
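The idea of moving the space into the between-blanks, whichever side of the split it ends up on, can be sketched roughly as follows. This is not the mwe-split code from vislcg3; the function names are hypothetical, readings are omitted, and it assumes the convention that blanks between cohorts are written on lines starting with `:`:

```python
def peel_blanks(wordform):
    """Return (leading_blank, bare_form, trailing_blank) for a wordform."""
    no_lead = wordform.lstrip(" ")
    bare = no_lead.rstrip(" ")
    lead = wordform[: len(wordform) - len(no_lead)]
    trail = no_lead[len(bare):]
    return lead, bare, trail

def to_cg_lines(wordforms):
    """Emit cohort lines, with any edge spaces moved to ':' blank lines."""
    lines = []
    for wordform in wordforms:
        lead, bare, trail = peel_blanks(wordform)
        if lead:
            lines.append(":" + lead)       # space before the split part
        lines.append('"<' + bare + '>"')   # cohort line without the space
        if trail:
            lines.append(":" + trail)      # space after the split part
    return lines

# Split point before the space (the case reported here).
print(to_cg_lines(["Koarra", " lávlui"]))
# Split point after the space (the case already handled).
print(to_cg_lines(["Koarra ", "lávlui"]))
```

Both calls yield the same three lines, so the wordform itself never carries the space and offsets computed from it line up with the markup.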
Date: 2020-09-02 11:56:52 +0200
From: Kevin Brubeck Unhammer <<unhammer+apertium>>
(committed r14806 to vislcg3 svn; should be in the new package tomorrow)
Date: 2020-09-24 14:36:09 +0200
From: Linda Wiechetek <
So it is fixed? Cool! Will try it later today.
I found a new example today (haven't updated vislcg3 yet) if anyone wants to test:
Raporta buktá ovdan daid guovddášgáibádusaid, maid sámi searvvuš ja sámi konfereansat leat buktán ovdan julggaštusaineaset ja sámedikkiid iežaset barggus.
The problem is between sámi and konfereansat.
"<sámi>"
"sápmi" N Sem/Hum_Lang Sg Gen
Date: 2020-09-24 14:40:48 +0200
From: Linda Wiechetek <
Just tested it after getting a new vislcg3 and the previous sentence works!! Thanks Kevin!
"<sámi>"
"sápmi" N Sem/Hum_Lang Sg Gen
We just gotta see how the compounding works now.
This issue was created automatically with bugzilla2github
Bugzilla Bug 2672
Date: 2020-09-01T13:05:19+02:00 From: Linda Wiechetek <>
To: Kevin Brubeck Unhammer <<unhammer+apertium>>
CC: lene.antonsen, sjur.n.moshagen, thomas.omma, tommi.pirinen, trond.trosterud
Last updated: 2020-09-24T14:40:48+02:00