giellalt / lang-sme

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Northern Sami language
GNU General Public License v3.0

MSWord moves word to upper line when correcting space error #50

Open duomdaamaendra opened 2 years ago

duomdaamaendra commented 2 years ago

B. Moske (s.25) «Mun in jáme/mu luondu dušše rievdá» Paltto (s.37) «mánát sturrot/mun ieš boarásmuvan» ?? B.Moske (s.39) «Nu jođánit moai rávásmuvaime» … (s.47) /Rumaš goldná dađistaga»

This happens when correcting "B.Moske" to "B. Moske":

[Screenshot, 2022-02-08 at 19:04:49]

The problem occurs because CR(LF) is not escaped in the various tools:

duomdaamaendra commented 2 years ago

This does not happen in Google Docs.

lynnda-hill commented 1 year ago

When fixing ?? to ? ?, new suggestions appear: ?B. can be fixed to ? B. However, there is a newline after ? which the program seems to ignore.

snomos commented 10 months ago

It seems that the problem is that we haven't considered CARRIAGE RETURN / U+000D (\r) in our processing. I assume it should be added to our whitespace analyser.
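To illustrate the hypothesis, here is a minimal Python sketch. It is not the actual whitespace analyser (which is an HFST transducer); the two character classes below are assumptions made up for illustration. A whitespace pattern that omits \r leaves the CR attached to the preceding token, while one that includes it treats CR LF as a single blank.

```python
import re

# Hypothetical whitespace classes, for illustration only.
WS_WITHOUT_CR = re.compile(r"([ \t\n]+)")
WS_WITH_CR = re.compile(r"([ \t\r\n]+)")

# The critical part of the test text, with the Word line break as CR LF.
text = "??\r\nB.Moske"

# The capturing group makes split() return the blanks as well.
print(WS_WITHOUT_CR.split(text))  # ['??\r', '\n', 'B.Moske']
print(WS_WITH_CR.split(text))     # ['??', '\r\n', 'B.Moske']
```

In the first result the CR ends up inside the "??" chunk instead of in the blank, which is the kind of stray character that could confuse later steps.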

snomos commented 10 months ago

Something very strange happens that looks like a bug. With the following minimal test text:

boarásmuvan» ?? B.Moske

(copy to MS Word, paste it in a new document, and copy it back from the Word file if the CR is lost) I get the following in UnicodeChecker:


CR (U+000D) is clearly located directly after the two question marks, and before the newline.
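For readers without UnicodeChecker, the same inspection can be approximated with a short Python sketch. It assumes the line break in the test text is CR LF, as described here; the `describe` helper is a name made up for this example.

```python
import unicodedata

# Test text from this thread, with the CR LF line break written out explicitly.
text = "boarásmuvan» ??\r\nB.Moske"

def describe(text):
    """Return (index, code point, name) for every character in text."""
    rows = []
    for i, ch in enumerate(text):
        # Control characters such as CR and LF have no Unicode name,
        # so fall back to a placeholder instead of raising ValueError.
        name = unicodedata.name(ch, "<control>")
        rows.append((i, f"U+{ord(ch):04X}", name))
    return rows

for row in describe(text):
    print(*row)
```

The row for U+000D should appear directly after the second U+003F (QUESTION MARK) and directly before U+000A.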

Now store the test text (with the CR char) in a test file, and run it through the grammar checker:

cat test.txt | ./tools/grammarcheckers/modes/smegramrelease.mode

The result is this:

    "boarásmuvvat" Err/Orth-a-á <mv> V IV Ind Prs Sg1 <W:0.0> <firstCohort> @+FMAINV &LINK &punct-aistton-right ID:1
    "boarásmuvvat" v1 <mv> V IV Ind Prs Sg1 <W:0.0> <firstCohort> @+FMAINV &LINK &punct-aistton-right ID:1
    "»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide> &punct-aistton-right &space-before-punct-mark &LINK ID:2 R:LEFT:1
    "»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide> "boarásmuvan”"S &punct-aistton-right &SUGGESTWF ID:2 R:LEFT:1
    "”" PUNCT RIGHT Err/Orth <W:0.0> <SpaceOnRightSide> &LINK &space-before-punct-mark ID:2 R:LEFT:1
    "?" CLB <W:0.0> <SpaceBeforePunctMark>

    "?" CLB <W:0.0> <NoSpaceAfterPunctMark> &no-space-after-punct-mark ID:5 R:RIGHT:7
    "?" CLB <W:0.0> <NoSpaceAfterPunctMark> "? B."S &no-space-after-punct-mark &SUGGESTWF ID:5 R:RIGHT:7

    "B" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <NoSpaceAfterPunctMark> @HNOUN &no-space-after-punct-mark &LINK ID:7
    "Balphabet" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <NoSpaceAfterPunctMark> @HNOUN &no-space-after-punct-mark &LINK ID:7
    "Moske" N Prop Sem/Plc Sg Nom <W:0.0> <LastCohort> @HNOUN

Suddenly the CR char (and the newline) is placed before the two question marks.

That is, the character stream has been changed somewhere in the processing. That should not happen.

snomos commented 10 months ago

The tokeniser/analyser is fine:

cat test.txt | ./tools/grammarcheckers/modes/smegramrelease0-morph.mode
    "boarásmuvvat" Err/Orth-a-á V IV Ind Prs Sg1 <W:0.0>
    "boarásmuvvat" v1 V IV Ind Prs Sg1 <W:0.0>
    "»" PUNCT RIGHT <W:0.0>
    "”" PUNCT RIGHT Err/Orth <W:0.0>
    "?" CLB <W:0.0>
    "?" CLB <W:0.0>
    "." CLB <W:0.0> "<.>"
        "B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
    "B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
    "." CLB <W:0.0> "<.>"
        "B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
    "B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
    "." CLB <W:0.0> "<.>"
        "B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
    "B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
    "." CLB <W:0.0> "<.>"
        "B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
    "B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
    "." CLB <W:0.0> "<.>"
        "Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
    "." CLB <W:0.0> "<.>"
        "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
    "." CLB <W:0.0> "<.>"
        "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
    "." CLB <W:0.0> "<.>"
        "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
    "." CLB <W:0.0> "<.>"
        "b" Adv Sem/Time ABBR Gram/TNumAbbr Attr <W:0.0> "<B>"
    "." CLB <W:0.0> "<.>"
        "b" Adv Sem/Time ABBR Gram/TNumAbbr <W:0.0> "<B>"
    "Moske" N Prop Sem/Plc Attr <W:0.0>
    "Moske" N Prop Sem/Plc Sg Nom <W:0.0>

snomos commented 10 months ago

The first whitespace analyser moves the chars one place:

cat test.txt | ./tools/grammarcheckers/modes/smegramrelease1-blanktag.mode
    "boarásmuvvat" Err/Orth-a-á V IV Ind Prs Sg1 <W:0.0> <firstCohort>
    "boarásmuvvat" v1 V IV Ind Prs Sg1 <W:0.0> <firstCohort>
    "»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide>
    "”" PUNCT RIGHT Err/Orth <W:0.0> <SpaceOnRightSide>
    "?" CLB <W:0.0>
    "?" CLB <W:0.0>
    "." CLB <W:0.0> "<.>"
        "B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
    "B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
    "." CLB <W:0.0> "<.>"
        "B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
    "B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
    "." CLB <W:0.0> "<.>"
        "B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
    "B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
    "." CLB <W:0.0> "<.>"
        "B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
    "B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
    "." CLB <W:0.0> "<.>"
        "Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
    "." CLB <W:0.0> "<.>"
        "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
    "." CLB <W:0.0> "<.>"
        "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
    "." CLB <W:0.0> "<.>"
        "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
    "." CLB <W:0.0> "<.>"
        "b" Adv Sem/Time ABBR Gram/TNumAbbr Attr <W:0.0> "<B>"
    "." CLB <W:0.0> "<.>"
        "b" Adv Sem/Time ABBR Gram/TNumAbbr <W:0.0> "<B>"
    "Moske" N Prop Sem/Plc Attr <W:0.0> <LastCohort>
    "Moske" N Prop Sem/Plc Sg Nom <W:0.0> <LastCohort>

snomos commented 10 months ago

And then they are moved another time by the second whitespace analyser:

    "boarásmuvvat" Err/Orth-a-á V IV Ind Prs Sg1 <W:0.0> <firstCohort>
    "boarásmuvvat" v1 V IV Ind Prs Sg1 <W:0.0> <firstCohort>
    "»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide>
    "”" PUNCT RIGHT Err/Orth <W:0.0> <SpaceOnRightSide>
    "?" CLB <W:0.0> <NoSpaceAfterPunctMark> <SpaceBeforePunctMark>
    "?" CLB <W:0.0> <NoSpaceAfterPunctMark>
    "B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> <NoSpaceAfterPunctMark>
    "B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> <NoSpaceAfterPunctMark>
    "B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> <NoSpaceAfterPunctMark>
    "B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <NoSpaceAfterPunctMark>
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> <NoSpaceAfterPunctMark>
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> <NoSpaceAfterPunctMark>
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> <NoSpaceAfterPunctMark>
    "Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <NoSpaceAfterPunctMark>

    "Moske" N Prop Sem/Plc Attr <W:0.0> <LastCohort>
    "Moske" N Prop Sem/Plc Sg Nom <W:0.0> <LastCohort>

So something is clearly wrong in the whitespace analysers.

snomos commented 10 months ago

I tried fixing the regex to allow CR, but that did not help. Could you have a look, @unhammer?

unhammer commented 10 months ago

This is not fine. That should probably be


which would mean a newline occurred. There should be an initial colon before any lines with unanalysed data. Anything without an initial colon/tab/quote is ignored by divvun-suggest.
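The first-character rule described above can be sketched in Python as follows. This is only an illustration of the convention as stated in this comment, not code from divvun-suggest; the function name, the category labels and the sample lines are all made up.

```python
def classify(line):
    """Classify one line of a CG stream by its first character."""
    if line.startswith(":"):
        return "blank"      # unanalysed data between tokens
    if line.startswith("\t"):
        return "reading"    # analysis line belonging to the current cohort
    if line.startswith('"'):
        return "wordform"   # start of a new cohort
    return "ignored"        # anything else, including an empty line

# Made-up sample lines; note the empty line in the middle.
stream = ['"<?>"', '\t"?" CLB <W:0.0>', "", ": ", '"<B>"']
print([classify(line) for line in stream])
# ['wordform', 'reading', 'ignored', 'blank', 'wordform']
```

Under this rule, a bare empty line in the stream carries no information for the consumer, so a line break that is emitted without the leading colon is effectively lost.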

Got to fix this in hfst-tokenise and divvun-suggest.

flammie commented 1 month ago

Does this work correctly now? I get:

$ cat ~/github/divvun/libdivvun/foo \
  | ~/github/hfst/hfst/tools/src/hfst-tokenize -g tools/grammarcheckers/tokeniser-gramcheck-gt-desc.pmhfst \
  | divvun-blanktag tools/grammarcheckers/analyser-gt-whitespace.hfst \
  | vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/valency.bin' \
  | vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/mwe-dis.bin' \
  | cg-mwesplit \
  | divvun-blanktag '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/analyser-gt-errorwhitespace.hfst' \
  | divvun-cgspell -n 10 -b 15.000000 -w 5000.000000 -u 0.400000 -l '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/acceptor.default.hfst' -m '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/errmodel.default.hfst' \
  | vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/valency-postspell.bin' \
  | vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/grc-disambiguator.bin' \
  | vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/spellchecker.bin' \
  | vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/grammarchecker-release.bin' \
  | divvun-suggest -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/generator-gramcheck-gt-norm.hfstol' -m '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/errors.xml' -l se
    "boarásmuvvat" v1 <mv> V IV Ind Prs Sg1 <W:0.0> <firstCohort> @+FMAINV &LINK &punct-aistton-right ID:1
    "»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide> &punct-aistton-right &space-before-punct-mark &LINK ID:2 R:LEFT:1
    "»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide> "boarásmuvan”"S &punct-aistton-right &SUGGESTWF ID:2 R:LEFT:1
    "”" PUNCT RIGHT Err/Orth <W:0.0> <SpaceOnRightSide> &LINK &space-before-punct-mark ID:2 R:LEFT:1
    "?" CLB <W:0.0> <SpaceBeforePunctMark>

    "?" CLB <W:0.0> <LastCohortOfParagraph>

    "B" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <firstCohortOfParagraph> <NoSpaceAfterPunctMark> @HNOUN &no-space-after-punct-mark ID:7 R:RIGHT:8
    "B" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <firstCohortOfParagraph> <NoSpaceAfterPunctMark> @HNOUN "B. Moske"S &no-space-after-punct-mark &SUGGESTWF ID:7 R:RIGHT:8
    "Balphabet" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <firstCohortOfParagraph> <NoSpaceAfterPunctMark> @HNOUN &no-space-after-punct-mark ID:7 R:RIGHT:8
    "Balphabet" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <firstCohortOfParagraph> <NoSpaceAfterPunctMark> @HNOUN "B. Moske"S &no-space-after-punct-mark &SUGGESTWF ID:7 R:RIGHT:8
    "Moske" N Prop Sem/Plc Sg Nom <W:0.0> <LastCohort> @HNOUN &LINK &no-space-after-punct-mark ID:8
$ xxd ~/github/divvun/libdivvun/foo 
00000000: 626f 6172 c3a1 736d 7576 616e c2bb 203f  boar..smuvan.. ?
00000010: 3f0d 0a42 2e4d 6f73 6b65 0d0a            ?..B.Moske..
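
As a cross-check of the xxd dump above, a small Python sketch can decode the hex bytes and confirm where the CR LF pairs sit. Only the hex digits are taken from the dump; the variable names are made up for this example.

```python
# Hex bytes copied from the xxd output above (offsets and ASCII column dropped).
hex_dump = (
    "626f 6172 c3a1 736d 7576 616e c2bb 203f "
    "3f0d 0a42 2e4d 6f73 6b65 0d0a"
)

data = bytes.fromhex(hex_dump)  # fromhex() skips the spaces between groups
text = data.decode("utf-8")

print(repr(text))                           # 'boarásmuvan» ??\r\nB.Moske\r\n'
print("CR LF pairs:", data.count(b"\r\n"))  # CR LF pairs: 2
```

So the input file does contain CR LF directly after the two question marks and again at the end of the file.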