giellalt / lang-mns

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Mansi language
https://giellalt.uit.no
GNU Lesser General Public License v3.0

Speller suggestion issue #3

Open snomos opened 5 months ago

snomos commented 5 months ago

@Trondtr has reported:

echo а̄им | hfst-ospell -S tools/spellcheckers/mns.zhfst
"а̄им" is NOT in the lexicon:
Corrections for "а̄им":
вим    20.056900
аим    20.787788
йим    21.743298
мим    22.254124
сым    22.524096
оим    22.659590

Now compare this with the following:

grep а tools/spellcheckers/strings.default.txt
а:а̄    1
#а̄:я   4
а̄:а    1
ся:ща   2
Ся:Ща   2

Why is аим only suggested second?

The core of the issue is that а̄ consists of a base character plus a combining macron: how well or poorly does the error model handle combining diacritics when producing suggestions?
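
For reference, here is what that sequence looks like at the codepoint level (a minimal Python sketch using only the standard unicodedata module; the word is the misspelling reported above):

import unicodedata

word = "а̄им"  # as reported: Cyrillic а + combining macron + и + м
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0430 CYRILLIC SMALL LETTER A
# U+0304 COMBINING MACRON
# U+0438 CYRILLIC SMALL LETTER I
# U+043C CYRILLIC SMALL LETTER EM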

snomos commented 5 months ago

The interesting thing happens when you use divvunspell instead of hfst-ospell:

echo а̄им | divvunspell suggest -a tools/spellcheckers/mns.zhfst
Reading from stdin...
Input: а̄им     [INCORRECT]
аим     17.787788
а̄тим       24.199446
а̄гим       25.03525
аким        26.155512
ам      27.999014
агим        28.65959
атым        32.826374
аюм     34.254124
аи      43.7433
аис     43.966442

For comparison, hfst-ospell-office gives the same output as hfst-ospell:

echo '5 а̄им' | hfst-ospell-office -d tools/spellcheckers/mns.zhfst
@@ Loading tools/spellcheckers/mns.zhfst with args max-weight=-1.00, beam=-1.00, time-cutoff=6.00
@@ hfst-ospell-office is alive
&   вим (20.06;0)   аим (20.79;0)   йим (21.74;0)   мим (22.25;0)   сым (22.52;0)

snomos commented 5 months ago

@flammie do you have comments or insights re combining diacritics and hfst-ospell?

In any case: I am not sure how much time we should spend on this, given that it works correctly with divvunspell, and divvunspell is used everywhere except in the grammar checker, and from this summer in the grammar checker as well. That is, hfst-ospell is not a first priority.

flammie commented 5 months ago

Yeah, it seems quite fragile here already at the compilation of the separate error model part:

$ echo 'а̄' | hfst-lookup .generated/strings.all.default.hfst 
hfst-lookup: warning: It is not possible to perform fast lookups with OpenFST, std arc, tropical semiring format automata.
Using HFST basic transducer format and performing slow lookups
> а̄    а̄̄ 1,000000
а̄  а   2,000000
$ hfst-fst2txt .generated/strings.all.default.hfst | fgrep а
0   0   а   а   0.000000
1   21  а   а   0.000000
2   43  я   а   0.000000
41  44  я   а   0.000000
58  58  а   а   0.000000
$  hfst-fst2txt .generated/strings.all.default.hfst | fgrep а̄
$  hfst-fst2txt .generated/strings.all.default.hfst | fgrep $'\u0304'
0   0           0.000000
7   8           0.000000
13  14  @0@     0.000000
13  27      @0@ 0.000000
15  16  @0@     0.000000
15  28      @0@ 0.000000
17  18  @0@     0.000000
17  29      @0@ 0.000000
19  20  @0@     0.000000
19  30      @0@ 0.000000
21  22  @0@     0.000000
21  31      @0@ 0.000000
23  24  @0@     0.000000
23  32      @0@ 0.000000
25  26  @0@     0.000000
25  33      @0@ 0.000000
58  58          0.000000
$  hfst-fst2txt .generated/strings.all.default.hfst
0   0   @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.000000
0   0           0.000000
0   0   С   С   0.000000
0   0   Щ   Щ   0.000000
0   0   а   а   0.000000
0   0   г   г   0.000000
0   0   е   е   0.000000
0   0   и   и   0.000000
0   0   й   й   0.000000
0   0   к   к   0.000000
0   0   н   н   0.000000
0   0   о   о   0.000000
0   0   р   р   0.000000
0   0   с   с   0.000000
0   0   т   т   0.000000
0   0   у   у   0.000000
0   0   щ   щ   0.000000
0   0   ы   ы   0.000000
0   0   ь   ь   0.000000
0   0   э   э   0.000000
0   0   ю   ю   0.000000
0   0   я   я   0.000000
0   0   ӈ   ӈ   0.000000
0   0   ӣ   ӣ   0.000000
0   0   ӯ   ӯ   0.000000
0   1   @0@ @0@ 0.000000
1   2   с   щ   0.000000
1   4   т   к   0.000000
1   9   т   т   0.000000
1   13  я   я   0.000000
1   15  э   э   0.000000
1   17  ы   ы   0.000000
1   19  ю   ю   0.000000
1   21  а   а   0.000000
1   23  е   е   0.000000
1   25  о   о   0.000000
1   34  н   ӈ   0.000000
1   36  ӈ   н   0.000000
1   38  г   ӈ   0.000000
1   41  С   Щ   0.000000
1   45  н   н   0.000000
1   48  ӯ   ӯ   0.000000
1   51  у   у   0.000000
2   3   ь   @0@ 0.000000
2   40  ю   у   0.000000
2   43  я   а   0.000000
3   58  @0@ @0@ 1.000000
4   5   и   и   0.000000
4   6   ӣ   ӣ   0.000000
4   7   е   е   0.000000
5   58  @0@ @0@ 2.000000
6   58  @0@ @0@ 2.000000
7   8           0.000000
7   58  @0@ @0@ 2.000000
8   58  @0@ @0@ 2.000000
9   10  т   к   0.000000
10  11  е   е   0.000000
10  12  и   и   0.000000
11  58  @0@ @0@ 2.000000
12  58  @0@ @0@ 2.000000
13  14  @0@     0.000000
13  27      @0@ 0.000000
14  58  @0@ @0@ 1.000000
15  16  @0@     0.000000
15  28      @0@ 0.000000
16  58  @0@ @0@ 1.000000
17  18  @0@     0.000000
17  29      @0@ 0.000000
18  58  @0@ @0@ 1.000000
19  20  @0@     0.000000
19  30      @0@ 0.000000
20  58  @0@ @0@ 1.000000
21  22  @0@     0.000000
21  31      @0@ 0.000000
22  58  @0@ @0@ 1.000000
23  24  @0@     0.000000
23  32      @0@ 0.000000
24  58  @0@ @0@ 1.000000
25  26  @0@     0.000000
25  33      @0@ 0.000000
26  58  @0@ @0@ 1.000000
27  58  @0@ @0@ 2.000000
28  58  @0@ @0@ 2.000000
29  58  @0@ @0@ 2.000000
30  58  @0@ @0@ 2.000000
31  58  @0@ @0@ 2.000000
32  58  @0@ @0@ 2.000000
33  58  @0@ @0@ 2.000000
34  35  г   @0@ 0.000000
35  58  @0@ @0@ 2.000000
36  37  @0@ г   0.000000
37  58  @0@ @0@ 2.000000
38  39  н   н   0.000000
39  58  @0@ @0@ 3.000000
40  58  @0@ @0@ 2.000000
41  42  ю   у   0.000000
41  44  я   а   0.000000
42  58  @0@ @0@ 2.000000
43  58  @0@ @0@ 2.000000
44  58  @0@ @0@ 2.000000
45  46  т   р   0.000000
46  47  р   @0@ 0.000000
47  58  @0@ @0@ 4.000000
48  49  й   и   0.000000
48  54  й   ы   0.000000
49  50  и   @0@ 0.000000
50  58  @0@ @0@ 4.000000
51  52  й   и   0.000000
51  56  й   ы   0.000000
52  53  и   @0@ 0.000000
53  58  @0@ @0@ 4.000000
54  55  ы   @0@ 0.000000
55  58  @0@ @0@ 4.000000
56  57  ы   @0@ 0.000000
57  58  @0@ @0@ 4.000000
58  58  @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.000000
58  58          0.000000
58  58  С   С   0.000000
58  58  Щ   Щ   0.000000
58  58  а   а   0.000000
58  58  г   г   0.000000
58  58  е   е   0.000000
58  58  и   и   0.000000
58  58  й   й   0.000000
58  58  к   к   0.000000
58  58  н   н   0.000000
58  58  о   о   0.000000
58  58  р   р   0.000000
58  58  с   с   0.000000
58  58  т   т   0.000000
58  58  у   у   0.000000
58  58  щ   щ   0.000000
58  58  ы   ы   0.000000
58  58  ь   ь   0.000000
58  58  э   э   0.000000
58  58  ю   ю   0.000000
58  58  я   я   0.000000
58  58  ӈ   ӈ   0.000000
58  58  ӣ   ӣ   0.000000
58  58  ӯ   ӯ   0.000000
58  0.000000

snomos commented 5 months ago

Do I read the above correctly, @flammie, when I find no occurrences of а̄ in the ATT version of strings.all.default.hfst? So the base char + combining diacritic sequence is lost during compilation?

If so, how can we force such a sequence to be treated as one symbol, in all contexts? The lexical FST does treat them as one symbol (as opposed to the tokeniser FST, which does the opposite on the input side).

I assume the question relates to all .txt input files for the error model.

flammie commented 5 months ago

Mm, the strings compilation uses hfst-strings2fst without any alphabets / multichars, so it must treat combining characters as arcs of their own in the graph. I guess this leads to a situation where the suggestion from а̄им to вим weighs а:в, while the one to аим rather weighs combining macron:0. Maybe applying NFC/NFD filters in the error models could work, if this is actually the main issue.

snomos commented 5 months ago

Mm, that might be a good idea. I will have a look.

snomos commented 5 months ago

https://github.com/giellalt/giella-core/commit/c73d62a4bc57a7b773c46751a3092a3d694332e1 fixes a bug that prevented the spellrestrict.* files from being built. But that is not enough: the generated spellrestrict.regex file does not include the relevant letters.

snomos commented 5 months ago

Ah, bummer on my part. The spellrestrict.* files will not solve this, because they assume the existence of an NFC form. In this case the problem is that there IS NO NFC form of the relevant letters, but we still want the FST to treat them as one symbol, i.e. with no arc boundary between the base letter and the combining diacritic.
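
That claim is easy to verify with Python's standard unicodedata module (a minimal check, nothing project-specific assumed): NFC cannot merge а + combining macron because no precomposed codepoint exists, while и + combining macron does compose to ӣ.

import unicodedata

# а + combining macron: no precomposed letter exists, so NFC keeps two codepoints
print(len(unicodedata.normalize("NFC", "а\u0304")))  # 2
# и + combining macron: the precomposed letter ӣ (U+04E3) exists, so NFC yields one codepoint
print(len(unicodedata.normalize("NFC", "и\u0304")))  # 1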

So we need a new type of filter that finds all combining diacritics and the corresponding base letter(s), and then generates a filter of the following type:

{а̄} -> "а̄";

and then applies this filter to all error model files being read by hfst-strings2fst on both sides. That should hopefully fix the issue, and make it predictable to work with combining diacritics in the speller.
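
As a starting point, something along these lines could generate such rules. This is only a rough sketch, not existing giella-core code; the input paths and the exact rule syntax would need to be adapted to the actual build setup:

#!/usr/bin/env python3
"""Sketch: collect base letter + combining diacritic sequences from the
error model string files and print one rewrite rule per sequence, turning
it into a single multichar symbol. Hypothetical helper, not part of
giella-core."""
import sys
import unicodedata

def combining_sequences(text):
    """Yield each base character followed by one or more combining marks."""
    seq = ""
    for ch in text:
        if unicodedata.combining(ch):
            if seq:
                seq += ch
        else:
            if len(seq) > 1:
                yield seq
            seq = ch
    if len(seq) > 1:
        yield seq

found = set()
for path in sys.argv[1:]:              # e.g. tools/spellcheckers/strings.*.txt
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#")[0]  # ignore commented-out entries
            found.update(combining_sequences(line))

for seq in sorted(found):
    print(f'{{{seq}}} -> "{seq}";')    # e.g. {а̄} -> "а̄";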

@flammie feel free to continue this work 🙂