Open snomos opened 5 months ago
The interesting thing happens when you use divvunspell instead of hfst-ospell:
echo а̄им | divvunspell suggest -a tools/spellcheckers/mns.zhfst
Reading from stdin...
Input: а̄им [INCORRECT]
аим 17.787788
а̄тим 24.199446
а̄гим 25.03525
аким 26.155512
ам 27.999014
агим 28.65959
атым 32.826374
аюм 34.254124
аи 43.7433
аис 43.966442
For comparison, hfst-ospell-office gives the same output as hfst-ospell:
echo '5 а̄им' | hfst-ospell-office -d tools/spellcheckers/mns.zhfst
@@ Loading tools/spellcheckers/mns.zhfst with args max-weight=-1.00, beam=-1.00, time-cutoff=6.00
@@ hfst-ospell-office is alive
& вим (20.06;0) аим (20.79;0) йим (21.74;0) мим (22.25;0) сым (22.52;0)
@flammie, do you have any comments or insights regarding combining diacritics and hfst-ospell?
In any case: I am not sure how much time we should spend on this, given that it works correctly using divvunspell, and divvunspell is used everywhere except in the grammar checker (and from the summer also in the grammar checker). That is, hfst-ospell is not first priority.
Yeah, it seems quite fragile already at the compilation of the separate error model part:
$ echo 'а̄' | hfst-lookup .generated/strings.all.default.hfst
hfst-lookup: warning: It is not possible to perform fast lookups with OpenFST, std arc, tropical semiring format automata.
Using HFST basic transducer format and performing slow lookups
> а̄ а̄̄ 1,000000
а̄ а 2,000000
$ hfst-fst2txt .generated/strings.all.default.hfst | fgrep а
0 0 а а 0.000000
1 21 а а 0.000000
2 43 я а 0.000000
41 44 я а 0.000000
58 58 а а 0.000000
$ hfst-fst2txt .generated/strings.all.default.hfst | fgrep а̄
$ hfst-fst2txt .generated/strings.all.default.hfst | fgrep $'\u0304'
0 0 0.000000
7 8 0.000000
13 14 @0@ 0.000000
13 27 @0@ 0.000000
15 16 @0@ 0.000000
15 28 @0@ 0.000000
17 18 @0@ 0.000000
17 29 @0@ 0.000000
19 20 @0@ 0.000000
19 30 @0@ 0.000000
21 22 @0@ 0.000000
21 31 @0@ 0.000000
23 24 @0@ 0.000000
23 32 @0@ 0.000000
25 26 @0@ 0.000000
25 33 @0@ 0.000000
58 58 0.000000
$ hfst-fst2txt .generated/strings.all.default.hfst
0 0 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.000000
0 0 0.000000
0 0 С С 0.000000
0 0 Щ Щ 0.000000
0 0 а а 0.000000
0 0 г г 0.000000
0 0 е е 0.000000
0 0 и и 0.000000
0 0 й й 0.000000
0 0 к к 0.000000
0 0 н н 0.000000
0 0 о о 0.000000
0 0 р р 0.000000
0 0 с с 0.000000
0 0 т т 0.000000
0 0 у у 0.000000
0 0 щ щ 0.000000
0 0 ы ы 0.000000
0 0 ь ь 0.000000
0 0 э э 0.000000
0 0 ю ю 0.000000
0 0 я я 0.000000
0 0 ӈ ӈ 0.000000
0 0 ӣ ӣ 0.000000
0 0 ӯ ӯ 0.000000
0 1 @0@ @0@ 0.000000
1 2 с щ 0.000000
1 4 т к 0.000000
1 9 т т 0.000000
1 13 я я 0.000000
1 15 э э 0.000000
1 17 ы ы 0.000000
1 19 ю ю 0.000000
1 21 а а 0.000000
1 23 е е 0.000000
1 25 о о 0.000000
1 34 н ӈ 0.000000
1 36 ӈ н 0.000000
1 38 г ӈ 0.000000
1 41 С Щ 0.000000
1 45 н н 0.000000
1 48 ӯ ӯ 0.000000
1 51 у у 0.000000
2 3 ь @0@ 0.000000
2 40 ю у 0.000000
2 43 я а 0.000000
3 58 @0@ @0@ 1.000000
4 5 и и 0.000000
4 6 ӣ ӣ 0.000000
4 7 е е 0.000000
5 58 @0@ @0@ 2.000000
6 58 @0@ @0@ 2.000000
7 8 0.000000
7 58 @0@ @0@ 2.000000
8 58 @0@ @0@ 2.000000
9 10 т к 0.000000
10 11 е е 0.000000
10 12 и и 0.000000
11 58 @0@ @0@ 2.000000
12 58 @0@ @0@ 2.000000
13 14 @0@ 0.000000
13 27 @0@ 0.000000
14 58 @0@ @0@ 1.000000
15 16 @0@ 0.000000
15 28 @0@ 0.000000
16 58 @0@ @0@ 1.000000
17 18 @0@ 0.000000
17 29 @0@ 0.000000
18 58 @0@ @0@ 1.000000
19 20 @0@ 0.000000
19 30 @0@ 0.000000
20 58 @0@ @0@ 1.000000
21 22 @0@ 0.000000
21 31 @0@ 0.000000
22 58 @0@ @0@ 1.000000
23 24 @0@ 0.000000
23 32 @0@ 0.000000
24 58 @0@ @0@ 1.000000
25 26 @0@ 0.000000
25 33 @0@ 0.000000
26 58 @0@ @0@ 1.000000
27 58 @0@ @0@ 2.000000
28 58 @0@ @0@ 2.000000
29 58 @0@ @0@ 2.000000
30 58 @0@ @0@ 2.000000
31 58 @0@ @0@ 2.000000
32 58 @0@ @0@ 2.000000
33 58 @0@ @0@ 2.000000
34 35 г @0@ 0.000000
35 58 @0@ @0@ 2.000000
36 37 @0@ г 0.000000
37 58 @0@ @0@ 2.000000
38 39 н н 0.000000
39 58 @0@ @0@ 3.000000
40 58 @0@ @0@ 2.000000
41 42 ю у 0.000000
41 44 я а 0.000000
42 58 @0@ @0@ 2.000000
43 58 @0@ @0@ 2.000000
44 58 @0@ @0@ 2.000000
45 46 т р 0.000000
46 47 р @0@ 0.000000
47 58 @0@ @0@ 4.000000
48 49 й и 0.000000
48 54 й ы 0.000000
49 50 и @0@ 0.000000
50 58 @0@ @0@ 4.000000
51 52 й и 0.000000
51 56 й ы 0.000000
52 53 и @0@ 0.000000
53 58 @0@ @0@ 4.000000
54 55 ы @0@ 0.000000
55 58 @0@ @0@ 4.000000
56 57 ы @0@ 0.000000
57 58 @0@ @0@ 4.000000
58 58 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.000000
58 58 0.000000
58 58 С С 0.000000
58 58 Щ Щ 0.000000
58 58 а а 0.000000
58 58 г г 0.000000
58 58 е е 0.000000
58 58 и и 0.000000
58 58 й й 0.000000
58 58 к к 0.000000
58 58 н н 0.000000
58 58 о о 0.000000
58 58 р р 0.000000
58 58 с с 0.000000
58 58 т т 0.000000
58 58 у у 0.000000
58 58 щ щ 0.000000
58 58 ы ы 0.000000
58 58 ь ь 0.000000
58 58 э э 0.000000
58 58 ю ю 0.000000
58 58 я я 0.000000
58 58 ӈ ӈ 0.000000
58 58 ӣ ӣ 0.000000
58 58 ӯ ӯ 0.000000
58 0.000000
Do I read the above correctly, @flammie, when I find no occurrences of а̄ in the ATT version of strings.all.default.hfst? So the base char + combining diacritic sequence is lost during compilation?
If so, how can we force such a sequence to be treated as one symbol, in all contexts? The lexical FST does treat them as one symbol (as opposed to the tokeniser FST, which does the opposite on the input side).
I assume the question relates to all .txt input files for the error model.
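A rough way to check this mechanically, rather than by eye, is to scan the ATT dump for arc labels that contain combining marks. The sketch below assumes tab-separated ATT lines (source, target, input, output, weight); the function name classify_labels and the sample lines are hypothetical, not taken from the real dump.

```python
import unicodedata

def classify_labels(att_text):
    """Split arc labels containing combining marks into two sets:
    'lone' (the label is nothing but combining marks) and
    'attached' (a base character followed by combining marks).
    Assumes tab-separated ATT lines: source, target, input, output, weight."""
    lone, attached = set(), set()
    for line in att_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # final-state lines carry no labels
        for symbol in fields[2:4]:
            marks = [ch for ch in symbol if unicodedata.combining(ch)]
            if not marks:
                continue
            if len(marks) == len(symbol):
                lone.add(symbol)
            else:
                attached.add(symbol)
    return lone, attached

# Hypothetical sample lines, loosely modelled on the dump above.
sample = (
    "7\t8\t\u0304\t\u0304\t0.000000\n"
    "13\t14\t\u0304\t@0@\t0.000000\n"
    "0\t0\t\u04e3\t\u04e3\t0.000000\n"
    "58\t0.000000\n"
)
lone, attached = classify_labels(sample)
print("lone combining labels:", [hex(ord(s)) for s in lone])
print("base+mark labels:", sorted(attached))
```

If the macron only ever shows up in the lone set, the base letter and the diacritic are being compiled as separate arcs.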
Mm, strings compilation uses hfst-strings2fst without any alphabets / multichars, so it must treat combining characters as their own arcs in the graph. I guess this leads to a situation where a suggestion from а̄им to вим weighs а:в, whereas one to аим weighs combining macron:0 instead. Maybe applying NFC/NFD filters in the error models could work, if this is actually the main issue.
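To illustrate the difference in symbol segmentation, here is a minimal Python sketch. It is not how hfst-strings2fst is implemented; it only contrasts splitting by code point with keeping combining marks attached to their base letter. Both function names are made up for the illustration.

```python
import unicodedata

def code_point_symbols(word):
    """One symbol per code point: a combining mark becomes its own symbol."""
    return list(word)

def cluster_symbols(word):
    """Attach each combining mark to the preceding base character."""
    symbols = []
    for ch in word:
        if symbols and unicodedata.combining(ch):
            symbols[-1] += ch
        else:
            symbols.append(ch)
    return symbols

# CYRILLIC SMALL LETTER A + COMBINING MACRON, then i, m
word = "\u0430\u0304\u0438\u043c"
print(len(code_point_symbols(word)), code_point_symbols(word))
print(len(cluster_symbols(word)), cluster_symbols(word))
```

With code-point segmentation the word has four symbols, so dropping the macron is a single edit on its own arc; with cluster segmentation it has three, and а̄ can only be replaced as a whole.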
mm, that might be a good idea. I will have a look.
https://github.com/giellalt/giella-core/commit/c73d62a4bc57a7b773c46751a3092a3d694332e1 fixes a bug that prevented the spellrestrict.* files from being built. But that is not enough: the generated spellrestrict.regex file does not include the relevant letters.
Ah, bummer on my part. The spellrestrict.* files will not solve this, because they assume the existence of an NFC form. In this case the problem is that there IS NO precomposed NFC form of the relevant letters, but we still want the FST to treat them as one symbol, i.e. no arc between the base letter and the combining diacritic.
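The missing NFC form is easy to confirm with Python's standard unicodedata module. The helper name has_precomposed_form is just for this illustration; it checks whether a base letter plus combining macron composes to a single code point.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON

def has_precomposed_form(base, mark=MACRON):
    """True if NFC composes base + mark into a single code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

# Cyrillic small a, i and u, each followed by a combining macron
for base in "\u0430\u0438\u0443":
    composed = unicodedata.normalize("NFC", base + MACRON)
    print(base + MACRON, [hex(ord(ch)) for ch in composed],
          has_precomposed_form(base))
```

Cyrillic и and у with macron compose to ӣ (U+04E3) and ӯ (U+04EF), which matches them appearing as single symbols in the ATT dump, whereas а with macron stays as two code points under NFC.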
So we need a new type of filter that finds all combining diacritics and the corresponding base letter(s), and then generates a filter of the following type:
{а̄} -> "а̄";
and then applies this filter to all error model files being read by hfst-strings2fst, on both sides. That should hopefully fix the issue, and make it predictable to work with combining diacritics in the speller.
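A minimal sketch of how such a filter could be generated, assuming the error model input consists of plain-text lines (the sample lines below are hypothetical). The functions combining_sequences and filter_rules are invented for this illustration and are not part of giella-core; whether the resulting rules compile and compose as intended in the build is untested here.

```python
import unicodedata

def combining_sequences(lines):
    """Collect base+combining-mark sequences that survive NFC,
    i.e. that have no precomposed single-code-point form."""
    found = set()
    for line in lines:
        cluster = ""
        for ch in unicodedata.normalize("NFC", line):
            if cluster and unicodedata.combining(ch):
                cluster += ch  # extend the current base+mark cluster
            else:
                if len(cluster) > 1:
                    found.add(cluster)
                cluster = ch
        if len(cluster) > 1:
            found.add(cluster)
    return sorted(found)

def filter_rules(sequences):
    """Emit one rule per sequence, in the form shown above."""
    return ['{%s} -> "%s";' % (seq, seq) for seq in sequences]

# Hypothetical error model lines: input:output<TAB>weight
lines = ["\u0430\u0304:\u0430\t1.0", "\u0438\u0304:\u0438\t1.0"]
for rule in filter_rules(combining_sequences(lines)):
    print(rule)
```

Normalising to NFC first means that sequences with a precomposed form (such as и + macron) are left to the existing NFC handling, and only the sequences without one get a rule.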
@flammie feel free to continue this work 🙂
@Trondtr has reported:
Now compare this with the following:
Why is ами suggested as second? The core of the issue is that а̄ has a base char + combining macron: how well or badly does the error model handle combining diacritics when it comes to suggestions?