Date: 2012-01-18 13:26:35 +0100
From: Jack Rueter <
main/kt/kom/hfst/default-error-model.txt
Questions of weight and how to continue this optimization.
We have the following line:
@@ о ӧ 0.5
This should have a second transition pair for the upper-case letters: О Ӧ "weight". But how do I arrive at the "weight"?
Scanning literature with an Abbyy 10 has shown that upper-case letters must be distinguished from lower-case letters.
А Л "weight"
Upper-case "А" has been rendered where the original contains upper-case "Л". The same problem does not occur with the lower-case letters "а" and "л". NB! Upper-case "А" (a vowel) is unlikely to be followed by a vowel within a word (1%).
ь ъ "weight"
There are three sets of consonants and consonant clusters that can precede these two signs:
(a) only the hard sign "ъ" is acceptable, in native words (табъяс)
(b) only the soft sign "ь" is acceptable (вичьяс)
(c) both the hard sign "ъ" and the soft sign "ь" are acceptable (канъяс vs. каньяс)
Date: 2012-01-18 14:46:52 +0100
From: Sjur Nørstebø Moshagen <
The error model file is a simple plain-text file used as input to a Python script that produces a transducer for correcting spelling errors.
By default, all transitions are given a weight of 1.0. If you give a specific transition another weight, suggestions generated by (among others) this transition will be promoted (if the weight is lower) or demoted (if the weight is higher).
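To make the effect of the weights concrete, here is a minimal toy sketch. It is not the HFST script or its transducer; the function name, the model dictionary and the а/о pair are all made up for illustration. It only shows how a lower weight moves a suggestion ahead of one that uses the default weight of 1.0.

```python
DEFAULT_WEIGHT = 1.0

# Hypothetical toy model: (wrong letter, intended letter) -> weight.
# Only the о/ӧ pair with weight 0.5 comes from this bug report.
EXAMPLE_MODEL = {
    ("о", "ӧ"): 0.5,
    ("а", "о"): DEFAULT_WEIGHT,
}


def rank_substitutions(word, model):
    """Return (candidate, weight) for every single-letter substitution
    allowed by the model, sorted so the lowest weight comes first."""
    candidates = []
    for i, letter in enumerate(word):
        for (source, target), weight in model.items():
            if letter == source:
                candidates.append((word[:i] + target + word[i + 1:], weight))
    return sorted(candidates, key=lambda item: item[1])


if __name__ == "__main__":
    # "кола" is just an arbitrary test string here.
    for candidate, weight in rank_substitutions("кола", EXAMPLE_MODEL):
        print(candidate, weight)
```

In this toy, the candidate produced by the о/ӧ transition is listed before the one produced by the default-weight transition, which is the "promotion" described above.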
Date: 2012-01-19 13:56:21 +0100
From: Sjur Nørstebø Moshagen <
Could you add a word or two to this bug report illustrating what you want to be able to correct, but which the present speller does not yet correct?
Date: 2012-01-20 12:55:42 +0100
From: Jack Rueter <
(In reply to comment #2)
> Could you add a word or two in this bug report that illustrates the things you want to be able to correct, but is not yet corrected in the present speller?
In scanned texts and materials recognized with an Abbyy 10, I have run into problems with initial upper-case Cyrillic L "Л" being recognized as upper-case Cyrillic A "А". This is a 1-to-1 error: "Лена" (Lena) is falsely recognized as "Аена" (Aena).
A second problem is 2-to-1. In Komi the sequence of soft sign "ь" and "і" never occurs. More than likely the text should have been recognized as the single letter "ы".
I noticed that the individual letters 1-to-1 are separated by tabs, but when I attempted to indicate a 2-to-1 relation I failed to notice any improvement in the recognition.
For example, "вьілӧ" is an impossible combination in Komi; it should read "вылӧ".
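Purely to illustrate the two confusions described in this comment, they could be sketched as plain string operations. This is not part of the error model file or the speller; the function names and the vowel set are assumptions made for the example. The heuristic for initial "А" relies on the earlier observation that upper-case "А" is rarely followed by a vowel.

```python
SOFT_SIGN = "\u044c"   # ь
CYRILLIC_I = "\u0456"  # і
YERY = "\u044b"        # ы
UPPER_A = "\u0410"     # А
UPPER_L = "\u041b"     # Л

# Assumed vowel inventory, for illustration only:
# а е ё и і о ӧ у ы э ю я
VOWELS = set("\u0430\u0435\u0451\u0438\u0456\u043e\u04e7\u0443\u044b\u044d\u044e\u044f")


def fix_soft_sign_i(word):
    """2-to-1 case: replace the sequence ь + і with the single letter ы."""
    return word.replace(SOFT_SIGN + CYRILLIC_I, YERY)


def suggest_initial_l(word):
    """1-to-1 case: propose Л for an initial А that is followed by a vowel.

    Words that do not match the pattern are returned unchanged.
    """
    if len(word) > 1 and word[0] == UPPER_A and word[1].lower() in VOWELS:
        return UPPER_L + word[1:]
    return word


if __name__ == "__main__":
    print(fix_soft_sign_i("вьілӧ"))
    print(suggest_initial_l("Аена"))
```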
Date: 2012-01-27 11:17:06 +0100
From: Sjur Nørstebø Moshagen <
(In reply to comment #3)
> A second problem will be 2-to-1 In Komi the sequence soft sign "ь" and "і" never occurs. More than likely the recognition should have been a single letter "ы".
> I noticed that the individual letters 1-to-1 are separated by tabs, but when I attempted to indicate a 2-to-1 relation I failed to notice any improvement in the recognition.
It is not possible in this quite simple error model file to specify 2-to-1 or many-to-1 corrections, only 1-to-1 changes. A more complex error model formalism will have to be added as we develop the HFST speller infrastructure.
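As a rough sketch of what the 1-to-1 restriction means in practice, the check below parses one pair line and rejects anything that is not a single letter on each side. The layout "source TAB target TAB optional weight" is an assumption based on the tab-separated pairs mentioned in this thread; the function is hypothetical and is not the actual script that builds the transducer.

```python
DEFAULT_WEIGHT = 1.0


def parse_pair_line(line):
    """Parse 'source<TAB>target[<TAB>weight]' into (source, target, weight).

    Raises ValueError if the line has the wrong number of fields or if
    either side is longer than one character (i.e. not a 1-to-1 pair).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 tab-separated fields, got {len(fields)}")
    source, target = fields[0], fields[1]
    if len(source) != 1 or len(target) != 1:
        raise ValueError(f"only 1-to-1 pairs are supported: {source!r} -> {target!r}")
    # A missing weight falls back to the default of 1.0.
    weight = float(fields[2]) if len(fields) == 3 else DEFAULT_WEIGHT
    return source, target, weight


if __name__ == "__main__":
    print(parse_pair_line("о\tӧ\t0.5"))
    try:
        parse_pair_line("ьі\tы\t0.5")  # 2-to-1: rejected
    except ValueError as error:
        print("rejected:", error)
```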
Date: 2012-02-07 22:35:26 +0100
From: Sjur Nørstebø Moshagen <
I hope all questions have been answered, so I am closing this bug. What remains to be done, of course, is to add documentation on how to make working error models for HFST spellers.
This issue was created automatically with bugzilla2github
Bugzilla Bug 1255
Date: 2012-01-18T13:26:35+01:00 From: Jack Rueter <>
To: Sjur Nørstebø Moshagen <>
CC: ciprian.gerstenberger, rueter.jack, trond.trosterud
Last updated: 2012-02-07T22:35:26+01:00