Bugzilla Bug 1255

Date: 2012-01-18T13:26:35+01:00
From: Jack Rueter
To: Sjur Nørstebø Moshagen
CC: ciprian.gerstenberger, rueter.jack, trond.trosterud

Last updated: 2012-02-07T22:35:26+01:00

albbas commented 12 years ago

Comment 5615

Date: 2012-01-18 13:26:35 +0100 From: Jack Rueter <>

main/kt/kom/hfst/default-error-model.txt

Questions of weight and how to continue this optimization.

The file currently has this line:

Transition pairs + weight:

@@ о ӧ 0.5

This should have a second transition pair:

О Ӧ "weight"

But how do I arrive at the "weight"?
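One possible starting point, which the thread itself does not confirm, is to copy each lower-case pair to its upper-case counterpart and reuse the lower-case weight as a placeholder until a better value is known. The sketch below does that in Python; the function name `add_upper_case_pairs` and the dictionary layout are assumptions made for this example. Note that confusions can be case-specific (the А/Л case described below affects upper-case letters only), so a copied weight is only a first guess.

```python
def add_upper_case_pairs(pairs):
    """Return pairs plus an upper-case copy of each lower-case pair.

    The copied weight is a placeholder assumption, not a measured value.
    """
    result = dict(pairs)
    for (source, target), weight in pairs.items():
        upper = (source.upper(), target.upper())
        # Skip pairs that have no distinct upper-case form, and never
        # overwrite a weight that has already been set by hand.
        if upper != (source, target) and upper not in result:
            result[upper] = weight
    return result


lower = {("о", "ӧ"): 0.5}
print(add_upper_case_pairs(lower))
```

A hand-set entry for an upper-case pair takes precedence, so the copied weights can be replaced one by one as evidence from scanned texts accumulates.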

Scanning literature with an Abbyy 10 has shown that upper-case letters must be distinguished from lower-case letters.

А Л "weight"

The upper-case "А" has been rendered where the original contains upper-case "Л". The same problem does not occur with the lower-case letters "а" and "л". NB! Upper-case "А" (a vowel) is not likely to be followed by a vowel in words (1%).

ь ъ "weight"

There are three sets of consonants and consonant clusters preceding these two signs:

(a) only the hard sign "ъ" is acceptable, in native words (табъяс)
(b) only the soft sign "ь" is acceptable (вичьяс)
(c) both the hard sign "ъ" and the soft sign "ь" are acceptable (канъяс vs. каньяс)

Comment 5616

Date: 2012-01-18 14:46:52 +0100 From: Sjur Nørstebø Moshagen <>

The error model file is a simple text format file used as input to a python script that will produce a transducer to correct spelling errors.

By default, all transitions are given a weight of 1.0. If you give a specific transition another weight, suggestions generated by this transition (possibly among others) will be promoted if the weight is lower, or demoted if it is higher.
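As a rough illustration of the format described above, the following Python sketch reads lines consisting of a source character, a target character and a weight, and falls back to 1.0 for any pair that is not listed. It is not the actual script that builds the transducer. The names `parse_error_model` and `weight_for`, the whitespace-separated layout, and the choice to skip every line that does not have exactly three fields are assumptions made for this example.

```python
DEFAULT_WEIGHT = 1.0  # default stated in this thread for unlisted transitions


def parse_error_model(text):
    """Return {(source, target): weight} for lines with three fields.

    Assumed layout: source, target and weight separated by whitespace
    (tabs in the real file). Any other line is skipped.
    """
    weights = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        source, target, weight = fields
        try:
            weights[(source, target)] = float(weight)
        except ValueError:
            continue  # weight is not a number, e.g. a "weight" placeholder
    return weights


def weight_for(weights, source, target):
    """Look up a transition weight, falling back to the default."""
    return weights.get((source, target), DEFAULT_WEIGHT)


sample = "о\tӧ\t0.5\n"
model = parse_error_model(sample)
print(weight_for(model, "о", "ӧ"))  # listed pair
print(weight_for(model, "а", "л"))  # unlisted pair, default weight
```

With this reading, a pair listed with 0.5 is cheaper than the default, so suggestions reached through it would rank ahead of suggestions reached through unlisted substitutions.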

Comment 5618

Date: 2012-01-19 13:56:21 +0100 From: Sjur Nørstebø Moshagen <>

Could you add a word or two to this bug report illustrating the things you want to be able to correct but that the present speller does not yet correct?

Comment 5623

Date: 2012-01-20 12:55:42 +0100 From: Jack Rueter <>

(In reply to comment #2)

> Could you add a word or two to this bug report illustrating the things you want to be able to correct but that the present speller does not yet correct?

In texts and materials scanned and recognized with an Abbyy 10, I have run into problems with initial upper-case Cyrillic L "Л" being recognized as upper-case Cyrillic A "А". This is a 1-to-1 error: "Лена" (Lena) is falsely recognized as "Аена" (Aena).

A second problem is 2-to-1: in Komi the sequence of soft sign "ь" and "і" never occurs. More than likely the recognition should have been the single letter "ы".

I noticed that the individual letters 1-to-1 are separated by tabs, but when I attempted to indicate a 2-to-1 relation I failed to notice any improvement in the recognition.

An impossible combination in Komi: "вьілӧ" should read "вылӧ".

Comment 5668

Date: 2012-01-27 11:17:06 +0100 From: Sjur Nørstebø Moshagen <>

(In reply to comment #3)

> A second problem is 2-to-1: in Komi the sequence of soft sign "ь" and "і" never occurs. More than likely the recognition should have been the single letter "ы".

> I noticed that the individual letters 1-to-1 are separated by tabs, but when I attempted to indicate a 2-to-1 relation I failed to notice any improvement in the recognition.

It is not possible in this quite simple error model file to specify 2-to-1 or many-to-1 corrections, only 1-to-1 changes. A more complex error model formalism will have to be added as we develop the HFST speller infrastructure.
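To make the limitation concrete, here is a small Python sketch of what a purely 1-to-1 model can generate. It is an illustration only, not how the HFST transducer works internally; the function name `one_to_one_candidates` and the weight 0.5 are hypothetical.

```python
def one_to_one_candidates(word, pairs):
    """Yield (candidate, weight) for every single-character substitution.

    pairs maps (wrong, right) characters to a weight. Each candidate has
    exactly the same length as the input word.
    """
    for index, char in enumerate(word):
        for (wrong, right), weight in pairs.items():
            if char == wrong:
                yield word[:index] + right + word[index + 1:], weight


pairs = {("А", "Л"): 0.5}  # hypothetical weight
print(list(one_to_one_candidates("Аена", pairs)))

# "вьілӧ" is one character longer than "вылӧ", so no sequence of
# 1-to-1 substitutions can turn the first into the second.
print(len("вьілӧ") - len("вылӧ"))
```

Because every substitution preserves word length, "Аена" can be corrected to "Лена", while a correction such as "ьі" to "ы" needs a formalism that can map two characters to one.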

Comment 5761

Date: 2012-02-07 22:35:26 +0100 From: Sjur Nørstebø Moshagen <>

I hope all questions have been answered, so I am closing this bug. What remains to be done, of course, is to add documentation on how to make working error models for HFST spellers.

giellalt / lang-kpv

Question of how to extend the default-error-model ( #3
