mideind / GreynirCorrect

Spelling and grammar correction for Icelandic

Single word / part of sentence correction #9

Open lumpidu opened 3 years ago

lumpidu commented 3 years ago

I want to use GreynirCorrect to correct text that is not a whole sentence, in the extreme case a single word. Which method or options should I use to make that possible?

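Currently, when using the tokenize() method with the option only_ci=True, it reports the following capitalization errors (the Icelandic description translates to "Word should start with an uppercase letter"):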

Maðurin      Z002     Orð á að byrja á hástaf: 'maðurin'
Maðurinn     Z002     Orð á að byrja á hástaf: 'maðurinn'

Sample code:

from reynir_correct import tokenize

texts = ["maðurin", "maðurinn" ]

for text in texts:
    # Use a separate name for the token so it does not shadow the input text
    for tok in tokenize(text, only_ci=True):
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")

vthorsteinsson commented 3 years ago

Interesting question, and this may well be a use case that we should support better. As is, the code is mostly oriented towards review of continuous text, typically whole sentences.

The code that checks the spelling of a single token is basically around this line. The call to spelling.Corrector.correct() can optionally be provided with a context, i.e. preceding tokens that will then be used to adjust the correction probabilities based on a trigram language model.

See also the short test function at the bottom of spelling.py.
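
To illustrate the idea of context-adjusted correction, here is a minimal, self-contained sketch of how a trigram language model can re-rank spelling candidates. It is not the GreynirCorrect implementation and does not call spelling.Corrector; the functions train_trigrams() and rank_candidates(), the toy corpus, the candidate priors, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

START = "<s>"  # padding symbol for the start of a sentence


def train_trigrams(sentences):
    """Count trigrams and their two-word contexts in a toy corpus."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        padded = [START, START] + words
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return trigrams, contexts, len(vocab)


def rank_candidates(candidates, context, model):
    """Rank candidate corrections, best first.

    candidates maps each candidate word to a prior probability
    (hypothetical values, e.g. derived from edit distance).
    context is the list of preceding words; it may be empty,
    which corresponds to correcting a single isolated word.
    """
    trigrams, contexts, vocab_size = model
    # Use the last two preceding words, padding with START if needed
    a, b = ([START, START] + [w.lower() for w in context])[-2:]

    def score(item):
        word, prior = item
        # Add-one smoothed trigram probability P(word | a, b)
        p = (trigrams[(a, b, word)] + 1) / (contexts[(a, b)] + vocab_size)
        return math.log(prior) + math.log(p)

    ranked = sorted(candidates.items(), key=score, reverse=True)
    return [word for word, _ in ranked]


model = train_trigrams(["maðurinn fór heim", "konan sá manninn", "hún sá manninn"])
candidates = {"maðurinn": 0.5, "manninn": 0.5}

with_context = rank_candidates(candidates, ["hún", "sá"], model)
without_context = rank_candidates(candidates, [], model)
print(with_context[0], without_context[0])
```

With equal priors, the preceding words decide the ranking. When no context is given, the model can only fall back on what it has seen at the start of a sentence, which is one reason isolated words are harder to handle than continuous text.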

lumpidu commented 3 years ago

At least the documentation of tokenize() doesn't state any assumptions about the text structure, in contrast to the documentation of check() and check_single(). And yes, this use case exists, e.g. for spell checking web input forms, where often only single words or short phrases are entered.
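
For the web form case, one possible stopgap is to post-filter the reported errors on the caller's side. The sketch below is only an illustration under stated assumptions: the Annotation record and filter_fragment_annotations() are hypothetical and not part of GreynirCorrect, and it assumes that a Z002 error on the first token of a fragment comes from the sentence-start capitalization rule rather than from a genuinely miscapitalized proper noun.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Error codes that are assumed to be meaningful only for whole sentences.
# Z002 is the capitalization error shown in the output above.
SENTENCE_START_CODES = {"Z002"}


@dataclass
class Annotation:
    """Hypothetical record built by the caller from each reported error."""
    index: int        # position of the token within the input fragment
    text: str         # token text as returned by the corrector
    code: str         # error code, e.g. "Z002"
    description: str  # human-readable error description


def filter_fragment_annotations(annotations: Iterable[Annotation]) -> List[Annotation]:
    """Drop sentence-start capitalization errors on the first token only.

    The same code on later tokens is kept, since it may flag a real
    problem such as a lowercased proper noun.
    """
    return [
        a for a in annotations
        if not (a.index == 0 and a.code in SENTENCE_START_CODES)
    ]


reported = [
    Annotation(0, "Maðurin", "Z002", "Orð á að byrja á hástaf: 'maðurin'"),
    Annotation(1, "Reykjavík", "Z002", "Orð á að byrja á hástaf: 'reykjavík'"),
    Annotation(2, "orð", "X999", "placeholder code used only for this example"),
]
kept = filter_fragment_annotations(reported)
print([(a.index, a.code) for a in kept])
```

This only hides the spurious message. The caller would still need to restore the original casing of the first token if the returned text was capitalized.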