proycon / analiticcl

an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction
GNU General Public License v3.0

Question: Splitting runons #17

Open pirolen opened 1 year ago

pirolen commented 1 year ago

Hi, I wonder if there is a way to have analiticcl generate variants that involve a whitespace, i.e. in case of run-on errors, suggesting the split form.

Suppose that 'holygrail' is actually a run-on error after OCR; then I would like to get 'holy grail' returned as a suggestion.

Is there a way to do it?

The other way round it works, i.e. for erroneous splits the concatenated forms are retrieved, e.g.

bitter sweet bittersweet 0.7604166666666666 bittersweets 0.6666666666666667

proycon commented 1 year ago

Yes, it should be possible to let analiticcl generate variants involving a whitespace. It simply entails having such bigrams explicitly in your input lexicon (it need not be constrained to single words).
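
As a rough illustration of that suggestion, the sketch below counts adjacent word pairs in some sample text and emits them as extra lexicon lines. The tab-separated `text<TAB>frequency` layout, the `min_count` threshold and the function name `bigram_lexicon_lines` are assumptions made for this example, not something prescribed in this thread.

```python
from collections import Counter


def bigram_lexicon_lines(sentences, min_count=2):
    """Return 'word1 word2<TAB>count' lines for frequent adjacent word pairs.

    Assumption: the lexicon is a TSV file with the entry text in the first
    column and a frequency in the second.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair every token with its right-hand neighbour.
        counts.update(zip(tokens, tokens[1:]))
    return [
        f"{left} {right}\t{count}"
        for (left, right), count in sorted(counts.items())
        if count >= min_count
    ]


if __name__ == "__main__":
    sample = [
        "the quest for the holy grail",
        "the holy grail was never found",
    ]
    for line in bigram_lexicon_lines(sample):
        print(line)
```

The resulting lines could be appended to the single-word lexicon, so that an entry such as 'holy grail' becomes a candidate for the input 'holygrail'.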

There's also a possibility if you use search mode, where you can load a language model. Though I'm not entirely sure how that would play out in such cases. It might still need an expanded lexicon.

There may be room for improvement in this area.

pirolen commented 1 year ago

Thanks! I now simply used:

$ analiticcl search --alphabet simple.alphabet.tsv --lexicon eng.aspell.lexicon --lm-order 3

(accepting standard input; enter text to search for variants, output may be delayed until end of input, enter an empty line to force output earlier)
holygrail

holygrail       0:9

Please don't hesitate to suggest meaningful parameter usage for my case.

My primary problem is that analiticcl does not generate anagrams from the alphabet file and lexicon I sent you (historical Slavonic) so I cannot seem to use it in any mode :-(

pirolen commented 1 year ago

Short update: I also tried to run analiticcl in a Colab notebook. model.build() prints no output there at all to stdout, unlike in the tutorial. Calling it from the command line on Ubuntu prints

Computing anagram values for all items in the lexicon...
 - Found 99999 instances
Adding all instances to the index...
 - Found 1 anagrams
Creating sorted secondary index...
Sorting secondary index...
 - Found 1 anagrams of length 1
Constructing Language Model...
 - No language model provided

And no matter what word I query with find_variants, the result is always the same 2 lines, returning the first two items in the lexicon file :-o

{'text': 'и', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'lexicons': ['/home/pirol/quanti/devel/analiti/01_jan_car_lexfile_plain.tsv']}
{'text': 'не', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'lexicons': ['/home/pirol/quanti/devel/analiti/01_jan_car_lexfile_plain.tsv']}

The files are UTF-8. I am going to see if I can convert them to Unicode Normalization Form C (NFC) and whether that makes a difference.
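
Such a conversion can be done with the Python standard library alone. The following is a minimal sketch; the function names and the file paths passed to `normalize_file` are placeholders for this example, and it assumes the input files are valid UTF-8.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


def normalize_file(src_path, dst_path):
    """Copy a UTF-8 text file, normalizing every line to NFC."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(to_nfc(line))


if __name__ == "__main__":
    decomposed = "и\u0306"  # Cyrillic 'и' followed by a combining breve
    composed = to_nfc(decomposed)
    # The two code points are merged into the single precomposed letter 'й'.
    print(len(decomposed), len(composed))
```

Both the alphabet file and the lexicon would need the same normalization for their characters to match.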

I think by now I tested installations by both cargo and pip.

proycon commented 1 year ago

Short update: I also tried to run analiticcl in a Colab notebook.

Can you share the notebook? (along with all input files). Then I can check if I can see what's happening.

pirolen commented 1 year ago

Thanks very much! I have sent an invitation to your email address.

proycon commented 1 year ago

Got it, something's going wrong with the anagram computation based on the alphabet file. I'm investigating...

proycon commented 1 year ago

There was a serious bug in the multibyte handling that came to light thanks to your example. I'm doing a new analiticcl release tonight (v0.4.5) that will fix this.

proycon commented 1 year ago

Released now! (both on crates.io and pypi)

proycon commented 1 year ago

Example output from your test in the new situation:

$ analiticcl query --alphabet alphabet.true.tsv --lexicon 01_jan_car_lexfile_plain.lexicon.tsv      
...
Querying the model...
(accepting standard input; enter input to match, one per line, output may be delayed until end of input due to parallellisation)
жизни
жизни   жизни   1               жиѕни   1               жиꙁни   1               жизнї   1               ѡжизни  0.775           жизнїю  0.775           жиꙁни∙  0.775           жизныи  0.75            изни    0.725

proycon commented 1 year ago

I also added a testinput mode which you can use to check if a particular input is covered by your alphabet:

$ analiticcl testinput --alphabet alphabet.true.tsv --lexicon 01_jan_car_lexfile_plain.lexicon.tsv
ꙗванi
OK: ꙗванi       9710701 [4, 23, 5, 28, 3]
blah
UNKNOWN: blah   50308609        [37, 37, 5, 37]

(The highest number in the array (37) corresponds to an unknown character, all the non-Cyrillic ones in this case.) This may help in improving the coverage of your alphabet.
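
The figures in this output are consistent with an anagram value built as a product of primes, one prime per character index. The sketch below reproduces the value for the accepted example above. It is inferred from the numbers shown in this thread rather than taken from the analiticcl source, the function names are made up for the example, and it deliberately does not model how unknown characters are handled.

```python
def first_primes(n):
    """Return the first n prime numbers by trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def anagram_value(indices):
    """Multiply the primes selected by zero-based alphabet indices."""
    if not indices:
        return 1
    primes = first_primes(max(indices) + 1)
    value = 1
    for index in indices:
        value *= primes[index]
    return value


if __name__ == "__main__":
    # Index array printed by testinput for the accepted example above.
    print(anagram_value([4, 23, 5, 28, 3]))  # 9710701
```

Because multiplication is commutative, any reordering of the same characters yields the same value, which is what makes it usable as an anagram key.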

pirolen commented 1 year ago

Fantastic, thank you so much! I'm excited to test it asap!

pirolen commented 1 year ago

Awesome, both the module from pypi and the CLI version now work fine! I am going to explore the different modes.

pirolen commented 1 year ago

https://github.com/proycon/analiticcl/issues/17#issuecomment-1438801583

Not sure if this is of interest, but when I now run the same testinput command with the same files and copy-paste ꙗванi from this issue into the CLI, it is not accepted, although all of its letters are in my alphabet file :-(

UNKNOWN: ꙗванi 111299573 [4, 23, 35, 28, 3]

Also, if I copy-paste some tokens from the lexicon file opened in my VS Code editor into the VS Code terminal, I get surprises:

ихъ
UNKNOWN: ихъ 202193 [35, 16, 8]
хъ
OK: хъ 1357 [16, 8]
UNKNOWN: и 149 [35]

But 'и' is in the alphabet file, it can be searched for and is found.
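
One way to investigate this kind of mismatch is to list the code points of the pasted text, since visually identical letters can be different characters (for example Latin versus Cyrillic, or a base letter plus a combining mark). The helper below is a generic diagnostic using only the Python standard library; the function name `describe` is made up for this example, and whether this is the cause here is not established in the thread.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    # Inspect one of the tokens that was reported as UNKNOWN.
    for ch, code, name in describe("ихъ"):
        print(ch, code, name)
```

Running the same helper on the corresponding line of the alphabet file and comparing the code points would show whether the two strings really contain the same characters.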