Open pirolen opened 1 year ago
Yes, it should be possible to let analiticcl generate variants involving whitespace. It simply entails having such bigrams explicitly in your input lexicon (it need not be constrained to single words).
There's also a possibility if you use search mode, where you can load a language model. Though I'm not entirely sure how that would play out in such cases. It might still need an expanded lexicon.
There may be room for improvement in this area.
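A minimal sketch of the lexicon expansion described above, assuming a plain-text corpus is available and that the lexicon is a TSV file holding one entry and its frequency per line. The function names and the sample corpus are hypothetical and not part of analiticcl; the sketch only prepares lexicon text and does not call the library.

```python
from collections import Counter


def count_unigrams_and_bigrams(lines):
    """Count single words and adjacent word pairs (joined by one space)."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        counts.update(tokens)
        # Each bigram becomes its own lexicon entry, e.g. "holy grail".
        counts.update(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
    return counts


def format_lexicon(counts, min_freq=1):
    """Render entries as 'text<TAB>frequency' lines, most frequent first."""
    rows = [f"{text}\t{freq}" for text, freq in counts.most_common() if freq >= min_freq]
    return "\n".join(rows) + "\n"


# Hypothetical two-line corpus; replace with lines read from a real corpus file.
corpus = ["the holy grail", "a holy grail quest"]
lexicon_text = format_lexicon(count_unigrams_and_bigrams(corpus))
print(lexicon_text)
```

Saving `lexicon_text` to a file and passing that file via `--lexicon` would make entries such as `holy grail` available as candidate variants. Whether they rank well for an input like `holygrail` depends on the distance and frequency settings.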
Thanks!
I now simply used
analiticcl search --alphabet simple.alphabet.tsv --lexicon eng.aspell.lexicon --lm-order 3
(accepting standard input; enter text to search for variants, output may be delayed until end of input, enter an empty line to force output earlier)
holygrail
holygrail 0:9
Please don't hesitate to suggest meaningful parameter usage for my case.
My primary problem is that analiticcl does not generate anagrams from the alphabet file and lexicon I sent you (historical Slavonic) so I cannot seem to use it in any mode :-(
Short update: I also tried to run analiticcl in a Colab notebook.
model.build() prints no output there at all to stdout, unlike in the tutorial.
Calling it from the command line on Ubuntu prints
Computing anagram values for all items in the lexicon...
- Found 99999 instances
Adding all instances to the index...
- Found 1 anagrams
Creating sorted secondary index...
Sorting secondary index...
- Found 1 anagrams of length 1
Constructing Language Model...
- No language model provided
And no matter what word I query with find_variants, the result is always the same 2 lines, returning the first two items in the lexicon file :-o
{'text': 'и', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'lexicons': ['/home/pirol/quanti/devel/analiti/01_jan_car_lexfile_plain.tsv']}
{'text': 'не', 'score': 1.0, 'dist_score': 1.0, 'freq_score': 1.0, 'lexicons': ['/home/pirol/quanti/devel/analiti/01_jan_car_lexfile_plain.tsv']}
The files are UTF-8. I am going to see if I can convert them to Unicode Normal Form C and whether that makes a difference.
I think by now I tested installations by both cargo and pip.
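The conversion to Unicode Normal Form C mentioned above can be sketched with Python's standard `unicodedata` module. The helper names and the file paths passed to `normalize_file` are hypothetical. Nothing in this thread establishes that normalization affects analiticcl's results, so this is only a way to rule it out; if applied, the same normalization would need to cover the alphabet file, the lexicon, and the query input.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def normalize_file(src, dst):
    """Rewrite a UTF-8 file in NFC; src and dst are hypothetical paths."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(to_nfc(line))


# U+0438 (и) followed by U+0306 (combining breve) composes to U+0439 (й).
decomposed = "\u0438\u0306"
composed = to_nfc(decomposed)
print(len(decomposed), len(composed))  # prints: 2 1
```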
Short update: I also tried to run analiticcl in a Colab notebook.
Can you share the notebook? (along with all input files). Then I can check if I can see what's happening.
Thanks very much! I have sent an invitation to your email address.
Got it, something's going wrong with the anagram computation based on the alphabet file. I'm investigating...
There was a serious bug in the multibyte handling that came to light thanks to your example. I'm doing a new analiticcl release tonight (v0.4.5) that will fix this.
Released now! (both on crates.io and pypi)
Example output from your test in the new situation:
$ analiticcl query --alphabet alphabet.true.tsv --lexicon 01_jan_car_lexfile_plain.lexicon.tsv
...
Querying the model...
(accepting standard input; enter input to match, one per line, output may be delayed until end of input due to parallellisation)
жизни
жизни жизни 1 жиѕни 1 жиꙁни 1 жизнї 1 ѡжизни 0.775 жизнїю 0.775 жиꙁни∙ 0.775 жизныи 0.75 изни 0.725
I also added a testinput mode which you can use to check if a particular input is covered by your alphabet:
$ analiticcl testinput --alphabet alphabet.true.tsv --lexicon 01_jan_car_lexfile_plain.lexicon.tsv
ꙗванi
OK: ꙗванi 9710701 [4, 23, 5, 28, 3]
blah
UNKNOWN: blah 50308609 [37, 37, 5, 37]
(The highest number in the array, 37, corresponds to an unknown character: all the non-Cyrillic ones in this case.) This may help in improving coverage of your alphabet.
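A rough offline approximation of such a coverage check, as a sketch only: it assumes the alphabet file is a TSV in which each line lists tab-separated, single-character entries, and it does not reproduce analiticcl's own indexing or anagram values. The function names and the three-line sample alphabet are hypothetical.

```python
def load_alphabet(lines):
    """Collect every tab-separated entry from the alphabet lines into a set."""
    alphabet = set()
    for line in lines:
        line = line.rstrip("\n")
        if line:
            alphabet.update(entry for entry in line.split("\t") if entry)
    return alphabet


def uncovered_chars(text, alphabet):
    """Return the characters of text that match no alphabet entry, in order."""
    return [ch for ch in text if ch not in alphabet]


# Hypothetical alphabet: Cyrillic а, в, н with their capitals.
sample_lines = ["\u0430\t\u0410\n", "\u0432\t\u0412\n", "\u043d\t\u041d\n"]
alphabet = load_alphabet(sample_lines)

# Cyrillic "ван" followed by a Latin "i", which the sample alphabet lacks.
print(uncovered_chars("\u0432\u0430\u043d" + "i", alphabet))  # prints: ['i']
```

In practice, `sample_lines` would be replaced by the lines of the real alphabet file opened with `encoding="utf-8"`.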
Fantastic, thank you so much! I'm excited to test it asap!
Awesome, both the module from pypi and the CLI version now work fine! I am going to explore the different modes.
https://github.com/proycon/analiticcl/issues/17#issuecomment-1438801583
Not sure if this is of interest, but if I now run the same CLI command with testinput and the same files, and copy-paste ꙗванi from this issue into the CLI, it is not accepted, although all of its letters are in my alphabet file :-(
UNKNOWN: ꙗванi 111299573 [4, 23, 35, 28, 3]
Also, if I copy-paste some tokens from the lexicon file opened in my VS Code editor into the VS Code terminal, I get surprises:
ихъ
UNKNOWN: ихъ 202193 [35, 16, 8]
хъ
OK: хъ 1357 [16, 8]
UNKNOWN: и 149 [35]
But 'и' is in the alphabet file, it can be searched for and is found.
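One way to investigate a mismatch like this is to compare the exact code points of the pasted token with those in the alphabet file, since visually identical letters can be different characters (for example Cyrillic и versus a Latin look-alike, or a composed versus a decomposed form). This sketch uses only Python's standard library; the `describe` helper is hypothetical, and the thread does not confirm that this is the cause here.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Replace the sample with the token pasted from the terminal or the alphabet file.
# The sample holds a Cyrillic и (U+0438) followed by a Latin i (U+0069).
for ch, code_point, name in describe("\u0438" + "i"):
    print(ch, code_point, name)
```

If the same visible letter reports different code points in the terminal input and in the alphabet file, that difference would explain an UNKNOWN result for a character that appears to be covered.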
Hi, I wonder if there is a way to have analiticcl generate variants that involve whitespace, i.e. in case of run-on errors, suggesting the split form.
Suppose that 'holygrail' is actually a run-on error after OCR; then I would like to be able to return a suggestion of 'holy grail'.
Is there a way to do it?
The other way round it works, i.e. for erroneous splits the concatenated forms are retrieved, e.g.