proycon / analiticcl

an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction
GNU General Public License v3.0

reading variants file #14

Closed HoekR closed 2 years ago

HoekR commented 2 years ago

I am trying to match person names with analiticcl (the Python binding). Reading a lexicon into a VariantModel works fine with:

import os
from analiticcl import VariantModel, Weights
xmodel = VariantModel(os.path.join(bdir, "examples/simple.alphabet.tsv"), Weights(), debug=False)
xmodel.read_lexicon('<directory>/1714.tsv')

where 1714.tsv contains a list of names like the ones below. Note that I removed all whitespace within names, as analiticcl does not like names with spaces:

cau
pesters
vanaylva
llaersma
lestevonon
rengers
lemckervanbreda
rouse
siccama
vanheeckerenvanbrantsenburg
vanalderweereld
vanhaeften
vanlyndenvanblitterswyk

This works for name searching.

However, I also have a number of variants (mostly OCR variations), which I put into a tab-separated file following the instructions in the analiticcl documentation:

vanbroeckhuysen     vanbroeckbuysen 6.0 wanbroeckhuysen 3.0 vanbroeckhbuysen    2.0 vanbroeckhusen  2.0
vanessen        essenius    41.0    johanvanessen   40.0    hendrickvanessen    14.0    wanessen    5.0
vanalphen       vanaphen    8.0 wanalphen   2.0 vanalpen    1.0 vanalpben   1.0
graswinckel     grawinckel  1.0
sloet       sloe    1.0

But analiticcl refuses to read it and fails with this error:

PanicException: Variant scores must be a floating point value (line 1 of <lexicon file>.tsv): ParseFloatError { kind: Invalid }

What is wrong with the format? Also, even though I can collapse names, this seems less than ideal, especially for shorter names.

proycon commented 2 years ago

Your format is OK, aside from the fact that the variant scores should be in the range 0.0 to 1.0, so you'll want to normalize them before feeding them to analiticcl. However, that's not the cause of the error. That was most likely due to a bug in analiticcl which I fixed in e7539413c5a633fdee53a7c9a4d65b42dca16024 (v0.3.2) last week. Is your version up to date? Try pip install -U analiticcl.
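A minimal sketch of one way to do that normalisation, assuming each score is divided by the largest score on its own line. The scaling scheme is an assumption (the only requirement stated here is that values end up between 0.0 and 1.0), and `normalize_variant_line` is a hypothetical helper, not part of analiticcl:

```python
def normalize_variant_line(line: str) -> str:
    """Rescale the scores on one weighted-variant line to the range 0.0-1.0.

    Expected layout (tab separated): word, then (variant, score) pairs.
    Assumption: each score is divided by the largest score on the line.
    """
    fields = line.rstrip("\n").split("\t")
    word, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError(f"expected variant/score pairs after {word!r}")
    variants = rest[0::2]
    scores = [float(s) for s in rest[1::2]]  # raises ValueError on a bad score
    top = max(scores, default=0.0)
    out = [word]
    for variant, score in zip(variants, scores):
        scaled = score / top if top > 0 else 0.0
        out.append(variant)
        out.append(f"{scaled:.4f}")
    return "\t".join(out)
```

Applying this to every line of the variant file and writing the result to a new file keeps the original counts around in case a different scaling turns out to work better.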

HoekR commented 2 years ago

My pip (on macOS) refuses to update and reports that analiticcl is up to date. A manual download of

analiticcl-0.3.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl

reports an invalid wheel for this platform

proycon commented 2 years ago

If it says up to date then you must be on v0.3.2 already? (try pip show analiticcl | grep Version). If that's the case and it still doesn't work then I need to do some further debugging, it could be that my fix was not sufficient. I'll check.
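The same check can be done from inside the Python environment that actually imports analiticcl, which rules out pip and python pointing at different installations. A small sketch using only the standard library (Python 3.8+); `installed_version` is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version seen by this interpreter, or None if not installed.
print(installed_version("analiticcl"))
```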

> analiticcl-0.3.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl
>
> reports an invalid wheel for this platform

Yeah, that wheel is for linux on x86_64 with glibc and python 3.8, that won't work on macOS. I haven't published any macOS-specific wheels (not sure how to do that even to be honest), but it should just compile from source there (requiring rustc and cargo). After all it seems it initially installed fine for you, right?
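For illustration, the platform a wheel targets can be read straight from its filename, which follows the pattern name-version[-build]-python-abi-platform.whl. The sketch below only parses the name; the helper names are hypothetical, and the macOS check is a rough heuristic rather than the full compatibility logic pip applies:

```python
def wheel_platform_tags(filename: str) -> list:
    """Return the platform tags encoded in a wheel filename.

    A compressed tag set joins several platform tags with dots, so the
    last dash-separated component is split on "." as well.
    """
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("too few components in wheel filename")
    return parts[-1].split(".")


def may_run_on_macos(filename: str) -> bool:
    """Heuristic: True if any platform tag is macOS-specific or 'any'."""
    return any(
        tag.startswith("macosx") or tag == "any"
        for tag in wheel_platform_tags(filename)
    )
```

For the wheel named above, the tags are manylinux_2_5_x86_64 and manylinux1_x86_64, so the heuristic returns False.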

proycon commented 2 years ago

I think your variant list may have one column too many (one extra tab after the first column), unless that was a copy-paste artifact. If I copy your variant list as you pasted it here I get:

$ analiticcl query --alphabet ~W/analiticcl/examples/simple.alphabet.tsv --variants variantlist2.tsv < test.txt
Initializing model...
Loading lexicons...
thread 'main' panicked at 'Variant scores must be a floating point value (line 1 of variantlist2.tsv, got vanbroeckbuysen), no frequency information: ParseFloatError { kind: Invalid }', src/lib.rs:557:62
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Note that the error message I get is a bit more verbose than yours, which suggests to me that you're still on 0.3.1 rather than 0.3.2. If I remove the extra column, it works as intended.
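A quick way to spot and remove such stray columns before loading the file, sketched in Python. It assumes the file is strictly tab separated and that an empty field is never intentional; both helper names are hypothetical:

```python
def empty_columns(line: str) -> list:
    """Return the 0-based indices of empty tab-separated fields in a line."""
    fields = line.rstrip("\n").split("\t")
    return [i for i, field in enumerate(fields) if not field.strip()]


def drop_empty_columns(line: str) -> str:
    """Return the line with empty tab-separated fields removed."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(field for field in fields if field.strip())
```

Running `empty_columns` over each line first shows which lines are affected, so the cleanup does not silently hide other formatting problems.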

HoekR commented 2 years ago

Yes, pip just refuses to update to version 0.3.2, saying:

pip install analiticcl==0.3.2
ERROR: Could not find a version that satisfies the requirement analiticcl==0.3.2 (from versions: 0.3.1)
ERROR: No matching distribution found for analiticcl==0.3.2

The only difference I can detect is the availability of an analiticcl-0.3.1.tar.gz source distribution at https://pypi.org/project/analiticcl/0.3.1/#files, while 0.3.2 only has Linux wheels.

proycon commented 2 years ago

Ha! That is a very good clue indeed! It seems I forgot to publish the source tarball to PyPI and only uploaded the wheels. It should be fixed now.

HoekR commented 2 years ago

yup, it works now:

Successfully uninstalled analiticcl-0.3.1
Successfully installed analiticcl-0.3.2