jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License
2.04k stars 157 forks

v1.0 Plan: move backend to rust-jellyfish #177

Closed jamesturk closed 1 year ago

jamesturk commented 1 year ago

It's time to leave C behind.

jamesturk commented 1 year ago

After the latest round of improvements to the Rust versions (mostly switching to SmallVec), most Rust versions take around 1.5-2x as long as the C versions. Damerau is the exception at 4-5x, with the use of HashMap the likely culprit. (The C version used a custom Trie.)

This is fine for 0.11, since the safety/unicode tradeoff here is huge and it's still a lot faster than Python. Will still probably explore a Trie to improve Damerau.
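To show where a per-character map enters the Damerau computation, here is a minimal Python sketch of the textbook unrestricted Damerau-Levenshtein distance. It is an illustration only, not the library's Rust or C code; the function name `damerau_levenshtein` and the `last_row` dict (standing in for the HashMap or Trie) are assumptions made for this example.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Textbook unrestricted Damerau-Levenshtein distance (illustrative sketch)."""
    # last_row maps each character of `a` to the last row where it was seen.
    # This lookup is the part a HashMap (or a Trie) would serve in native code.
    last_row = {}
    max_dist = len(a) + len(b)

    # The table carries one extra border row and column filled with max_dist.
    d = [[0] * (len(b) + 2) for _ in range(len(a) + 2)]
    d[0][0] = max_dist
    for i in range(len(a) + 1):
        d[i + 1][0] = max_dist
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[0][j + 1] = max_dist
        d[1][j + 1] = j

    for i in range(1, len(a) + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)  # one map lookup per table cell
            l = last_col
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                            # substitution
                d[i + 1][j] + 1,                           # insertion
                d[i][j + 1] + 1,                           # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1),   # transposition
            )
        last_row[a[i - 1]] = i

    return d[len(a) + 1][len(b) + 1]
```

Because the map is consulted once per cell of the table, the cost of each lookup is multiplied by the product of the two string lengths, which is one plausible reason the choice of map structure matters here.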

jamesturk commented 1 year ago

Going to let 0.11 sit for a while to shake out any packaging issues. Once there's been a chance for people to test, I'll release 1.0.

maxbachmann commented 1 year ago

Originally the library used the C implementation if available and fell back to a pure Python version if it was not. This had the advantage that on platforms where no wheels were available and no C compiler was installed, it would still work (albeit with lower performance). To my understanding, this behavior was dropped completely in version 0.11.0. On these platforms it is even less likely that a Rust compiler is preinstalled.

An example of this is:

podman run -it alpine
>>> apk add --update --no-cache python3 && ln -sf python3 /usr/bin/python
>>> python3 -m ensurepip
# fails since it can not build the package
>>> python3 -m pip install jellyfish
# installs the pure Python fallback
>>> python3 -m pip install jellyfish==0.10.0

Is this an oversight, or are breaking changes like this considered acceptable in minor versions of jellyfish, meaning people should pin minor package versions?

jamesturk commented 1 year ago

The plan is to produce wheels for all major platforms. I currently do plan to remove the Python implementations, but might reconsider if there are platforms for which it is hard to provide prebuilt binaries.

I just pushed 0.11.2 which has a small speedup as well as changes to the build process that should fix installation on alpine.

jamesturk commented 1 year ago

For anyone interested, given issues like #184 I think I'll restore the automatic fallback option for now.
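The automatic fallback discussed in this thread can be sketched roughly as follows. This is a minimal illustration of the general pattern, not jellyfish's actual import code: the module name `hypothetical_native_backend` and the helper `load_levenshtein` are made up for the example.

```python
import importlib


def _levenshtein_pure(a: str, b: str) -> int:
    """Pure Python Levenshtein distance, used when no native backend loads."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def load_levenshtein(native_modules=("hypothetical_native_backend",)):
    """Return (function, backend_name), preferring a compiled backend.

    The module names are placeholders (an assumption for this sketch).
    """
    for name in native_modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # no wheel and no compiler: try the next candidate
        func = getattr(module, "levenshtein_distance", None)
        if callable(func):
            return func, name
    return _levenshtein_pure, "python"


levenshtein_distance, backend = load_levenshtein()
```

On a platform without a prebuilt wheel or a compiler, the import fails and `backend` ends up as `"python"`, so installation still yields a working (slower) function.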