JB-doogls closed this issue 2 months ago
I can reproduce the issue. It doesn't appear to be linked to the emojis specifically since I can reproduce the issue with only ASCII characters as well.
In debug builds there is actually a failing assert in the implementation for this dataset. It's not immediately obvious what exactly is going wrong, so I will need to take a deeper look at the algorithm.
I found and fixed the bug in the C++ implementation: https://github.com/rapidfuzz/rapidfuzz-cpp/commit/cbdf84388cea0f12d8a02d9bac28d806b178302a
I will make a new release with the fix later today.
This should be fixed by upgrading rapidfuzz to version 3.9.4.
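To confirm that an environment already has the fixed release, a small version check can help. This is only a minimal sketch using the standard library: the helper names `parse_version`, `has_fix`, and `installed_has_fix` are made up for illustration, and the parser ignores pre-release suffixes, so treat the result as a hint rather than a guarantee.

```python
from importlib import metadata

# Release named in this thread as containing the fix.
FIXED_VERSION = (3, 9, 4)


def parse_version(text):
    """Turn '3.9.4' into (3, 9, 4).

    Only leading digits of each dotted part are read, so a suffix such as
    'rc1' is ignored (an assumption that keeps the sketch short).
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(version_text, fixed=FIXED_VERSION):
    """Return True if version_text is at least the fixed release."""
    return parse_version(version_text) >= fixed


def installed_has_fix(package="rapidfuzz"):
    """Check the installed package; None means it is not installed."""
    try:
        return has_fix(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("rapidfuzz has the fix:", installed_has_fix())
```

If this reports `False`, upgrading the package with pip to 3.9.4 or later should pick up the fix.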
Thank you!
Hi. I found a memory allocation problem in Levenshtein.opcodes when comparing two texts that contain emoji. Tested on
The texts are attached: failed_string_first.txt, failed_string_sec.txt
The following code failed with
I didn't check the C sources, but I think it may be an issue with the page size pre-allocated at the start of the algorithm. A couple of simple hacks on the input strings avoid the problem.
A simple test with the emoji removed runs fine
and so does a run with the second string cut to the length of the first
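The original snippets for these two hacks are not preserved here, so the following is only a sketch of what such preprocessing might look like. The helper names `strip_emoji` and `truncate_to` are hypothetical, and the emoji filter is a rough heuristic (it drops anything outside the Basic Multilingual Plane or in Unicode category "So"). Since the maintainer reproduced the crash with ASCII-only input, these hacks presumably just change the input enough to dodge the bug, and upgrading is the real fix.

```python
import unicodedata


def strip_emoji(text):
    """Remove emoji-like characters (rough heuristic, not a full emoji filter)."""
    return "".join(
        ch
        for ch in text
        if ord(ch) <= 0xFFFF and unicodedata.category(ch) != "So"
    )


def truncate_to(first, second):
    """Cut the second string down to the length of the first."""
    return second[: len(first)]


# Usage with rapidfuzz (assumes the package is installed and that
# first/second hold the contents of the attached text files):
#
#   from rapidfuzz.distance import Levenshtein
#   Levenshtein.opcodes(strip_emoji(first), strip_emoji(second))
#   Levenshtein.opcodes(first, truncate_to(first, second))
```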
system
Here is the backtrace from running the code under gdb with a debug Python built with EXTRA_CFLAGS="-DPYMALLOC_DEBUG -DPy_REF_DEBUG"