Levenshtein.opcodes corrupted size vs. prev_size on texts contains emoji

JB-doogls commented 3 months ago

Hi. Found some problems with memory allocation in Levenshtein.opcodes on two texts contained emoji. Tested on

Python 3.10.13
Levenshtein==0.25.1
rapidfuzz==3.9.0

Texts attached failed_string_first.txt failed_string_sec.txt

the following code failed with

corrupted size vs. prev_size
Aborted

import Levenshtein

f = open('failed_string_first.txt', 'r').read()
s = open('failed_string_sec.txt', 'r').read()
print(len(Levenshtein.opcodes(f, s)))

I didn't check the C sources, but think, that it may be some issue with pages size pre-allocated on start of the algo. Couple of simple hacks with input strings leads to avoid the problem

Simple test with removed emoji goes fine

import Levenshtein

f = open('failed_string_first.txt', 'r').read()
s = open('failed_string_sec.txt', 'r').read()

no_emoji_f = ''.join(c for c in f if ord(c) < 128000)
no_emoji_s = ''.join(c for c in s if ord(c) < 128000)
print(len(Levenshtein.opcodes(no_emoji_f, no_emoji_s)))

and also a turn with cutting second string with len of first

print(len(Levenshtein.opcodes(f, s[:len(f)])))

system

NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"

Here is the backtrace from code runs under gdb with debug python built with EXTRA_CFLAGS="-DPYMALLOC_DEBUG -DPy_REF_DEBUG"

Thread 1 "python-debug" received signal SIGABRT, Aborted.
0x00007ffff7afbce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffff7afbce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7ae5537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff7b3d3e8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ffff7b446da in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ffff7b4540c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007ffff7b45bfb in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007fffe47b49de in void rapidfuzz::detail::levenshtein_align<unsigned int*, unsigned int*>(rapidfuzz::Editops&, rapidfuzz::detail::Range<unsigned int*> const&, rapidfuzz::detail::Range<unsigned int*> const&, unsigned long, unsigned long, unsigned long, unsigned long) () from /opt/python3.10/lib/python3.10/site-packages/rapidfuzz/distance/metrics_cpp_avx2.cpython-310-x86_64-linux-gnu.so
#7  0x00007fffe4722418 in void rapidfuzz::detail::levenshtein_align_hirschberg<unsigned int*, unsigned int*>(rapidfuzz::Editops&, rapidfuzz::detail::Range<unsigned int*>, rapidfuzz::detail::Range<unsigned int*>, unsigned long, unsigned long, unsigned long, unsigned long) () from /opt/python3.10/lib/python3.10/site-packages/rapidfuzz/distance/metrics_cpp_avx2.cpython-310-x86_64-linux-gnu.so
#8  0x00007fffe47225d6 in void rapidfuzz::detail::levenshtein_align_hirschberg<unsigned int*, unsigned int*>(rapidfuzz::Editops&, rapidfuzz::detail::Range<unsigned int*>, rapidfuzz::detail::Range<unsigned int*>, unsigned long, unsigned long, unsigned long, unsigned long) () from /opt/python3.10/lib/python3.10/site-packages/rapidfuzz/distance/metrics_cpp_avx2.cpython-310-x86_64-linux-gnu.so
#9  0x00007fffe47225d6 in void rapidfuzz::detail::levenshtein_align_hirschberg<unsigned int*, unsigned int*>(rapidfuzz::Editops&, rapidfuzz::detail::Range<unsigned int*>, rapidfuzz::detail::Range<unsigned int*>, unsigned long, unsigned long, unsigned long, unsigned long) () from /opt/python3.10/lib/python3.10/site-packages/rapidfuzz/distance/metrics_cpp_avx2.cpython-310-x86_64-linux-gnu.so
#10 0x00007fffe4685da9 in auto visit<visitor<levenshtein_editops_func(_RF_String const&, _RF_String const&, unsigned long)::{lambda(auto:1, auto:2)#1}>(_RF_String const&, _RF_String const&, levenshtein_editops_func(_RF_String const&, _RF_String const&, unsigned long)::{lambda(auto:1, auto:2)#1}&&)::{lambda(auto:1)#1}>(_RF_String const&, levenshtein_editops_func(_RF_String const&, _RF_String const&, unsigned long)::{lambda(auto:1, auto:2)#1}&&) [clone .lto_priv.0] ()
   from /opt/python3.10/lib/python3.10/site-packages/rapidfuzz/distance/metrics_cpp_avx2.cpython-310-x86_64-linux-gnu.so
#11 0x00007fffe46a52db in __pyx_pw_9rapidfuzz_8distance_16metrics_cpp_avx2_11levenshtein_opcodes(_object*, _object* const*, long, _object*) [clone .lto_priv.0] ()
   from /opt/python3.10/lib/python3.10/site-packages/rapidfuzz/distance/metrics_cpp_avx2.cpython-310-x86_64-linux-gnu.so
#12 0x00007ffff7e00282 in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#13 0x00007ffff7dfc271 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#14 0x00007ffff7e00282 in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#15 0x00007ffff7dfc271 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#16 0x00007ffff7dfe27d in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#17 0x00007ffff7dfc271 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#18 0x00007ffff7dfd37b in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#19 0x00007ffff7dfbec4 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#20 0x00007ffff7dfd1ce in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#21 0x00007ffff7dfbec4 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#22 0x00007ffff7dfd1ce in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#23 0x00007ffff7dfbec4 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#24 0x00007ffff7dfd1ce in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#25 0x00007ffff7e1040e in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#26 0x00007ffff7cfe148 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#27 0x00007ffff7e1040e in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#28 0x00007ffff7cfe148 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#29 0x00007ffff7e1040e in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#30 0x00007ffff7cfe148 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#31 0x00007ffff7e1040e in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#32 0x00007ffff6b5e476 in ?? () from /opt/python3.10/lib/python3.10/lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so
#33 0x00007ffff7e03760 in _PyObject_MakeTpCall () from /opt/python3.10/lib/libpython3.10.so.1.0
#34 0x00007ffff7d8d7c5 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#35 0x00007ffff7e029f3 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#36 0x00007ffff7e01225 in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#37 0x00007ffff7dfbec4 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#38 0x00007ffff7dfd1ce in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#39 0x00007ffff7dfbec4 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#40 0x00007ffff7dfd1ce in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#41 0x00007ffff7dfbec4 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#42 0x00007ffff7dfd1ce in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#43 0x00007ffff7dfbec4 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#44 0x00007ffff7dfd1ce in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#45 0x00007ffff7dfc271 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#46 0x00007ffff7e0cfb5 in PyVectorcall_Call () from /opt/python3.10/lib/libpython3.10.so.1.0
#47 0x00007ffff7e8233c in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#48 0x00007ffff7e03760 in _PyObject_MakeTpCall () from /opt/python3.10/lib/libpython3.10.so.1.0
#49 0x00007ffff7dfd37b in _PyEval_EvalFrameDefault () from /opt/python3.10/lib/libpython3.10.so.1.0
#50 0x00007ffff7dfbec4 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#51 0x00007ffff7e552d0 in PyEval_EvalCode () from /opt/python3.10/lib/libpython3.10.so.1.0
#52 0x00007ffff7e7787d in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#53 0x00007ffff7e75276 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#54 0x00007ffff7e743a1 in ?? () from /opt/python3.10/lib/libpython3.10.so.1.0
#55 0x00007ffff7e7416a in _PyRun_SimpleFileObject () from /opt/python3.10/lib/libpython3.10.so.1.0
#56 0x00007ffff7e73f63 in _PyRun_AnyFileObject () from /opt/python3.10/lib/libpython3.10.so.1.0
#57 0x00007ffff7e7376e in Py_RunMain () from /opt/python3.10/lib/libpython3.10.so.1.0
#58 0x00007ffff7e49ee9 in Py_BytesMain () from /opt/python3.10/lib/libpython3.10.so.1.0
#59 0x00007ffff7ae6d0a in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#60 0x000055555555507a in _start ()

maxbachmann commented 3 months ago

I can reproduce the issue. It doesn't appear to be linked to the emojis specifically since I can reproduce the issue with only ASCII characters as well.

In debug builds there is actually a failing assert in the implementation for this dataset. It's not instantly obvious what exactly is going wrong. So I will need to have a deeper look at the algorithm.

maxbachmann commented 3 months ago

I did find and fix the bug in the C implementation: https://github.com/rapidfuzz/rapidfuzz-cpp/commit/cbdf84388cea0f12d8a02d9bac28d806b178302a

I will make a new release with the fix later today.

maxbachmann commented 3 months ago

Should be fixed by upgrading rapidfuzz to version 3.9.4

JB-doogls commented 2 months ago

Thank you!

rapidfuzz / python-Levenshtein

Levenshtein.opcodes corrupted size vs. prev_size on texts contains emoji #9