jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License
2.04k stars 157 forks source link

Segmentation fault in damerau_levenshtein_distance (32 bit) #120

Closed nijel closed 1 year ago

nijel commented 5 years ago

The damerau_levenshtein_distance function segfaults with long strings on 32-bit platform. This was originally reported at https://github.com/WeblateOrg/weblate/issues/2673 by @scarabeusiv, but the actual segfault happens in jellyfish:

abuild@bugaboo:~> python3
Python 3.7.2 (default, Dec 30 2018, 16:18:15) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from jellyfish import damerau_levenshtein_distance
>>> print(damerau_levenshtein_distance('a' * 200000, 'b' * 200000))
Segmentation fault (core dumped)

Also if I use the C extension instead it reports memory error:

>>> from jellyfish.cjellyfish import damerau_levenshtein_distance
>>> print(damerau_levenshtein_distance('a' * 200000, 'b' * 200000))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
nijel commented 5 years ago

The memory error is expected in this case as it is allocating (roughly) 200000 * 200000 * 4 bytes. I can not reproduce the segfault on my (Debian) system. @scarabeusiv maybe there is something wrong in SUSE Python build?

I don't see anything problematic in related code:

https://github.com/jamesturk/cjellyfish/blob/45d8e3cfdbb7694db55bb105c8a86ec45185181d/jellyfishmodule.c#L107-L127

https://github.com/jamesturk/cjellyfish/blob/45d8e3cfdbb7694db55bb105c8a86ec45185181d/damerau_levenshtein.c#L26-L30

scarabeusiv commented 5 years ago

Looking at the code it indeed looks fine. The traceback for the crash:

Program received signal SIGSEGV, Segmentation fault.
damerau_levenshtein_distance (s1=0xf74d5010 L'a' <repeats 200 times>..., s2=0xf7411010 L'b' <repeats 200 times>..., len1=200000, len2=200000) at cjellyfish/damerau_levenshtein.c:35
35  cjellyfish/damerau_levenshtein.c: No such file or directory.
(gdb) bt
#0  damerau_levenshtein_distance (s1=0xf74d5010 L'a' <repeats 200 times>..., s2=0xf7411010 L'b' <repeats 200 times>..., len1=200000, len2=200000) at cjellyfish/damerau_levenshtein.c:35
#1  0xf77037cb in jellyfish_damerau_levenshtein_distance (self=0xf77acacc, args=0xf779cdac) at cjellyfish/jellyfishmodule.c:153
#2  0xf7d19ef5 in _PyMethodDef_RawFastCallKeywords (method=0xf770a160 <jellyfish_methods+64>, self=0xf77acacc, args=0xf77dd478, nargs=2, kwnames=0x0) at Objects/call.c:694
#3  0xf7d19da6 in _PyCFunction_FastCallKeywords (func=0xf77accfc, args=0xf77dd478, nargs=2, kwnames=0x0) at Objects/call.c:730
#4  0xf7daffef in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at Python/ceval.c:4568
#5  _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3124
#6  0xf7daad9c in _PyEval_EvalCodeWithName (_co=<optimized out>, globals=0xf7867734, locals=0xf7867734, args=0x0, argcount=0, kwnames=0x0, kwargs=0x0, kwcount=<optimized out>, kwstep=2, defs=0x0, defcount=0, 
    kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:3930
#7  0xf7daabbd in PyEval_EvalCodeEx (_co=0xf77ecee8, globals=0xf7867734, locals=0xf7867734, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3959
#8  0xf7daab50 in PyEval_EvalCode (co=0xf77ecee8, globals=0xf7867734, locals=0xf7867734) at Python/ceval.c:524
#9  0xf7e3daad in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0xf7867734, locals=0xf7867734, flags=0xffffd8c8, arena=0xf7906c80) at Python/pythonrun.c:1035
#10 0xf7cdb8ad in PyRun_InteractiveOneObjectEx (fp=fp@entry=0xf7c2a580 <_IO_2_1_stdin_>, filename=filename@entry=0xf784a7c0, flags=flags@entry=0xffffd8c8) at Python/pythonrun.c:256
#11 0xf7cdb526 in PyRun_InteractiveLoopFlags (fp=0xf7c2a580 <_IO_2_1_stdin_>, filename_str=0xf7e959be "<stdin>", flags=0xffffd8c8) at Python/pythonrun.c:120
#12 0xf7cdb238 in PyRun_AnyFileExFlags (fp=0xf7f66000, filename=0xf7e959be "<stdin>", closeit=-134454456, flags=0xf7e452e8 <pymain_main+888>) at Python/pythonrun.c:78
#13 0x00000001 in ?? ()
#14 0xf7f66000 in ?? () from /usr/lib/libpython3.7m.so.1.0
#15 0xffffd8f4 in ?? ()
#16 0x00000000 in ?? ()

@mcepl any ideas if we do something specific to 32bit on py3? I quickly checked and it does not seem so.

nijel commented 5 years ago

Looking again at the code, I think the problem might be in overflow in the (len1 + 2) * cols * sizeof(size_t) expression. The allocation then doesn't fail and further code is executed with smaller buffer than it expects leading to crash.

scarabeusiv commented 5 years ago

Valgrind output with >1GB allocation attempt:

>>> print(damerau_levenshtein_distance('a' * 200000, 'b' * 200000))
==4532== Warning: set address range perms: large range [0x5a18028, 0x46909438) (undefined)
==4532== Invalid write of size 4
==4532==    at 0x590A500: damerau_levenshtein_distance (damerau_levenshtein.c:36)
==4532==    by 0x59097CA: jellyfish_damerau_levenshtein_distance (jellyfishmodule.c:153)
==4532==    by 0x4915EF4: _PyMethodDef_RawFastCallKeywords (call.c:694)
==4532==    by 0x4915DA5: _PyCFunction_FastCallKeywords (call.c:730)
==4532==    by 0x49ABFEE: call_function (ceval.c:4568)
==4532==    by 0x49ABFEE: _PyEval_EvalFrameDefault (ceval.c:3124)
==4532==    by 0x49A6D9B: _PyEval_EvalCodeWithName (ceval.c:3930)
==4532==    by 0x49A6BBC: PyEval_EvalCodeEx (ceval.c:3959)
==4532==    by 0x49A6B4F: PyEval_EvalCode (ceval.c:524)
==4532==    by 0x4A39AAC: run_mod (pythonrun.c:1035)
==4532==    by 0x48D78AC: PyRun_InteractiveOneObjectEx (pythonrun.c:256)
==4532==    by 0x48D7525: PyRun_InteractiveLoopFlags (pythonrun.c:120)
==4532==    by 0x48D7237: PyRun_AnyFileExFlags.cold.10 (pythonrun.c:78)
==4532==  Address 0x4693a4bc is not stack'd, malloc'd or (recently) free'd
==4532== 
==4532== 
==4532== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==4532==  Access not within mapped region at address 0x4693A4BC
==4532==    at 0x590A500: damerau_levenshtein_distance (damerau_levenshtein.c:36)
==4532==    by 0x59097CA: jellyfish_damerau_levenshtein_distance (jellyfishmodule.c:153)
==4532==    by 0x4915EF4: _PyMethodDef_RawFastCallKeywords (call.c:694)
==4532==    by 0x4915DA5: _PyCFunction_FastCallKeywords (call.c:730)
==4532==    by 0x49ABFEE: call_function (ceval.c:4568)
==4532==    by 0x49ABFEE: _PyEval_EvalFrameDefault (ceval.c:3124)
==4532==    by 0x49A6D9B: _PyEval_EvalCodeWithName (ceval.c:3930)
==4532==    by 0x49A6BBC: PyEval_EvalCodeEx (ceval.c:3959)
==4532==    by 0x49A6B4F: PyEval_EvalCode (ceval.c:524)
==4532==    by 0x4A39AAC: run_mod (pythonrun.c:1035)
==4532==    by 0x48D78AC: PyRun_InteractiveOneObjectEx (pythonrun.c:256)
==4532==    by 0x48D7525: PyRun_InteractiveLoopFlags (pythonrun.c:120)
==4532==    by 0x48D7237: PyRun_AnyFileExFlags.cold.10 (pythonrun.c:78)
==4532==  If you believe this happened as a result of a stack
==4532==  overflow in your program's main thread (unlikely but
==4532==  possible), you can try to increase the size of the
==4532==  main thread stack using the --main-stacksize= flag.
==4532==  The main thread stack size used in this run was 8388608.
==4532== 
==4532== HEAP SUMMARY:
==4532==     in use at exit: 18,695,879 bytes in 1,024 blocks
==4532==   total heap usage: 13,188 allocs, 12,164 frees, 1,095,605,104 bytes allocated
==4532== 
==4532== LEAK SUMMARY:
==4532==    definitely lost: 37,008 bytes in 18 blocks
==4532==    indirectly lost: 0 bytes in 0 blocks
==4532==      possibly lost: 0 bytes in 0 blocks
==4532==    still reachable: 18,658,871 bytes in 1,006 blocks
==4532==         suppressed: 0 bytes in 0 blocks
==4532== Rerun with --leak-check=full to see details of leaked memory
==4532== 
==4532== For counts of detected and suppressed errors, rerun with: -v
==4532== Use --track-origins=yes to see where uninitialised values come from
==4532== ERROR SUMMARY: 558 errors from 73 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)