infoscout / weighted-levenshtein

Weighted Levenshtein library
MIT License

Array2D_init does not validate memory allocation #29

Open maxbachmann opened 2 years ago

maxbachmann commented 2 years ago

Array2D_init currently has the following implementation:

cdef inline void Array2D_init(
    Array2D* array2d,
    Py_ssize_t num_rows,
    Py_ssize_t num_cols) nogil:
    """
    Initializes an Array2D struct with the given number of rows and columns
    """
    array2d.num_rows = num_rows
    array2d.num_cols = num_cols
    array2d.mem = <DTYPE_t*> malloc(num_rows * num_cols * sizeof(DTYPE_t))

Problems with the implementation

1) It does not check whether malloc succeeded. If the allocation fails, array2d.mem is NULL and the first access to the array dereferences it. For example:

from weighted_levenshtein import dam_lev
s1="dkjnsdbjkadbjkalsask"*10000
s2="ksjdbhajhsjadksjaj"*10000
dam_lev(s1, s2)

This leads to a segmentation fault on my machine (note the access at address 0x0):

==169035== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==169035==  Access not within mapped region at address 0x0
==169035==    at 0x1347A76B: __pyx_f_20weighted_levenshtein_4clev_c_damerau_levenshtein (clev.c:4932)
==169035==    by 0x13493203: __pyx_pf_20weighted_levenshtein_4clev_damerau_levenshtein (clev.c:4810)
==169035==    by 0x13493203: __pyx_pw_20weighted_levenshtein_4clev_1damerau_levenshtein (clev.c:4537)
==169035==    by 0x48DDCD8: UnknownInlinedFun (abstract.h:118)
==169035==    by 0x48DDCD8: UnknownInlinedFun (abstract.h:127)
==169035==    by 0x48DDCD8: UnknownInlinedFun (ceval.c:5077)
==169035==    by 0x48DDCD8: _PyEval_EvalFrameDefault.cold (ceval.c:3520)
==169035==    by 0x4993471: UnknownInlinedFun (pycore_ceval.h:40)
==169035==    by 0x4993471: function_code_fastcall (call.c:330)
==169035==    by 0x48DEA16: UnknownInlinedFun (abstract.h:118)
==169035==    by 0x48DEA16: UnknownInlinedFun (abstract.h:127)
==169035==    by 0x48DEA16: UnknownInlinedFun (ceval.c:5077)
==169035==    by 0x48DEA16: _PyEval_EvalFrameDefault.cold (ceval.c:3489)
==169035==    by 0x498C55E: UnknownInlinedFun (pycore_ceval.h:40)
==169035==    by 0x498C55E: _PyEval_EvalCode (ceval.c:4329)
==169035==    by 0x49931C0: _PyFunction_Vectorcall (call.c:396)
==169035==    by 0x48DE8F6: UnknownInlinedFun (abstract.h:118)
==169035==    by 0x48DE8F6: UnknownInlinedFun (abstract.h:127)
==169035==    by 0x48DE8F6: UnknownInlinedFun (ceval.c:5077)
==169035==    by 0x48DE8F6: _PyEval_EvalFrameDefault.cold (ceval.c:3506)
==169035==    by 0x498C55E: UnknownInlinedFun (pycore_ceval.h:40)
==169035==    by 0x498C55E: _PyEval_EvalCode (ceval.c:4329)
==169035==    by 0x49931C0: _PyFunction_Vectorcall (call.c:396)
==169035==    by 0x498DE9A: UnknownInlinedFun (abstract.h:118)
==169035==    by 0x498DE9A: UnknownInlinedFun (abstract.h:127)
==169035==    by 0x498DE9A: call_function (ceval.c:5077)
==169035==    by 0x48DEE21: _PyEval_EvalFrameDefault.cold (ceval.c:3537)
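
To give a sense of scale for the reproducer above, the following sketch estimates how many bytes the call would ask malloc for. It rests on two assumptions that the issue does not state: that DTYPE_t is an 8-byte double, and that the matrix has len + 2 rows and columns. The names requested_bytes, ITEM_SIZE and PADDING are hypothetical and not part of the library.

```python
# Rough size of the allocation requested by the first reproducer.
# Assumptions (not stated in the issue): DTYPE_t is an 8-byte double and
# the matrix has (len(s1) + 2) rows and (len(s2) + 2) columns.
ITEM_SIZE = 8   # assumed sizeof(DTYPE_t)
PADDING = 2     # assumed extra rows/columns in the matrix


def requested_bytes(len1, len2, item_size=ITEM_SIZE, padding=PADDING):
    """Bytes that malloc would be asked for (Python ints cannot overflow)."""
    return (len1 + padding) * (len2 + padding) * item_size


s1_len = len("dkjnsdbjkadbjkalsask") * 10000   # 200000 characters
s2_len = len("ksjdbhajhsjadksjaj") * 10000     # 180000 characters
size = requested_bytes(s1_len, s2_len)
print(f"{size} bytes, about {size / 2**30:.0f} GiB")
```

Under these assumptions the request is in the hundreds of gigabytes, so a NULL return from malloc is the expected outcome on most machines and has to be handled.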

2) num_rows * num_cols * sizeof(DTYPE_t) can overflow. malloc then allocates a buffer that is smaller than the matrix, and later accesses go out of bounds. For example:

from weighted_levenshtein import dam_lev
a="a"*1518500248
b="b"*1518500248
dam_lev(a, b)

This reads out of bounds:

==30861== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==30861==  Access not within mapped region at address 0x2E71CED10
==30861==    at 0x1306842B: __pyx_f_20weighted_levenshtein_4clev_c_damerau_levenshtein (clev.c:3444)
==30861==    by 0x1308060D: __pyx_pf_20weighted_levenshtein_4clev_damerau_levenshtein (clev.c:3304)
==30861==    by 0x1308060D: __pyx_pw_20weighted_levenshtein_4clev_1damerau_levenshtein (clev.c:3112)
==30861==    by 0x498F8F0: cfunction_call (methodobject.c:543)
==30861==    by 0x498B947: _PyObject_MakeTpCall (call.c:215)
==30861==    by 0x4988535: UnknownInlinedFun (abstract.h:112)
==30861==    by 0x4988535: UnknownInlinedFun (abstract.h:99)
==30861==    by 0x4988535: UnknownInlinedFun (abstract.h:123)
==30861==    by 0x4988535: UnknownInlinedFun (ceval.c:5891)
==30861==    by 0x4988535: _PyEval_EvalFrameDefault (ceval.c:4213)
==30861==    by 0x4982092: UnknownInlinedFun (pycore_ceval.h:46)
==30861==    by 0x4982092: _PyEval_Vector (ceval.c:5065)
==30861==    by 0x49FDE83: PyEval_EvalCode (ceval.c:1134)
==30861==    by 0x4A2F2B2: run_eval_code_obj (pythonrun.c:1291)
==30861==    by 0x4A2A7D9: run_mod (pythonrun.c:1312)
==30861==    by 0x48FD1CF: pyrun_file.cold (pythonrun.c:1208)
==30861==    by 0x4A24AD8: _PyRun_SimpleFileObject (pythonrun.c:456)
==30861==    by 0x4A24897: _PyRun_AnyFileObject (pythonrun.c:90)
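
The overflow in the second reproducer can be modelled in plain Python. The sketch below simulates unsigned 64-bit wraparound of the size expression and shows one possible guard. It assumes a 64-bit size_t, an 8-byte DTYPE_t and a matrix of (len + 2) x (len + 2) cells, none of which the issue states. wrapped_size and checked_size are hypothetical helpers, not library functions, and the model ignores that signed overflow in num_rows * num_cols is undefined behaviour in C.

```python
# Simulates size_t wraparound for the second reproducer and sketches a guard.
# Assumptions (not stated in the issue): 64-bit size_t, 8-byte DTYPE_t, and
# a matrix of (len + 2) x (len + 2) cells.
SIZE_MAX = 2**64 - 1   # assumed maximum value of size_t
ITEM_SIZE = 8          # assumed sizeof(DTYPE_t)


def wrapped_size(num_rows, num_cols, item_size=ITEM_SIZE):
    """Value that would reach malloc after unsigned 64-bit wraparound."""
    return (num_rows * num_cols * item_size) & SIZE_MAX


def checked_size(num_rows, num_cols, item_size=ITEM_SIZE):
    """Return the byte count, or raise if it does not fit in size_t."""
    if num_rows < 0 or num_cols < 0:
        raise ValueError("dimensions must be non-negative")
    # Divide instead of multiplying so the check itself cannot overflow
    # when it is ported to fixed-width C integers.
    if num_cols != 0 and num_rows > SIZE_MAX // item_size // num_cols:
        raise OverflowError("num_rows * num_cols * item_size exceeds SIZE_MAX")
    return num_rows * num_cols * item_size


n = 1518500248 + 2                 # assumed matrix dimension
true_size = n * n * ITEM_SIZE      # exact size, larger than SIZE_MAX
print("needed: ", true_size)
print("wrapped:", wrapped_size(n, n))
```

Under these assumptions the wrapped request is a few hundred megabytes instead of roughly 18 exabytes, which would explain why malloc succeeds here and the crash only happens later on an out-of-bounds access. A fix in Array2D_init would need both this kind of size check and a NULL check on the malloc result, with the error reported to the caller.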