thierry-martinez / pyml

OCaml bindings for Python
BSD 2-Clause "Simplified" License

Segfault on Numpy.to_bigarray with 20210226 but not with 20200518 #64

Closed richardalligier closed 2 years ago

richardalligier commented 3 years ago

Hello,

On the Python side I have pythonmodule.py:

import numpy as np

def predict():
    res = np.arange(10,dtype=float)
    return res

and on the OCaml side I have:

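let () = Py.initialize ()

let () =
  let layout = Bigarray.C_layout in
  let elttype = Bigarray.Float64 in
  let pythonmodule = Py.import "pythonmodule" in
  (* Call predict() on the Python side and convert the returned Numpy array. *)
  let evalbigarray () =
    let res = Py.Module.get_function pythonmodule "predict" [||] in
    Numpy.to_bigarray elttype layout res
  in
  (* Repeat the conversion many times; the crash only shows up after several hundred calls. *)
  for _i = 0 to 10000 do
    ignore (evalbigarray ())
  done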

I get a segfault around the 500th iteration with pyml 20210226, but no segfault at all after simply running opam install pyml=20200518.

Best, Richard

Lupus commented 3 years ago

I'm also running into a segmentation fault with code that calls to_bigarray over multiple iterations...

Lupus commented 3 years ago

This is what Valgrind says:

==1681093== Invalid free() / delete / delete[] / realloc()
==1681093==    at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1681093==    by 0x28D764BD: caml_empty_minor_heap (minor_gc.c:413)
==1681093==    by 0x28D7692F: caml_gc_dispatch (minor_gc.c:492)
==1681093==    by 0x28D76A79: caml_alloc_small_dispatch (minor_gc.c:539)
==1681093==    by 0x28D77FA8: caml_alloc_small (alloc.c:68)
==1681093==    by 0x28D8B3AB: alloc_custom_gen (custom.c:50)
==1681093==    by 0x28D8B5DF: caml_alloc_custom_mem (custom.c:106)
==1681093==    by 0x28D8D991: caml_ba_alloc (bigarray.c:116)
==1681093==    by 0x28D531B5: bigarray_of_pyarray_wrapper (in /home/kolkhovskiy/algotrading/_build/default/python/ocaml.so)
==1681093==    by 0x28C080D4: camlNumpy__to_bigarray_420 (in /home/kolkhovskiy/algotrading/_build/default/python/ocaml.so)
==1681093==    by 0x282D9FD5: camlDune__exe__Ocaml__of_float_numpy_1959 (ocaml.re:82)
==1681093==    by 0x282D9AE0: camlDune__exe__Ocaml__request_of_self_1878 (ocaml.re:101)
==1681093==  Address 0x1aa861f0 is 0 bytes inside a block of size 72 free'd
==1681093==    at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1681093==    by 0x28D764BD: caml_empty_minor_heap (minor_gc.c:413)
==1681093==    by 0x28D7692F: caml_gc_dispatch (minor_gc.c:492)
==1681093==    by 0x28D76A79: caml_alloc_small_dispatch (minor_gc.c:539)
==1681093==    by 0x28D77FA8: caml_alloc_small (alloc.c:68)
==1681093==    by 0x28D8B3AB: alloc_custom_gen (custom.c:50)
==1681093==    by 0x28D8B5DF: caml_alloc_custom_mem (custom.c:106)
==1681093==    by 0x28D8D991: caml_ba_alloc (bigarray.c:116)
==1681093==    by 0x28D531B5: bigarray_of_pyarray_wrapper (in /home/kolkhovskiy/algotrading/_build/default/python/ocaml.so)
==1681093==    by 0x28C080D4: camlNumpy__to_bigarray_420 (in /home/kolkhovskiy/algotrading/_build/default/python/ocaml.so)
==1681093==    by 0x282D9FD5: camlDune__exe__Ocaml__of_float_numpy_1959 (ocaml.re:82)
==1681093==    by 0x282D9AE0: camlDune__exe__Ocaml__request_of_self_1878 (ocaml.re:101)
==1681093==  Block was alloc'd at
==1681093==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1681093==    by 0x28D531E0: bigarray_of_pyarray_wrapper (in /home/kolkhovskiy/algotrading/_build/default/python/ocaml.so)
==1681093==    by 0x28C080D4: camlNumpy__to_bigarray_420 (in /home/kolkhovskiy/algotrading/_build/default/python/ocaml.so)
==1681093==    by 0x282D9FD5: camlDune__exe__Ocaml__of_float_numpy_1959 (ocaml.re:82)
==1681093==    by 0x282D9AAD: camlDune__exe__Ocaml__request_of_self_1878 (ocaml.re:97)
==1681093==    by 0x282DA068: camlDune__exe__Ocaml__init_2371 (ocaml.re:156)
==1681093==    by 0x282DA19E: camlDune__exe__Ocaml__fun_3653 (ocaml.re:184)
==1681093==    by 0x28C172F2: camlPy__handle_errors_3584 (in /home/kolkhovskiy/algotrading/_build/default/python/ocaml.so)
==1681093==    by 0x28D91BFB: caml_start_program (in /home/kolkhovskiy/algotrading/_build/default/python/ocaml.so)
==1681093==    by 0x28D879F3: caml_callback_exn (callback.c:111)
==1681093==    by 0x28D87C6C: caml_callback (callback.c:165)
==1681093==    by 0x28D56421: pycall_callback (in /home/kolkhovskiy/algotrading/_build/default/python/ocaml.so)

I'm using Python 3.8.5, OCaml 4.12.0 and pyml 20210226.

Lupus commented 3 years ago

Interestingly enough it still segfaults after I do opam install pyml=20200518...

pkel commented 3 years ago

I can reproduce this bug with ocaml-variants.4.11.1+flambda and pyml 20210226:

==22208== Memcheck, a memory error detector
==22208== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==22208== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==22208== Command: _build/default/segfault.exe
==22208== 
==22208== Invalid write of size 8
==22208==    at 0x553B021: ??? (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x57CA4D: caml_empty_minor_heap (minor_gc.c:409)
==22208==    by 0x57CE7B: caml_gc_dispatch (minor_gc.c:475)
==22208==    by 0x57CFF4: caml_alloc_small_dispatch (minor_gc.c:531)
==22208==    by 0x57E5A0: caml_alloc_small (alloc.c:68)
==22208==    by 0x590E26: alloc_custom_gen (custom.c:49)
==22208==    by 0x59325B: caml_ba_alloc (bigarray.c:116)
==22208==    by 0x55D798: bigarray_of_pyarray_wrapper (in /home/patrik/devel/consensus-protocol-research/segfault/_build/default/segfault.exe)
==22208==    by 0x4E2C26: camlNumpy__to_bigarray_250 (in /home/patrik/devel/consensus-protocol-research/segfault/_build/default/segfault.exe)
==22208==    by 0x4E275F: camlDune__exe__Segfault__entry (segfault.ml:8)
==22208==    by 0x4E04B8: caml_program (in /home/patrik/devel/consensus-protocol-research/segfault/_build/default/segfault.exe)
==22208==    by 0x596ABF: caml_start_program (in /home/patrik/devel/consensus-protocol-research/segfault/_build/default/segfault.exe)
==22208==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==22208== 
==22208== Invalid free() / delete / delete[] / realloc()
==22208==    at 0x48430E4: free (vg_replace_malloc.c:755)
==22208==    by 0x5490F21: ??? (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x553B0E0: ??? (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x57CA4D: caml_empty_minor_heap (minor_gc.c:409)
==22208==    by 0x57CE7B: caml_gc_dispatch (minor_gc.c:475)
==22208==    by 0x57CFF4: caml_alloc_small_dispatch (minor_gc.c:531)
==22208==    by 0x57E5A0: caml_alloc_small (alloc.c:68)
==22208==    by 0x590E26: alloc_custom_gen (custom.c:49)
==22208==    by 0x59325B: caml_ba_alloc (bigarray.c:116)
==22208==    by 0x55D798: bigarray_of_pyarray_wrapper (in /home/patrik/devel/consensus-protocol-research/segfault/_build/default/segfault.exe)
==22208==    by 0x4E2C26: camlNumpy__to_bigarray_250 (in /home/patrik/devel/consensus-protocol-research/segfault/_build/default/segfault.exe)
==22208==    by 0x4E275F: camlDune__exe__Segfault__entry (segfault.ml:8)
==22208==  Address 0x12d10180 is 48 bytes inside a block of size 4,345 alloc'd
==22208==    at 0x484086F: malloc (vg_replace_malloc.c:380)
==22208==    by 0x548D91E: PyObject_Malloc (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x548EE25: PyUnicode_New (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x54BE55D: PyUnicode_Substring (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x54B3D62: ??? (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x549EB50: _PyEval_EvalFrameDefault (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x549D524: ??? (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x54AB28D: _PyFunction_Vectorcall (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x549E8EA: _PyEval_EvalFrameDefault (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x549D524: ??? (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x5519ED4: _PyEval_EvalCodeWithName (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x5519E6C: PyEval_EvalCodeEx (in /usr/lib64/libpython3.9.so.1.0)
==22208== 
==22208== Invalid free() / delete / delete[] / realloc()
==22208==    at 0x48430E4: free (vg_replace_malloc.c:755)
==22208==    by 0x5490F21: ??? (in /usr/lib64/libpython3.9.so.1.0)
==22208==    by 0x57CA4D: caml_empty_minor_heap (minor_gc.c:409)
==22208==    by 0x57CE7B: caml_gc_dispatch (minor_gc.c:475)
==22208==    by 0x57CFF4: caml_alloc_small_dispatch (minor_gc.c:531)
==22208==    by 0x57E5A0: caml_alloc_small (alloc.c:68)
==22208==    by 0x590E26: alloc_custom_gen (custom.c:49)
==22208==    by 0x59325B: caml_ba_alloc (bigarray.c:116)
==22208==    by 0x55D798: bigarray_of_pyarray_wrapper (in /home/patrik/devel/consensus-protocol-research/segfault/_build/default/segfault.exe)
==22208==    by 0x4E2C26: camlNumpy__to_bigarray_250 (in /home/patrik/devel/consensus-protocol-research/segfault/_build/default/segfault.exe)
==22208==    by 0x4E275F: camlDune__exe__Segfault__entry (segfault.ml:8)
==22208==    by 0x4E04B8: caml_program (in /home/patrik/devel/consensus-protocol-research/segfault/_build/default/segfault.exe)
==22208==  Address 0x14877a10 is in the Data segment of /usr/lib64/python3.9/site-packages/numpy/core/_multiarray_umath.cpython-39-x86_64-linux-gnu.so
==22208== 
==22208== 
==22208== HEAP SUMMARY:
==22208==     in use at exit: 18,692,068 bytes in 100,819 blocks
==22208==   total heap usage: 426,333 allocs, 325,516 frees, 70,141,136 bytes allocated
==22208== 
==22208== LEAK SUMMARY:
==22208==    definitely lost: 2,150 bytes in 22 blocks
==22208==    indirectly lost: 520 bytes in 8 blocks
==22208==      possibly lost: 4,484,877 bytes in 31,107 blocks
==22208==    still reachable: 14,204,521 bytes in 69,682 blocks
==22208==                       of which reachable via heuristic:
==22208==                         newarray           : 432 bytes in 27 blocks
==22208==         suppressed: 0 bytes in 0 blocks
==22208== Rerun with --leak-check=full to see details of leaked memory
==22208==
==22208== For lists of detected and suppressed errors, rerun with: -s
==22208== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

After opam install pyml=20200518 and recompilation, the program does not segfault any more.

pkel commented 3 years ago

Not sure whether it's useful, but in my application I get a different report: an invalid read instead of an invalid write/free.

==12390== Memcheck, a memory error detector
==12390== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==12390== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==12390== Command: /home/patrik/devel/consensus-protocol-research/_venv/bin/pytest gym/tests/test_specs.py
==12390== 
=========================================================== test session starts ============================================================
platform linux -- Python 3.9.6, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/patrik/devel/consensus-protocol-research/python/gym
plugins: forked-1.3.0
collected 4 items                                                                                                                          

gym/tests/test_specs.py ..==12390== Invalid read of size 8
==12390==    at 0x4A1F019: ??? (in /usr/lib64/libpython3.9.so.1.0)
==12390==    by 0x23942F9D: caml_empty_minor_heap (minor_gc.c:409)
==12390==    by 0x23943384: caml_gc_dispatch (minor_gc.c:475)
==12390==    by 0x23943501: caml_alloc_small_dispatch (minor_gc.c:531)
==12390==    by 0x2395BEB4: caml_call_gc (in /home/patrik/devel/consensus-protocol-research/python/gym/cpr_gym/bridge.so)
==12390==    by 0x238C611D: camlStdlib__list__find_1125 (list.ml:58)
==12390==    by 0x23844DC1: camlCpr_lib__Dag__anon_fn$5bdag$2eml$3a174$2c20$2d$2d192$5d_989 (dag.ml:176)
==12390==    by 0x238C3D77: camlStdlib__option__map_104 (option.ml:24)
==12390==    by 0x238C3C15: camlStdlib__seq__unfold_256 (seq.ml:83)
==12390==    by 0x238C3841: camlStdlib__seq__filter_map_104 (seq.ml:39)
==12390==    by 0x236278F8: camlDune__exe__Definitions__iter_374 (definitions.ml:145)
==12390==    by 0x23627438: camlDune__exe__Definitions__step_248 (definitions.ml:158)
==12390==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==12390== 
==12390== 
==12390== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==12390==  Access not within mapped region at address 0x8
==12390==    at 0x4A1F019: ??? (in /usr/lib64/libpython3.9.so.1.0)
==12390==    by 0x23942F9D: caml_empty_minor_heap (minor_gc.c:409)
==12390==    by 0x23943384: caml_gc_dispatch (minor_gc.c:475)
==12390==    by 0x23943501: caml_alloc_small_dispatch (minor_gc.c:531)
==12390==    by 0x2395BEB4: caml_call_gc (in /home/patrik/devel/consensus-protocol-research/python/gym/cpr_gym/bridge.so)
==12390==    by 0x238C611D: camlStdlib__list__find_1125 (list.ml:58)
==12390==    by 0x23844DC1: camlCpr_lib__Dag__anon_fn$5bdag$2eml$3a174$2c20$2d$2d192$5d_989 (dag.ml:176)
==12390==    by 0x238C3D77: camlStdlib__option__map_104 (option.ml:24)
==12390==    by 0x238C3C15: camlStdlib__seq__unfold_256 (seq.ml:83)
==12390==    by 0x238C3841: camlStdlib__seq__filter_map_104 (seq.ml:39)
==12390==    by 0x236278F8: camlDune__exe__Definitions__iter_374 (definitions.ml:145)
==12390==    by 0x23627438: camlDune__exe__Definitions__step_248 (definitions.ml:158)
==12390==  If you believe this happened as a result of a stack
==12390==  overflow in your program's main thread (unlikely but
==12390==  possible), you can try to increase the size of the
==12390==  main thread stack using the --main-stacksize= flag.
==12390==  The main thread stack size used in this run was 8388608.
==12390== 
==12390== Process terminating with default action of signal 11 (SIGSEGV)
==12390==  General Protection Fault
==12390==    at 0x4DACC82: __pthread_once_slow (in /usr/lib64/libpthread-2.33.so)
==12390==    by 0x4CFD03E: __rpc_thread_variables.part.0 (in /usr/lib64/libc-2.33.so)
==12390==    by 0x4D3F61C: free_mem (in /usr/lib64/libc-2.33.so)
==12390==    by 0x4D3F271: __libc_freeres (in /usr/lib64/libc-2.33.so)
==12390==    by 0x48351E7: _vgnU_freeres (vg_preloaded.c:74)
==12390== 
==12390== HEAP SUMMARY:
==12390==     in use at exit: 28,276,210 bytes in 172,014 blocks
==12390==   total heap usage: 859,857 allocs, 687,843 frees, 153,403,535 bytes allocated
==12390== 
==12390== LEAK SUMMARY:
==12390==    definitely lost: 2,296 bytes in 21 blocks
==12390==    indirectly lost: 520 bytes in 8 blocks
==12390==      possibly lost: 7,156,427 bytes in 58,974 blocks
==12390==    still reachable: 21,116,967 bytes in 113,011 blocks
==12390==                       of which reachable via heuristic:
==12390==                         newarray           : 512 bytes in 32 blocks
==12390==         suppressed: 0 bytes in 0 blocks
==12390== Rerun with --leak-check=full to see details of leaked memory
==12390== 
==12390== For lists of detected and suppressed errors, rerun with: -s
==12390== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)
[1]    12390 segmentation fault (core dumped)  valgrind pytest gym/tests/test_specs.py

thierry-martinez commented 2 years ago

Very sorry for the very (far too!) late answer... but this should be fixed now! Thank you very much for your report, which helped a lot with bisecting. The issue was more general than just to_bigarray: the Numpy array type object was stolen from Python by the OCaml GC when accessed. The fix should resolve other instabilities linked with Numpy as well.
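
For readers less familiar with reference counting, here is a toy model of the kind of failure described in the comment above: a garbage-collected wrapper releases a reference to a shared object that it never acquired, so the shared object's count drifts down to zero while other code still uses it. This is a plain-Python simulation written under that assumption, not pyml's or CPython's actual code. ToyObject, Wrapper, simulate and initial_refs are hypothetical names made up for illustration, and the numbers say nothing about why the original report crashed around the 500th iteration.

```python
class ToyObject:
    """Stand-in for an object with a manually managed reference count."""

    def __init__(self, name, refcount=1):
        self.name = name
        self.refcount = refcount
        self.freed = False

    def incref(self):
        self.refcount += 1

    def decref(self):
        self.refcount -= 1
        if self.refcount == 0:
            # A real runtime would deallocate the object at this point.
            self.freed = True


class Wrapper:
    """Stand-in for a foreign-GC handle that drops one reference when finalized."""

    def __init__(self, target, take_reference):
        self.target = target
        if take_reference:
            # Correct pattern: own the reference that finalize() will release.
            target.incref()

    def finalize(self):
        self.target.decref()


def simulate(iterations, take_reference, initial_refs=3):
    """Wrap and finalize one shared object repeatedly; return its final state."""
    shared = ToyObject("array type", refcount=initial_refs)
    for _ in range(iterations):
        if shared.freed:
            break  # any further access would be a use-after-free
        Wrapper(shared, take_reference).finalize()
    return shared


if __name__ == "__main__":
    ok = simulate(10, take_reference=True)
    bad = simulate(10, take_reference=False)
    print("owned reference:   ", ok.refcount, ok.freed)
    print("borrowed reference:", bad.refcount, bad.freed)
```

With take_reference=True the count returns to its starting value after every iteration. With take_reference=False each finalized wrapper removes a reference that belongs to someone else, which is the sense in which an object can be "stolen". The failure is delayed until enough wrappers have been collected, which fits a crash that shows up inside the GC after many successful calls.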