thierry-martinez / pyml

OCaml bindings for Python
BSD 2-Clause "Simplified" License

Segfault in OCaml gc #103

Open nathanfarlow opened 5 days ago

nathanfarlow commented 5 days ago

The following code segfaults on my system.

Python 3.11.4, OCaml 5.1.1, pyml 20231101

let () =
  Py.initialize ();
  let m =
    Py.Import.exec_code_module_from_string
      ~name:"go.py"
      "import numpy as np\na = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)"
  in
  let a = Py.Module.get m "a" |> Numpy.to_bigarray Float32 C_layout in
  Owl.Dense.Ndarray.S.print a;
  Gc.full_major ()
;;

It dies in the OCaml garbage collector:

(gdb) bt
#0  0x0000555555f84058 in custom_finalize_minor (domain=0x55555628bab0) at runtime/minor_gc.c:694
#1  caml_stw_empty_minor_heap_no_major_slice (domain=0x55555628bab0, participating_count=1, participating=<optimized out>, 
    unused=<optimized out>) at runtime/minor_gc.c:743
#2  0x0000555555f6fb19 in caml_try_run_on_all_domains_with_spin_work (sync=sync@entry=1, 
    handler=handler@entry=0x555555f840a0 <caml_stw_empty_minor_heap>, data=data@entry=0x0, 
    leader_setup=leader_setup@entry=0x555555f82e40 <caml_empty_minor_heap_setup>, 
    enter_spin_callback=enter_spin_callback@entry=0x555555f82ff0 <caml_do_opportunistic_major_slice>, enter_spin_data=enter_spin_data@entry=0x0)
    at runtime/domain.c:1483
#3  0x0000555555f841b2 in caml_try_stw_empty_minor_heap_on_all_domains () at runtime/minor_gc.c:799
#4  caml_empty_minor_heaps_once () at runtime/minor_gc.c:820
#5  0x0000555555f76058 in gc_full_major_exn () at runtime/gc_ctrl.c:269
#6  0x0000555555f76b0b in caml_gc_full_major (v=<optimized out>) at runtime/gc_ctrl.c:283
#7  <signal handler called>
#8  0x00005555557e96d0 in camlDune__exe__Main.entry () at bin/main.ml:15
#9  0x00005555557db0db in caml_program ()
#10 <signal handler called>
#11 0x0000555555f8ecfd in caml_startup_common (pooling=<optimized out>, argv=0x7fffffffd6a8) at runtime/startup_nat.c:132
#12 caml_startup_common (argv=0x7fffffffd6a8, pooling=<optimized out>) at runtime/startup_nat.c:88
#13 0x0000555555f8ed6f in caml_startup_exn (argv=<optimized out>) at runtime/startup_nat.c:139
#14 caml_startup (argv=<optimized out>) at runtime/startup_nat.c:144
#15 caml_main (argv=<optimized out>) at runtime/startup_nat.c:151
#16 0x00005555557da642 in main (argc=<optimized out>, argv=<optimized out>) at runtime/main.c:37

It might be related to this issue given that both code examples use Numpy.to_bigarray.

nathanfarlow commented 23 hours ago

The problem is that numpy_finalize is called more than once with the same ops pointer, so that pointer is freed multiple times. For example,

open! Core

let () =
  Py.initialize ();
  let m =
    Py.Import.exec_code_module_from_string
      ~name:"go.py"
      "import numpy as np\na = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)"
  in
  let big_array = Py.Module.get m "a" |> Numpy.to_bigarray Float32 C_layout in
  let arr = List.init 10 ~f:(Fn.const big_array) in
  List.iter arr ~f:Owl.Dense.Ndarray.S.print;
  Gc.full_major ()
;;

will cause numpy_finalize to be called 11 times, each with a different v, but Custom_ops_val(v) is the same across calls.

nathanfarlow commented 23 hours ago

Aha, in bigarray.c, certain operations copy the Custom_ops_val of the source array into the array they return. One example is caml_ba_slice.

#98 mentions that slicing is buggy, which makes sense.
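To make the failure mode concrete, here is a minimal pure-C model of it. It is only a sketch: `shared_ops`, `block`, `make_view`, `model_finalize` and `count_frees` are hypothetical names invented for illustration, not pyml or OCaml runtime code, and the model counts finalizer calls instead of actually calling `free`, so it is safe to run.

```c
#include <assert.h>

/* Hypothetical stand-in for the malloc'd struct that Custom_ops_val(v) points to. */
struct shared_ops {
    int finalize_calls; /* how many finalizers have run against this struct */
};

/* Hypothetical stand-in for an OCaml custom block (a bigarray or a view of one). */
struct block {
    struct shared_ops *ops;
};

/* Mimics a view-creating operation such as slicing: the new block gets a
   copy of the ops pointer, not a copy of the struct it points to. */
struct block make_view(const struct block *src) {
    struct block view = { src->ops };
    return view;
}

/* Mimics the finalizer. A real finalizer would free b->ops here; the model
   only counts the calls so that it stays safe to run. */
void model_finalize(struct block *b) {
    b->ops->finalize_calls++;
}

/* Finalizes n_views views plus the original and returns how many times the
   single shared ops struct would have been freed. */
int count_frees(int n_views) {
    struct shared_ops ops = { 0 };
    struct block original = { &ops };
    assert(n_views >= 0);
    for (int i = 0; i < n_views; i++) {
        struct block view = make_view(&original);
        model_finalize(&view);
    }
    model_finalize(&original);
    return ops.finalize_calls;
}
```

In this model `count_frees(10)` returns 11: every block that carries the copied pointer runs the finalizer once, so a finalizer that frees the struct would free it 11 times.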

nathanfarlow commented 21 hours ago

I'm not sure how to fix this. It seems like we need to do some reference counting in the finalizer, unless we know that the subarrays are already dead when the original array is finalized. From skimming bigarray.c, I couldn't confirm that, since the subarrays don't appear to hold a reference to the original array. In the refcounting case, we could decrement in the finalizer, but I'm not sure where we'd increment the refcount without modifying bigarray.c.
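For reference, the decrement side of the refcounting idea could look roughly like the pure-C sketch below. All names (`counted_ops`, `counted_ops_retain`, `counted_ops_release`, `count_frees_refcounted`) are hypothetical and nothing here touches the OCaml runtime. The sketch simply assumes that something calls the retain function once per subarray, which is exactly the part that is still unresolved.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted wrapper around the shared ops struct. */
struct counted_ops {
    int refcount; /* number of live blocks that point at this struct */
};

/* Allocates the struct with one reference, owned by the original array. */
struct counted_ops *counted_ops_create(void) {
    struct counted_ops *ops = malloc(sizeof *ops);
    if (ops != NULL) {
        ops->refcount = 1;
    }
    return ops;
}

/* ASSUMPTION: called once whenever a subarray copies the ops pointer.
   Where such a call could live without modifying bigarray.c is unknown. */
void counted_ops_retain(struct counted_ops *ops) {
    assert(ops->refcount > 0);
    ops->refcount++;
}

/* What each finalizer would do: drop one reference and free the struct
   only when the last reference goes away. Returns 1 if it freed. */
int counted_ops_release(struct counted_ops *ops) {
    assert(ops->refcount > 0);
    if (--ops->refcount == 0) {
        free(ops);
        return 1;
    }
    return 0;
}

/* Simulates one original plus n_views subarrays being finalized and
   returns how many times the struct was actually freed (-1 on OOM). */
int count_frees_refcounted(int n_views) {
    struct counted_ops *ops = counted_ops_create();
    int frees = 0;
    if (ops == NULL) {
        return -1;
    }
    for (int i = 0; i < n_views; i++) {
        counted_ops_retain(ops);
    }
    for (int i = 0; i < n_views + 1; i++) {
        frees += counted_ops_release(ops);
    }
    return frees;
}
```

With this scheme the struct is freed exactly once no matter in which order the finalizers run, but only if every copy of the pointer is matched by a retain.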