rdicosmo / parmap

Parmap is a minimalistic library allowing to exploit multicore architecture for OCaml programs with minimal modifications.
http://rdicosmo.github.io/parmap/
Other
94 stars 20 forks source link

seg faults #96

Closed JuliaLawall closed 3 years ago

JuliaLawall commented 4 years ago

I have the impression that since some recent version of Parmap, if the underlying OCaml code raises an exception, then Parmap seg faults, which is not user friendly. Has something changed in Parmap recently that would cause this issue? I will try to make a test case shortly.

JuliaLawall commented 4 years ago

Here is a test program:

let _ =
  Parmap.parmap
    (function x ->
      failwith "should not crash")
    (Parmap.L [1;2;3;4;5])

I am using parmap from opam, 4.07.0

UnixJunkie commented 4 years ago

As a workaround, you can try this:

let robust_parmap ncores chunksize f l =
  Parmap.parmap ~ncores ~chunksize (function x ->
    try Some (f x)
    with _exn -> None (* honestly, you should properly log the exact exception *)
    ) (Parmap.L l)

There is a former MIT student who created an ocaml library just because he was not happy with the way parmap was dealing with exception-raising code... I don't think you can expect things to work fine if you have parallel code throwing an exception inside of the parallel section; hence the try-catch.

UnixJunkie commented 4 years ago

@Drup, who is not watching the project, has committed some changes in the code recently.

JuliaLawall commented 4 years ago

On Sun, 14 Jun 2020, Francois Berenger wrote:

As a workaround, you can try this:

let robust_parmap ncores chunksize f l = Parmap.parmap ~ncores ~chunksize (function x -> try Some (f x) with _exn -> None ( honestly, you should properly log the exact excepti on ) ) (Parmap.L l)

There is a former MIT student who created an ocaml library just because he was not happy with the way parmap was dealing with exception-raising code... I don't think you can expect things to work fine if you have parallel code throwing an exception inside of the parallel section; hence the try-catch.

I believe that it worked previously, where "worked" means didn't seg fault. To my recollection you could see the messages. I'll try to confirm this.

julia

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, orunsubscribe.[AAD2ZGWY3EFNNVK3BBZ5VNTRWV3FLA5CNFSM4N5KWLSKYY3PNVWWK3TUL52HS4 DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEZQH6BA.gif]

UnixJunkie commented 4 years ago

I also get the core dumped with parmap.

let _ =
  Parmap.parmap
    (function x ->
      failwith "should not crash")
    (Parmap.L [1;2;3;4;5]);;
[Parmap]: error at index j=0 in (0,0), chunksize=1 of a total of 1 got exception Failure("should not crash") on core 0 

[Parmap]: error at index j=0 in (1,1), chunksize=1 of a total of 1 got exception Failure("should not crash") on core 1 

[Parmap]: error at index j=0 in (2,2), chunksize=1 of a total of 1 got exception Failure("should not crash") on core 2 

[Parmap]: error at index j=0 in (3,3), chunksize=1 of a total of 1 got exception Failure("should not crash") on core 3 

[Parmap]: error at index j=0 in (4,4), chunksize=1 of a total of 1 got exception Failure("should not crash") on core 4 

Segmentation fault (core dumped)
UnixJunkie commented 4 years ago

Note that with Parany, the behavior would not be very better: https://github.com/UnixJunkie/parany/issues/17 (infinite loop)

JuliaLawall commented 4 years ago

On Sun, 14 Jun 2020, Francois Berenger wrote:

Note that with Parany, the behavior would not be very better: UnixJunkie/parany#17 (infinite loop)

On a machine with an old version of ocaml I appear to have the following version of parmap:

version = "1.0-rc7" name = "parmap" description = "Parallel map primitive for multicores" requires = "bigarray unix" archive(byte) = "parmap.cma" archive(native) = "parmap.cmxa" linkopts(byte) = "-cclib -lparmap_stubs"

This version gives:

[Parmap]: error at index j=0 in (0,0), chunksize=1 of a total of 1 got exception Failure("should not crash") on core 0

[Parmap]: error at index j=0 in (1,1), chunksize=1 of a total of 1 got exception Failure("should not crash") on core 1

[Parmap]: error at index j=0 in (2,2), chunksize=1 of a total of 1 got exception Failure("should not crash") on core 2

[Parmap]: error at index j=0 in (4,4), chunksize=1 of a total of 1 got exception Failure("should not crash") on core 4

[Parmap]: error at index j=0 in (3,3), chunksize=1 of a total of 1 got exception Failure("should not crash") on core 3

Fatal error: exception Failure("input_value_from_block: bad object") Raised by primitive operation at unknown location Called from unknown location Called from unknown location Called from file "partest.ml", line 2, characters 2-96


So the program still crashes, but one can see what the problem is immediately.

julia

UnixJunkie commented 4 years ago

The parmap maintainer should try to git bisect from this 1.0-rc7 version using your example program.

rdicosmo commented 4 years ago

Here is a test program:

let _ =
  Parmap.parmap
    (function x ->
      failwith "should not crash")
    (Parmap.L [1;2;3;4;5])

I am using parmap from opam, 4.07.0

Thanks Julia for providing this example: this indeed is not normal behaviour.

rdicosmo commented 3 years ago

After closer investigation, it seems that this change in behaviour comes from differences in the way recent versions of OCaml check the data for integrity when unmarshaling from a block. In previous versions one could find:

if (magic != Intext_magic_number) 
    failwith("input_value_from_block: bad object");

that led to the exception captured at OCaml level.

More recent versions of the compiler seem to skip at least this part of the test, which may explain the SIGSEGV.

Nonetheless, the right way to fix this is to inform the master process that a failure occurred, and prevent any attempt to unmarshal data that will be anyway corrupt or incomplete. This logic is already implemented in the "mapper" backend, but not in the "simplemapper" backend (which is the one used in the example from @JuliaLawall ).

This goes on my TODO list :-)

rdicosmo commented 3 years ago

This is now fixed in commit f140dbc8b05aa3d341c70436a1920a06df9a0ed4 This is now tagged as 1.2.3 and the corresponding opam package is submitted to opam-repository. Thanks to @JuliaLawall for reporting this issue.