rdicosmo / parmap

Parmap is a minimalistic library that lets OCaml programs exploit multicore architectures with minimal modifications.
http://rdicosmo.github.io/parmap/

Fatal error: exception Failure("input_value_from_block: bad object") #18

Closed UnixJunkie closed 9 years ago

UnixJunkie commented 10 years ago

Has anyone seen this before?

I have some code using Parmap that crashes in parallel but not in sequential (no Parmap).

rdicosmo commented 9 years ago

Closing this for now, as no new reports confirm the finding. Feel free to reopen if more info is available.

nilsbecker commented 7 years ago

i get this same error: sequential works but par_map raises it. the function that is mapped uses both Lacaml and Gsl. i tried to make sure all Bigarrays and c-side structures that are used are created new inside the loop but i was not able to get it to work so far. is there something to be aware of in particular?

rdicosmo commented 7 years ago

Hmmm... are you running this on 32 bit architectures by any chance?

-- Roberto Di Cosmo


Professeur (on leave at/detache a INRIA) IRIF E-mail : roberto@dicosmo.org Universite Paris Diderot Web : http://www.dicosmo.org Case 7014 Twitter : http://twitter.com/rdicosmo
5, Rue Thomas Mann
F-75205 Paris Cedex 13 France

Office location:

Paris Diderot INRIA

Bureau 3020 (3rd floor) Bureau C123 Batiment Sophie Germain Batiment C 8 place Aurélie Nemours 2, Rue Simone Iff Tel: +33 1 80 49 44 42

Metro Bibliotheque F. Mitterrand Ligne 6: Dugommier ligne 14/RER C Ligne 14/RER A: Gare de Lyon


GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3

nilsbecker commented 7 years ago

no, macos, standard amd64. actually, i am not 100% sure there are no objects created by gsl before the loop referenced from the loop via some closures. but also, i don't really know if it should matter?

nilsbecker commented 7 years ago

this happens from bytecode in case that's important.

rdicosmo commented 7 years ago

AFAIR, this exception comes from the marshalling layer that reads the data shipped from/to main to/from workers in Parmap.
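
To illustrate the kind of failure described above, here is a minimal sketch using only the standard `Marshal` module. It is not Parmap's transport code: the buffer, the names `payload`, `roundtrip` and `corrupted_raises_failure`, and the way the header is damaged are all assumptions made for the demonstration. The exact `Failure` message depends on which unmarshalling entry point detects the problem, so the sketch only checks the exception constructor.

```ocaml
(* Sketch only: stdlib Marshal, not Parmap's internal transport code. *)

(* A well-formed marshalled buffer holding a small float array. *)
let payload : bytes = Marshal.to_bytes [| 1.0; 2.0; 3.0 |] []

(* Reading the intact buffer back yields the original value. *)
let roundtrip : float array = Marshal.from_bytes payload 0

(* Zero the 4-byte magic number at the start of the header and check
   that unmarshalling is rejected with Failure. *)
let corrupted_raises_failure : bool =
  let bad = Bytes.copy payload in
  Bytes.fill bad 0 4 '\000';
  match (Marshal.from_bytes bad 0 : float array) with
  | _ -> false
  | exception Failure _ -> true
```

If the bytes a worker wrote are not what the reader finds, the reader fails in this way even though the mapped function itself ran without error.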

Did you try a simple standard test on your machine? In the testing directory there are quite a few... do a make test and then run some of them... we'll see if this comes from your code or from Parmap on your configuration.

nilsbecker commented 7 years ago

ok. so the tests run fine. they are native code. i then tried my program in native code -- and it runs through successfully! next i tried to compile my program to bytecode -- it runs also! next the bare ocamlc toplevel -- runs as well!! so it's utop again!

i was thinking the incompatibility was fully resolved but alas, not really. using parmap 1.0-rc7.1

UnixJunkie commented 7 years ago

@nilsbecker can you report it to the utop maintainers with a reproducing case, thanks!

nilsbecker commented 7 years ago

to make sure i re-ran the whole thing on utop once more. now it runs and i feel stupid. i swear i got the exception repeatedly before - no idea where it came from then. it was from a longer interactive session - maybe something got messed up along the way. sorry for the noise. i also tried some of the tests in the distribution by #useing them from utop -- this worked as well.

nilsbecker commented 7 years ago

incidentally, it seems that the version of parmap on opam does not yet include the fix for utop -- is this correct? i unpinned the git version, reinstalled and parmap hangs in utop still.

nilsbecker commented 7 years ago

hi again. i now have a reproducible case in which i get this exception, in compiled native code as well as bytecode, and from utop, on mac os x, 64bit.

in the parallelized loop lacaml is used to construct and manipulate matrices, and gsl to do some minimization involving the matrices. whenever i run the loop in parallel with matrix dimensions bigger than a certain threshold (roughly 256x256), i get the exception shown in the title - for smaller matrices i do not get it. running the code sequentially always works. since lacaml calls the Accelerate-optimized BLAS in mac OS X, the sequential code already uses two of the four cores.

i don't really know how to make a minimal example out of this though, since it requires specific parameter settings in my somewhat long program.

rdicosmo commented 7 years ago

Hi Nils, does this happen only inside utop, or also without utop?

nilsbecker commented 7 years ago

no, this time it's independent of utop. native, bytecode or utop.

UnixJunkie commented 7 years ago

This could be due to a change in the Marshal module. Previously, Marshal would raise End_of_file more easily. Nowadays, it sometimes raises Failure instead. Can you try with different versions of OCaml (opam switch command), to see whether your bug disappears starting at some previous version?
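
Since the exception raised on bad input can be either `End_of_file` or `Failure`, code that reads marshalled data defensively may want to catch both. The following is a sketch under assumptions: the `outcome` type and the helpers `write_bytes` and `read_back` are hypothetical names, and it reads from a file with the standard `Marshal.from_channel` rather than from Parmap's buffers.

```ocaml
(* Sketch only: classify how reading a marshalled int list from a file ends. *)
type outcome =
  | Value of int list
  | Truncated            (* End_of_file was raised *)
  | Corrupt of string    (* Failure was raised, with its message *)

(* Write raw bytes to [path], overwriting any previous content. *)
let write_bytes (path : string) (b : bytes) : unit =
  let oc = open_out_bin path in
  output_bytes oc b;
  close_out oc

(* Try to unmarshal an int list from [path], catching both exceptions. *)
let read_back (path : string) : outcome =
  let ic = open_in_bin path in
  let result =
    match (Marshal.from_channel ic : int list) with
    | v -> Value v
    | exception End_of_file -> Truncated
    | exception Failure msg -> Corrupt msg
  in
  close_in ic;
  result
```

Which of the two error cases a partially written file lands in may depend on the OCaml version, so callers should probably treat both as "the data did not arrive intact".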

nilsbecker commented 7 years ago

in what version would you suspect a change? (i am probably using some newish features but 4.02.1 may work)

UnixJunkie commented 7 years ago

The INRIA ocaml mantis search is too dumb. I can't find it again. But: make sure that 4.02.1 works, then use opam switch to go to the version just after it, and so on, until you find your bug. This change in Marshal's behaviour has hit me so I know about it.

nilsbecker commented 7 years ago

hi, i checked and 4.02.1 already shows the same behavior unfortunately.

UnixJunkie commented 7 years ago

Try to go back a few versions earlier. Use git bisect if you want to minimize your number of tests.

UnixJunkie commented 7 years ago

I would suspect 4.01.0 had the right unmarshalling behavior.

nilsbecker commented 7 years ago

nope, 4.01.0 throws the same exception for me. i am not sure if i'm not violating some assumptions made by parmap, since i use external objects from C via gsl. but the fact that smaller matrix sizes work and give reasonable results suggests that i am not doing something terrible.

rdicosmo commented 7 years ago

Hmm... this might be an issue related to the size of the elements passed along by parmap, see the comments above the lines https://github.com/rdicosmo/parmap/blob/master/parmap.ml#L129-L131

You may try the alternative code that was commented out in those lines, and if it works, then that will have nailed the bug.
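
To check whether element size is a plausible factor, one rough approach is to measure how many bytes a value occupies once marshalled. This is only a sketch: `marshalled_size` is a hypothetical helper, and plain `float array array` matrices stand in for the Lacaml matrices of the real program, whose marshalled form may differ.

```ocaml
(* Sketch only: plain OCaml arrays stand in for the real matrices. *)

(* Size in bytes of the marshalled form of [v], header included. *)
let marshalled_size (v : 'a) : int =
  Bytes.length (Marshal.to_bytes v [])

let small = Array.make_matrix 16 16 0.0
let large = Array.make_matrix 512 512 0.0

let () =
  Printf.printf "16x16: %d bytes, 512x512: %d bytes\n"
    (marshalled_size small) (marshalled_size large)
```

Applying the same helper to one actual result of the mapped function would show whether the values travelling between processes are anywhere near the sizes suspected here.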

nilsbecker commented 7 years ago

i looked at it - since i'm on os x i'm not sure there is code that i can simply uncomment and use? or can i?

rdicosmo commented 7 years ago

Well, no, on Mac OSX any big file allocation would actually fill up the virtual memory (and your swap space) :-(

nilsbecker commented 7 years ago

about memory: i'm now on a machine with 32G ram and for the parameter settings that still work, i have about 500M per thread (i have 8 of them running).

nilsbecker commented 7 years ago

an update: i now tried the same thing on a linux machine with 16GB of ram. at parameter settings where OS X throws an exception, the program runs through on linux (!) this is for a 512x512 matrix used internally in each thread.

it's possible this is an os x thing. will try with bigger matrices on linux to see if i can make it fail.

rdicosmo commented 7 years ago

Ok, that's great to know :-)

nilsbecker commented 7 years ago

i confirmed that linux does not throw the exception also with bigger memory requirements (16 times what i had before). it seems that the issue i am seeing is specific to os x. just to be sure, in fact the list of results that is finally returned by parmap is not big. big memory allocation happens only internally in each parallel thread.

rdicosmo commented 7 years ago

Thanks Nils, unfortunately, I do not have an OS X to test, so I cannot be of big help here... I wonder if you can use some tools (strace/ptrace) to see what exactly happens when the exception is triggered, or whether other people here with OS X may lend a hand.

nilsbecker commented 7 years ago

i could try but i don't have much experience. last i tried native debugging of ocaml on os x it was discouraging...

nilsbecker commented 7 years ago

ok, i have a backtrace now at least from running with OCAMLRUNPARAM=b

Fatal error: exception Failure("input_value_from_block: bad object")
Raised by primitive operation at file "bytearray.ml", line 102, characters 0-68
Called from file "parmap.ml", line 99, characters 11-34
Called from file "parmap.ml", line 209, characters 13-46
Called from file "parmap.ml" (inlined), line 530, characters 4-65
Called from file "test_like.ml" (inlined), line 557, characters 17-40
Called from file "test_like.ml", line 559, characters 2-244
Called from file "test_like.ml", line 570, characters 2-17
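
For reference, a backtrace like the one above can also be captured from inside the program instead of through the environment variable. The wrapper below is a sketch with a hypothetical name, `run_with_backtrace`; it uses only the standard `Printexc` module, and the backtrace is only informative when the program was compiled with debug information (`-g`).

```ocaml
(* Sketch only: capture an exception and its backtrace as a string. *)
let run_with_backtrace (f : unit -> 'a) : ('a, string) result =
  (* Same effect as running with OCAMLRUNPARAM=b. *)
  Printexc.record_backtrace true;
  match f () with
  | v -> Ok v
  | exception e ->
    (* Fetch the backtrace before doing anything else that might raise. *)
    let bt = Printexc.get_backtrace () in
    Error (Printexc.to_string e ^ "\n" ^ bt)
```

Wrapping the function passed to the parallel map in such a helper would presumably help tell apart exceptions raised by the worker's own computation from those raised while results are read back.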

nilsbecker commented 7 years ago

i would be open to try debugging this more if someone can give some guidance how to do it on os x. in that case it would probably be best to open a new issue

nilsbecker commented 7 years ago

i found this:

https://mail.scipy.org/pipermail/numpy-discussion/2012-August/063589.html

which may indicate that the optimized blas/lapack on os x is in conflict with what parmap does to parallelize. if so that would be unfortunate... then the only fix would be to link lacaml to some alternative blas/lapack which could be a pain...

JuliaLawall commented 3 years ago

A colleague is getting either End of File or Failure("input_value_from_block: bad object") in code that uses parmap. He has:

OS: 18.04.5 LTS (Bionic Beaver)
ocaml: 4.05.0
parmap: 1.0 rc8-1build1

He is running the code in docker. The code has a long running time. The data may be very large.

UnixJunkie commented 3 years ago

@JuliaLawall a workaround idea: if he is using Parmap.parmap, try using the parany library instead and Parany.Parmap.parmap. If he is not using Parmap.parmap, then he will have to use the more generic Parany.run directly. Parany is designed to work with potentially infinite lists of things to compute.

JuliaLawall commented 3 years ago

Thanks. I don't know if the problem is the number of things to compute or their size.

julia

rdicosmo commented 3 years ago

Hi Julia, the errors you mention point to data corruption when transferring values between different processes: this may be caused by very large data being shuffled around. It is difficult to do something to help your colleague without a MWE, though ...

JuliaLawall commented 3 years ago

I guess MWE is minimum working example? I suspect that minimizing would cause the problem to disappear. We may be able to avoid the problem by dropping unnecessary data. I mostly wanted to record in the discussion that the problem is not specific to MacOS.

julia
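
The idea of dropping unnecessary data mentioned above could look roughly like the following. Everything here is hypothetical: the record `full_result`, the functions `compute` and `compute_summary`, and the choice of a minimum as the only value the caller needs are assumptions for illustration, not code from any program discussed in this thread.

```ocaml
(* Sketch only: hypothetical worker result, not code from the thread. *)
type full_result = {
  matrix : float array array;  (* large intermediate data *)
  minimum : float;             (* the value the caller actually needs *)
}

(* Hypothetical stand-in for the expensive per-item computation. *)
let compute (n : int) : full_result =
  let matrix = Array.make_matrix n n 1.0 in
  let minimum =
    Array.fold_left (fun acc row -> Array.fold_left min acc row) infinity matrix
  in
  { matrix; minimum }

(* Variant that discards the matrix before returning. *)
let compute_summary (n : int) : float = (compute n).minimum

(* Marshalled size in bytes, as a rough proxy for transfer cost. *)
let marshalled_size v = Bytes.length (Marshal.to_bytes v [])
```

Passing something like `compute_summary` rather than `compute` to the parallel map keeps the large intermediate data inside each worker, so only a few bytes per item have to cross the process boundary.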