Closed UnixJunkie closed 9 years ago
Closing this for now, as no new reports confirm the finding; feel free to reopen if more information becomes available.
I get this same error: sequential works, but par_map raises it. The function that is mapped uses both Lacaml and GSL. I tried to make sure all Bigarrays and C-side structures that are used are created fresh inside the loop, but I have not been able to get it to work so far. Is there something in particular to be aware of?
Hmmm... are you running this on 32 bit architectures by any chance?
-- Roberto Di Cosmo
Office locations:
Paris Diderot: Bureau 3020 (3rd floor), Batiment Sophie Germain, 8 place Aurélie Nemours. Metro: Bibliotheque F. Mitterrand (ligne 14/RER C).
INRIA: Bureau C123, Batiment C, 2 Rue Simone Iff. Metro: Dugommier (ligne 6), Gare de Lyon (ligne 14/RER A).
Tel: +33 1 80 49 44 42
GPG fingerprint 2931 20CE 3A5A 5390 98EC 8BFC FCCA C3BE 39CB 12D3
No, macOS, standard amd64. Actually, I am not 100% sure there are no objects created by GSL before the loop that are referenced from inside the loop via some closures. But I don't really know whether that should matter?
This happens from bytecode, in case that's important.
AFAIR, this exception comes from the marshalling layer that reads the data shipped between the main process and the workers in Parmap.
Did you try a simple standard test on your machine? There are quite a few in the testing directory: do a make test and then run some of them. We'll see whether this comes from your code or from Parmap on your configuration.
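For context, the failure mode can be illustrated with the stdlib alone. The following is a sketch of what the marshalling layer does, not Parmap's actual code (Parmap reads from pipes at the C level and surfaces Failure "input_value_from_block: bad object" or End_of_file, rather than the Invalid_argument the high-level API raises here):

```ocaml
(* Sketch: Parmap ships values between master and worker processes
   through OCaml's Marshal module. An intact buffer round-trips fine;
   a truncated or corrupted one is rejected at unmarshal time. *)
let () =
  let buf = Marshal.to_string [1; 2; 3] [] in
  (* The intact buffer round-trips. *)
  let l : int list = Marshal.from_string buf 0 in
  assert (l = [1; 2; 3]);
  (* Chopping off the last byte makes the declared data size
     inconsistent with the buffer, so unmarshalling raises. *)
  let truncated = String.sub buf 0 (String.length buf - 1) in
  match (Marshal.from_string truncated 0 : int list) with
  | _ -> print_endline "unexpectedly succeeded"
  | exception _ -> print_endline "truncated buffer rejected"
```

This is why a corrupted or short read on the pipe between processes shows up as an unmarshalling exception rather than as an error at the point where the data was produced.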
OK, so the tests run fine. They are native code. I then tried my program in native code, and it runs through successfully! Next I compiled my program to bytecode: it also runs! Next the bare ocaml toplevel: runs as well!! So it's utop again!
I was thinking the incompatibility was fully resolved, but alas, not really. Using parmap 1.0-rc7.1.
@nilsbecker can you report it to the utop maintainers with a reproducing case? Thanks!
To make sure, I re-ran the whole thing in utop once more. Now it runs, and I feel stupid. I swear I got the exception repeatedly before; no idea where it came from then. It was a longer interactive session, so maybe something got messed up along the way. Sorry for the noise. I also tried some of the tests in the distribution by #use-ing them from utop, and this worked as well.
Incidentally, it seems that the version of parmap on opam does not yet include the fix for utop. Is this correct? I unpinned the git version, reinstalled, and parmap still hangs in utop.
Hi again. I now have a reproducible case in which I get this exception, in compiled native code as well as bytecode, and from utop, on Mac OS X, 64-bit.
In the parallelized loop, Lacaml is used to construct and manipulate matrices, and GSL to do some minimization involving the matrices. Whenever I run the loop in parallel with matrix dimensions bigger than a certain threshold (roughly 256x256), I get the exception shown in the title; for smaller matrices I do not. Running the code sequentially always works. Since Lacaml calls the Accelerate-optimized BLAS on Mac OS X, the sequential code already uses two of the four cores.
I don't really know how to reduce this to a minimal example, though, since it requires specific parameter settings in my somewhat long program.
Hi Nils, does this happen only inside utop, or also without utop?
No, this time it's independent of utop: native, bytecode, or utop.
This could be due to a change in the Marshal module. Previously, Marshal would raise End_of_file more easily; nowadays it sometimes raises Failure. Can you try different versions of OCaml (with the opam switch command), to see whether the bug disappears at some earlier version?
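The version bisection suggested here can be driven by opam switches. A sketch in opam 2.x syntax (the thread itself predates opam 2, where the command was plain "opam switch 4.02.1"):

```shell
# Sketch: one switch per compiler version; rerun the failing program
# under each until the behaviour changes (package names as in the thread).
opam switch create 4.02.1
eval $(opam env)
opam install parmap lacaml gsl
# rebuild and rerun the failing program, then repeat with the next version
```

Each switch is an isolated compiler plus package universe, so the libraries have to be reinstalled per switch.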
In what version would you suspect the change? (I am probably using some newish features, but 4.02.1 may work.)
The INRIA OCaml Mantis search is too dumb; I can't find it again. But: make sure that 4.02.1 works, then use opam switch to move to the version just after it, until you find your bug. This change in Marshal's behaviour has bitten me before, so I know about it.
Hi, I checked, and 4.02.1 unfortunately already shows the same behavior.
Try going back a few versions earlier. Use a git bisect style search if you want to minimize the number of tests.
I would suspect 4.01.0 had the right unmarshalling behavior.
Nope, 4.01.0 throws the same exception for me. I am not sure whether I'm violating some assumptions made by Parmap, since I use external objects from C via GSL. But the fact that smaller matrix sizes work and give reasonable results suggests I am not doing anything terrible.
Hmm... this might be an issue related to the size of the elements passed along by Parmap; see the comments above the lines https://github.com/rdicosmo/parmap/blob/master/parmap.ml#L129-L131
You may try the alternative code that was commented out in those lines; if it works, then that will have nailed the bug.
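If payload size is the suspect, one low-effort experiment is to tune how much is marshalled per master/worker exchange via parmap's chunksize argument. A sketch, assuming the parmap package is installed; the squaring function is just a stand-in for the real Lacaml/GSL loop body:

```ocaml
(* Sketch: chunksize:1 marshals one input/result at a time, keeping each
   exchange between master and workers as small as possible. The returned
   value per input should also be kept small (a summary, not a matrix). *)
let () =
  let work x = float_of_int (x * x) in   (* placeholder for the real computation *)
  let results =
    Parmap.parmap ~ncores:4 ~chunksize:1 work (Parmap.L [1; 2; 3; 4]) in
  List.iter (fun r -> Printf.printf "%.1f\n" r) results
```

If the failure tracks the size of what crosses the fork boundary rather than the per-worker working set, this should change the threshold at which it appears.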
I looked at it. Since I'm on OS X, I'm not sure there is code that I can simply uncomment and use. Or can I?
Well, no, on Mac OSX any big file allocation would actually fill up the virtual memory (and your swap space) :-(
About memory: I'm now on a machine with 32 GB of RAM, and for the parameter settings that still work, each worker uses about 500 MB (I have 8 of them running).
An update: I have now tried the same thing on a Linux machine with 16 GB of RAM. At parameter settings where OS X throws an exception, the program runs through on Linux (!). This is for a 512x512 matrix used internally in each worker.
It's possible this is an OS X thing. I will try bigger matrices on Linux to see if I can make it fail.
Ok, that's great to know :-)
I confirmed that Linux does not throw the exception even with bigger memory requirements (16 times what I had before). It seems that the issue I am seeing is specific to OS X. Just to be sure: the list of results finally returned by Parmap is in fact not big; the big memory allocation happens only internally in each parallel worker.
Thanks Nils. Unfortunately, I do not have an OS X machine to test on, so I cannot be of much help here... I wonder if you can use some tool (strace/ptrace) to see what exactly happens when the exception is triggered, or whether other people here with OS X may lend a hand.
I could try, but I don't have much experience; the last time I tried native debugging of OCaml on OS X it was discouraging...
OK, I now have a backtrace, at least, from running with OCAMLRUNPARAM=b:
Fatal error: exception Failure("input_value_from_block: bad object")
Raised by primitive operation at file "bytearray.ml", line 102, characters 0-68
Called from file "parmap.ml", line 99, characters 11-34
Called from file "parmap.ml", line 209, characters 13-46
Called from file "parmap.ml" (inlined), line 530, characters 4-65
Called from file "test_like.ml" (inlined), line 557, characters 17-40
Called from file "test_like.ml", line 559, characters 2-244
Called from file "test_like.ml", line 570, characters 2-17
I would be open to debugging this further if someone can give me some guidance on how to do it on OS X. In that case it would probably be best to open a new issue.
I found this:
https://mail.scipy.org/pipermail/numpy-discussion/2012-August/063589.html
It may indicate that the optimized BLAS/LAPACK on OS X conflicts with what Parmap does to parallelize. If so, that would be unfortunate... then the only fix would be to link Lacaml against some alternative BLAS/LAPACK, which could be a pain...
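If Accelerate really does misbehave across fork(), as the numpy thread above suggests, a commonly tried mitigation is to cap it at a single thread before launching the program. A sketch, assuming Accelerate honours its VECLIB_MAXIMUM_THREADS thread-cap variable; the binary name is just a placeholder:

```shell
# Mitigation sketch: force Accelerate/vecLib to a single thread so the
# forked Parmap workers do not inherit a multithreaded BLAS state.
# ("./test_like" is a placeholder for the actual program.)
VECLIB_MAXIMUM_THREADS=1 ./test_like
```

If the exception disappears with the cap in place, that would point squarely at the BLAS/fork interaction rather than at Parmap's marshalling.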
A colleague is getting either End_of_file or Failure("input_value_from_block: bad object") in code that uses Parmap. He has: OS: Ubuntu 18.04.5 LTS (Bionic Beaver); OCaml: 4.05.0; parmap: 1.0 rc8-1build1. He is running the code in Docker. The code has a long running time, and the data may be very large.
@JuliaLawall a workaround idea: if he is using Parmap.parmap, try using the parany library instead, with Parany.Parmap.parmap. If he is not using Parmap.parmap, then he will have to use the more generic Parany.run directly. Parany is designed to work with potentially infinite lists of things to compute.
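For reference, a minimal sketch of that workaround, assuming the parany package is installed and that Parany.Parmap.parmap takes the core count as its first argument (check the package's interface for the exact signature before relying on this):

```ocaml
(* Hypothetical sketch of the suggested swap-in: same map-over-a-list
   shape as Parmap.parmap, with parany doing the process management. *)
let () =
  let f x = x * x in                         (* stand-in workload *)
  let ys = Parany.Parmap.parmap 4 f [1; 2; 3] in
  List.iter (fun y -> Printf.printf "%d\n" y) ys
```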
Thanks. I don't know if the problem is the number of things to compute or their size.
julia
Hi Julia, the errors you mention point to data corruption when transferring values between different processes; this may be caused by very large data being shuffled around. It is difficult to help your colleague without an MWE, though...
I guess MWE is minimal working example? I suspect that minimizing would make the problem disappear. We may be able to avoid the problem by dropping unnecessary data. I mostly wanted to record in the discussion that the problem is not specific to macOS.
julia
Has anyone seen this before?
I have some code using Parmap that crashes in parallel but not when run sequentially (no Parmap).