pelahi / VELOCIraptor-STF

Galaxy/(sub)Halo finder for N-body simulations
MIT License

mpirun Error reading multifile gadget snapshot #111

Closed jegarfa closed 10 months ago

jegarfa commented 1 year ago

Hi! I am running VELOCIraptor on a Gadget2 snapshot. The snapshot is split into 1024 files, from snap.0 up to snap.1023, and it has 2048^3 particles in Gadget format. When running VELOCIraptor on this snapshot I get the following reading error:

reading /data/simulation/LCDM/LCDM_z1p000.1002
reading /data/simulation/LCDM/LCDM_z1p000.1003
can't open file reading /data/simulation/LCDM/LCDM_z1p000.1004
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 0 on
node deimos exiting improperly. There are three reasons this could occur:

I have checked that the file LCDM_z1p000.1004 exists and is not corrupted. I then tried running VELOCIraptor on a different simulation whose snapshot is also made of 1024 subfiles, and I got the same error. Could it be related to a limit on the number of subfiles that can be read? It always happens with the ".1004" subfile, which I do not understand.

Any help would be very much appreciated,

Thank you in advance!

pelahi commented 1 year ago

Hi @jegarfa, sorry for the late reply (I've been on holidays and then catching up with work). Your error is unusual. I have read inputs of 512 Gadget-2 files but never tried 1024, though I have used the code to read 4096 RAMSES input files. The odd bit is that the error message occurs when the file is just being opened https://github.com/pelahi/VELOCIraptor-STF/blob/6f4b760ef5043b959a922a8e7ae453fd0a9f988f/src/gadgetio.cxx#L149-L152

Could it be that the compute node is unable to read these files? Also, is the code able to read the file at all at earlier points? I should update the Gadget-based input MPI decomposition, as it is missing checks for whether the file can be read (see function https://github.com/pelahi/VELOCIraptor-STF/blob/6f4b760ef5043b959a922a8e7ae453fd0a9f988f/src/mpigadgetio.cxx#L506), but my guess is that it fails to read the file there too, and since the error is not reported and no exception is thrown, it just keeps chugging along.

For completeness' sake, can you try using the feature/codecleanup-icrar branch? This will be VELOCIraptor 2, but I am still testing some things (it has been delayed due to other commitments).

jegarfa commented 1 year ago

No worries @pelahi. Straight to the point: I think the error was related to a lack of memory on a node. I noticed that the amount of memory required was larger than what I had available. I switched to a machine with a different node configuration and more memory, and I did not get this error; all the files were read without problems.

I also tried the feature/codecleanup-icrar branch on both machines and got the same results (the same error on the machine with insufficient memory, and no error on the other one).

Although the files are now read fine, I am getting a different error when running velociraptor-stf. I will report it in a separate GitHub issue to avoid mixing the two topics.