pelahi / VELOCIraptor-STF

Galaxy/(sub)Halo finder for N-body simulations
MIT License

Running VELOCIraptor on large RAMSES cosmo hydro snapshots #119

Open sorcej opened 2 months ago

sorcej commented 2 months ago
pelahi commented 2 months ago

Hi @sorcej, I think it is likely that the header-reading routine isn't parsing the header correctly. I have tried to make it work, but I've encountered several RAMSES formats with different header structures. Since the binary is not self-describing, I did not know how to handle them all. Could you provide a description of the binary data? Also, are you using the development branch? It is actually better than main (I need to make it the default).

sorcej commented 2 months ago

Thanks @pelahi! OK, so I switched to the development branch. I have the same issue.

I do not have a C++ reader unfortunately, but common Python or Fortran readers typically work on the simulation (for instance, https://github.com/florentrenaud/rdramses/blob/master/rdramses.f90 works after some modifications to use longint). So there should not be anything specific, except that I had to use longint for both DM and star particles.

sorcej commented 2 months ago

Here might be some of the problems in the reader:

// Total number of Stars over all processors
Framses.read((char*)&dummy, sizeof(dummy));
Framses.read((char*)&ramses_header_info.nstarTotal, sizeof(int));
Framses.read((char*)&dummy, sizeof(dummy));

-> since I am using longint for nstarTotal (because there are more than 2^31 stars). However, it is probably not the only one... I am not sure whether it is easy to add a longint option for both nstartot and nparttot.

sorcej commented 2 months ago

(@pelahi): OK, I have fixed and modified a few things; now I am stuck a bit further on, in MPINumInDomain. Also, I did not understand what this is: dmp_mass = 1.0 / (opt.Neff*opt.Neff*opt.Neff) * (OmegaM - OmegaB) / OmegaM; as it gives something negative, I bypassed it using particle IDs instead.

pelahi commented 2 months ago

Hi @sorcej, apologies, I'll be a bit slow in replying for the next two days as I have to finish marking assignments for a high performance computing course. Do you mind creating a draft PR with your proposed changes so I can have a look?

sorcej commented 1 month ago

Thanks @pelahi and sorry for the delay. I cannot open a PR from the supercomputer unfortunately. Anyway, I tried to understand where it was crashing further on and I finally managed to pinpoint it. It is in MPIInitialDomainDecompositionWithMesh(opt). Not sure if it is because I am using too many or not enough cores perhaps? -> OK, found it: I was using too many cores and, consequently, n3 did not fit in an unsigned int. With fewer cores, I am now stuck a bit further on, when broadcasting... I am continuing to explore.
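The kind of overflow described here can be caught with a guard like the one below. It is a sketch only: it assumes n3 is the cube of a per-dimension cell count, which the thread does not spell out, and `CubeFitsInUnsignedInt` is a hypothetical name, not a VELOCIraptor function.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical check: would n*n*n still fit in an unsigned int?
// The comparison is done with divisions so that the test itself
// cannot overflow.
inline bool CubeFitsInUnsignedInt(std::uint64_t n)
{
    const std::uint64_t limit = std::numeric_limits<unsigned int>::max();
    if (n == 0) return true;
    if (n > limit / n) return false;   // n*n already exceeds the limit
    return n * n <= limit / n;         // equivalent to n*n*n <= limit
}
```

On a platform with a 32-bit unsigned int, 1024 cells per dimension pass this check (2^30 cells) while 2048 do not (2^33 cells), so storing the cube in a 64-bit type may be the simpler fix.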

sorcej commented 1 month ago

After several days, I decided to simply remove the option opt.impiusemesh to try to go further. I will try later to re-add it...

sorcej commented 1 month ago

@pelahi OK, so now I am again stuck a bit further on. I again had to bypass dmp_mass = 1.0 / (opt.Neff*opt.Neff*opt.Neff) * (OmegaM - OmegaB) / OmegaM; as it gives something negative, so I had to use particle IDs instead, in mpiramsesio.cxx this time. It starts running and counts particles properly (note that I did not understand why we had to do it twice, as it was already done, but OK). Now I have:

malloc(): corrupted top size
[mpiexec@i06r03c04s12] control_cb (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1324): assert (!closed) failed
[mpiexec@i06r03c04s12] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[mpiexec@i06r03c04s12] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2045): error waiting for event

... I am trying to fix that one too.

pelahi commented 1 month ago

Hi @sorcej, the dmp_mass calculation was based on reading some RAMSES data where it was easier to quickly calculate the mass of dark matter particles from the matter density minus the baryon density and the effective resolution of the simulation. Regarding your error, it could be related to ints being used where values exceed about 2e9 and then overflow. Could this be the case?
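As a rough illustration of that calculation, the sketch below evaluates the dark matter fraction of the total matter divided by Neff cubed, in double precision, and rejects inputs that would give a non-positive mass. `DarkMatterParticleMass` is a hypothetical name, and the interpretation of the result as a mass in units where the total matter in the box is 1 is an assumption. Two ways a negative value could arise, neither confirmed in this thread, are an Neff that was never set to a positive value or a cube evaluated in 32-bit integer arithmetic that overflows.

```cpp
#include <stdexcept>

// Hypothetical sketch of the dmp_mass formula quoted in this thread.
// neff   : effective resolution per dimension (assumed > 0)
// omegaM : total matter density parameter
// omegaB : baryon density parameter
// Working in double avoids integer overflow in neff^3.
inline double DarkMatterParticleMass(double neff, double omegaM, double omegaB)
{
    if (neff <= 0.0)
        throw std::invalid_argument("Neff must be positive");
    if (omegaM <= 0.0 || omegaB < 0.0 || omegaB >= omegaM)
        throw std::invalid_argument("need 0 <= OmegaB < OmegaM");

    return 1.0 / (neff * neff * neff) * (omegaM - omegaB) / omegaM;
}
```

Failing loudly on a bad Neff, rather than silently returning a negative mass, would make a missing or mis-parsed option easier to spot.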

sorcej commented 1 month ago

@pelahi OK thanks, now I understand dmp_mass. I actually had to change Neff manually though. I will try to add it as an option (unless there is an option that I missed). Regarding the malloc error, I am not sure yet, but I think something went wrong when reading the particle positions, so they cannot be properly assigned to the different tasks. I am trying to fix it.

sorcej commented 1 month ago

@pelahi OK, I have fixed the particle positions. I still need to understand something with the IDs but it looks better now. I think I now have yet another problem to solve when writing:

[mpiexec@i05r01c01s04] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:360): write error (Bad file descriptor)

Any idea where precisely something could be going wrong here? I am trying to pinpoint it but it is not obvious.

I still of course have load balancing issues but for now I leave it as is.

I might also have to fix some units but I will see later.

Thanks again :)

pelahi commented 1 month ago

Hi @sorcej, I'm not certain about the write error. When does this happen? Could you provide the associated VELOCIraptor output?

sorcej commented 1 month ago

@pelahi The code writes the files .configuration, .siminfo and .units and then stops, but it is still in the middle of SearchFullSet. I confirm that it is stuck in pfof=SearchFullSet(opt,Nlocal,Part,ngroup) again... I fixed the long IDs and the long trees, but I still have a seg fault and the debug mode is not helpful.

sorcej commented 1 month ago

@pelahi OK, I finally managed to narrow down where it crashes. It is when trying to build a new tree: tree = new KDTree(Part.data(),nbodies,opt.Bsize,tree->TPHYS,tree->KEPAN,1000,0,0,0,period);

I have yet to find out why it crashes for some of the tasks. Sometimes it gives me "free(): invalid pointer" and sometimes "double free or corruption (out)", but I do not understand why, nor why it happens only for some of the tasks.

pelahi commented 1 month ago

Hi @sorcej , would you have a small ramses input example I could try? It would help me debug the issue.

sorcej commented 3 weeks ago

Hi @pelahi, sorry for not getting back to you earlier. I was attending conferences (actually I will be leaving again on Sunday). Anyway, it works with a small example. That said, I have now managed to get outputs with the large simulation as well when dealing only with DM particles. I still need to check these outputs and whether they make sense. Then I will try to deal with the stars too, etc. Thanks