hannorein / rebound

💫 An open-source multi-purpose N-body code.
https://rebound.readthedocs.io/
GNU General Public License v3.0

Problems when using MPI #689

Closed: chriskilday closed this issue 1 year ago

chriskilday commented 1 year ago

Hello,

I've recently been working on creating a binary star system with a self-gravitating disc using MPI and the tree gravity solver to run the work in parallel. I know this is a less commonly used area of rebound, and I have come across a few problems, some of which I have solved, and some are still ongoing.

First, I found an issue related to the stars being located on different nodes in the tree. With the binary orbit centered on the origin and using an even number of tree roots (meaning their intersection is also at the origin), the stars would fly apart as if the other one didn't even exist. I didn't find the exact cause of the problem, but got around it simply by using an odd number of root boxes, so the binary stays within one box.

I'm also having a weird issue with the reb_tools_particle_to_orbit() function. When calling it, I get either a seg fault or a bus error, and which one I get appears to be random. Whether the problem arises at all also seems random: it happens immediately every time with 10,000 particles in the disc, almost never happens with only 100, and with 1,000 particles it sometimes comes up immediately and other times runs smoothly for a few hundred time units before crashing. The particles in the disc are initialized with some randomness, but that is the only difference between runs. I tried looking into the issue by writing my own particle_to_orbit() function. My plan was to gradually copy the existing rebound function nearly line by line until I found the problem. For the purposes of my project, I don't need every parameter the rebound function calculates, and some of the error checking is unnecessary for me, so I left a few lines out; other than that, everything is calculated the same way. However, I never found the source of the problem. My own function runs fine without crashing, so the issue has to be in something I left out, but just looking over it I can't see where. Since I got all of the information I required, I moved on and have been using my own version of particle_to_orbit.

There's also occasionally a seg fault that pops up when writing out data from the gravity tree. I've been writing the information from the tree to a file to easily read in the density later, and while it works fine most of the time, very infrequently it seg faults. I've only seen the problem come up after the simulation has already been running smoothly for hours and haven't been able to narrow the problem any further than "it's something to do with the tree."

Finally, I am consistently getting an error on the cleanup of the simulation. Whenever the simulation reaches the end, I get a very long backtrace printed telling me both "Error in pb.exe: munmap_chunk(): invalid pointer: 0x0000000000dcbde8" and "Error in pb.exe: free(): invalid pointer: 0x0000000000c3e468". It happens every time, but as long as I throw in an MPI barrier before the calls to reb_mpi_finalize() and reb_free_simulation(), it doesn't affect the data, so I've just been ignoring it.

I also noticed that some of the functionality in the tools package isn't intended for running work in parallel, so I wrote MPI-safe versions of get_com() and move_to_com(). I've attached both of those functions, as well as my version of particle_to_orbit, below. You can also find the setup for the project I've been working on.
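For readers following along, here is a minimal sketch of why a per-process centre-of-mass calculation goes wrong under MPI and how partial sums can be combined. It is not the attached code: `struct body`, `struct com_partial`, `local_partial()` and `combine_partials()` are hypothetical names, and the ranks are simulated with a plain array so the example runs without MPI. In a real MPI run the combining loop would presumably be a sum reduction across ranks (for example MPI_Allreduce with MPI_SUM on the four doubles).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the mass and position fields of a particle. */
struct body { double m, x, y, z; };

/* Mass-weighted partial sums that one rank computes from its local particles. */
struct com_partial { double m, mx, my, mz; };

/* Sum mass and mass-weighted position over the n particles held locally. */
struct com_partial local_partial(const struct body *p, size_t n) {
    struct com_partial s = {0.0, 0.0, 0.0, 0.0};
    for (size_t i = 0; i < n; i++) {
        s.m  += p[i].m;
        s.mx += p[i].m * p[i].x;
        s.my += p[i].m * p[i].y;
        s.mz += p[i].m * p[i].z;
    }
    return s;
}

/* Combine the partial sums of all ranks into the global centre of mass.
 * Assumption: under MPI this loop would be replaced by a sum reduction. */
struct body combine_partials(const struct com_partial *s, size_t n_ranks) {
    assert(s != NULL && n_ranks > 0);
    struct com_partial t = {0.0, 0.0, 0.0, 0.0};
    for (size_t r = 0; r < n_ranks; r++) {
        t.m  += s[r].m;
        t.mx += s[r].mx;
        t.my += s[r].my;
        t.mz += s[r].mz;
    }
    struct body com = {t.m, 0.0, 0.0, 0.0};
    if (t.m > 0.0) {  /* avoid dividing by zero for massless systems */
        com.x = t.mx / t.m;
        com.y = t.my / t.m;
        com.z = t.mz / t.m;
    }
    return com;
}
```

The point of the design is that each process only ever sees its own particles, so averaging locally and then averaging the averages would weight the ranks incorrectly. Summing the mass-weighted quantities first and dividing once avoids that.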

Rebound-Additions

hannorein commented 1 year ago

Hi Chris,

I haven't used the MPI part myself for quite some time. It definitely needs some time to get up to speed. I might have some time over the coming months to look at it in more detail.

If you get a segmentation fault, something is definitely wrong and I would not ignore it. Often, the key to figuring out what's going on is to come up with a reproducible scenario that always results in a segmentation fault. If you use a random number generator, always use the same seed when testing.
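As a rough illustration of the fixed-seed advice, the sketch below seeds the C standard library generator with a constant before drawing values, so two runs produce the same initial conditions. `TEST_SEED` and `fill_uniform()` are hypothetical names chosen for this example; the actual disc setup in the project may use a different generator.

```c
#include <assert.h>
#include <stdlib.h>

/* Arbitrary constant: any fixed value gives a repeatable sequence. */
#define TEST_SEED 12345u

/* Fill out[0..n) with uniform values in [0, 1), reseeding the generator first
 * so that the same seed always yields the same sequence. */
void fill_uniform(double *out, size_t n, unsigned int seed) {
    assert(out != NULL);
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        out[i] = rand() / ((double)RAND_MAX + 1.0);
    }
}
```

Calling `fill_uniform(buf, n, TEST_SEED)` in every test run means a crash that depends on the initial particle distribution shows up at the same point each time.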

I'm surprised the reb_tools_particle_to_orbit function is giving you trouble. Are you sure you are giving it valid pointers to particles? Print out the particle structures before passing them to the function to see if the problem is somewhere else.
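A minimal sketch of that kind of check, assuming a stand-in structure rather than the real struct reb_particle: `struct body`, `body_is_finite()` and `body_print()` are hypothetical helpers that print the fields and flag NaN or infinite values before the data is handed to an orbit conversion.

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>

/* Hypothetical stand-in for the mass, position and velocity of a particle. */
struct body { double m, x, y, z, vx, vy, vz; };

/* Returns 1 if every field is finite, 0 if any field is NaN or infinite. */
int body_is_finite(const struct body *p) {
    assert(p != NULL);
    return isfinite(p->m) &&
           isfinite(p->x)  && isfinite(p->y)  && isfinite(p->z) &&
           isfinite(p->vx) && isfinite(p->vy) && isfinite(p->vz);
}

/* Print one particle on a single line and mark it if a value is not finite. */
void body_print(FILE *f, const char *label, const struct body *p) {
    fprintf(f, "%s: m=%g pos=(%g, %g, %g) vel=(%g, %g, %g)%s\n",
            label, p->m, p->x, p->y, p->z, p->vx, p->vy, p->vz,
            body_is_finite(p) ? "" : "  <-- non-finite value");
}
```

Printing both the particle and the primary to stderr right before the call makes it easier to tell whether the crash comes from bad input or from the function itself.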

hannorein commented 1 year ago

Just a quick update. I can reproduce the issue with the binary. The gravity between the two stars is not taken into account because of this line. The reason this line is there is to avoid self-interactions. This works fine for non-MPI runs. But when MPI is used, the particle indices of two different particles can be the same (because they are on different nodes).
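The index collision can be illustrated with a small sketch. This is not the REBOUND source and not necessarily the fix that was later pushed: `struct particle_tag` and both comparison functions are hypothetical, and comparing the owning rank together with the local index is just one possible way to tell particles apart.

```c
#include <assert.h>

/* Hypothetical tag: the MPI rank that owns a particle and the particle's
 * index in that rank's local array. */
struct particle_tag { int rank; int local_index; };

/* Index-only check: fine when all particles live in a single array, but two
 * different particles on different ranks can share the same local index. */
int is_self_by_index(struct particle_tag a, struct particle_tag b) {
    return a.local_index == b.local_index;
}

/* Check that also compares the owning rank, so particles on different ranks
 * are never mistaken for the same particle. */
int is_self_by_rank_and_index(struct particle_tag a, struct particle_tag b) {
    assert(a.rank >= 0 && b.rank >= 0);  /* MPI ranks are non-negative */
    return a.rank == b.rank && a.local_index == b.local_index;
}
```

With one star stored as index 0 on rank 0 and the other as index 0 on rank 1, the index-only check treats them as the same particle and skips their mutual gravity, which matches the behaviour of the stars flying apart.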

I need to think a bit about how to resolve this. Either this is a really old bug that has just never been discovered, or I have changed something over the last few years that messes things up and I didn't notice because I have not used MPI myself lately.

hannorein commented 1 year ago

I've pushed a potential fix to the mpi_ci branch. If you have a chance, let me know if this helps (or introduces any other issues). I'll continue to look into the MPI stuff a bit more...

hannorein commented 1 year ago

One more thing that might help you: If you get a segmentation fault in the _to_orbit() function, that's probably because you are passing a particle structure which contains an invalid sim pointer. This might occur if you manually copy particles around. To avoid the issue, simply set the pointer to NULL:

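struct reb_particle p = ...;  // particle that was copied from elsewhere
p.sim = NULL;                 // clear the possibly stale simulation pointer
struct reb_orbit o = reb_tools_particle_to_orbit(r->G, p, primary);

hannorein commented 1 year ago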

I've merged the changes into the master branch. If you notice any other bugs, please open a new issue!