open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Possible OOP Issue #7031

Closed Willian-Girao closed 7 months ago

Willian-Girao commented 5 years ago

Background information

Hello there. I'm a senior software developer and a master's-level Computer Science researcher/student. I'm currently using Open MPI to simulate communicating sensors within a sensor network (each process is a sensor, and they "only" have to communicate with the other sensors in a fairly straightforward fashion).

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

I'm using openmpi-4.0.1.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed in a Linux environment from the command line as follows:

sudo apt-get install openmpi-bin libopenmpi-dev

Please describe the system on which you are running


Details of the problem

Context: As mentioned above, I'm simulating a sensor network where each process represents a node with its own specific id. The program works in a pretty straightforward manner: each sensor sends specific message/acknowledgment pairs to the "neighboring sensors" it is in contact with and, after all the neighbors have sent back their respective acknowledgment messages, the sensor decides what to do - which is basically just sending another message for a status update within the sensor code.

Problem: my original code (which followed an OOP paradigm) behaves as I expect and produces the same output (the output for this particular problem is always deterministic) no matter how many times I run it for the same instance of the problem on my Windows 10 computer. The problem only appears when I run the exact same code on a Linux machine: the output varies across executions (when, as said before, it should be deterministic). It is correct in around 70% of the executions (outputting what it is supposed to output - the same as what is output on the Windows 10 machine) and behaves in a totally random way in the remaining runs. I added many debug options to my code and spent a couple of weeks trying to pin down what was going on. After generating a log of the messages sent between the sensors (the processes), I noticed that in the roughly 30% of runs where the program misbehaves, the processes simply do not send all the messages they send during the "healthy" executions (and which messages go missing also varies across the "unhealthy" executions).

Solution I found: after every other attempt failed, I gave it one last shot and restructured my (C++) code in a procedural manner (I did not change how it processes the data; I just put it all in a single file, written procedurally) and, to my surprise, it worked: the executions on Linux now behave the same way they do on the Windows 10 machine.

So, apparently Open MPI doesn't like OOP. The whole code was parameterized inside the class I created to implement the sensor's behavior. I'd be very glad to share my code with you but, as this is ongoing research, I'm not able to share any of it yet. Once the paper is published I'll be able to send you both versions (the original OOP one and the procedural one that worked).

In order to compile and run the OOP version of the program I do:

mpicxx -c SensorNode.cpp -lCGAL -lgmp -frounding-math -o sensor.o &&
mpicxx -c solver.cpp -lCGAL -lgmp -frounding-math -o main.o &&
mpicxx main.o sensor.o -lCGAL -lgmp -frounding-math -o solver_exe &&
mpirun -np # -mca btl sm,self --allow-run-as-root solver_exe

In order to compile and run the Procedural version of the program I do:

mpicxx main.cpp -lCGAL -lgmp -frounding-math -o out_executable &&
mpirun -np # -mca btl sm,self --allow-run-as-root out_executable

Where, in both cases, # is the number of processes. For # > 500 I have to raise the system's maximum number of open files (ulimit -n) as follows (otherwise I get a "too many open files" type of error):

ulimit -n X

The original value of X is 1024; I usually set it to 90000. This is why I have the "--allow-run-as-root" argument in the command lines above.
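
For example, a run with 1000 processes (the process count here is only illustrative) would look like:

ulimit -n 90000 && mpirun -np 1000 -mca btl sm,self --allow-run-as-root solver_exe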

If needed, you can directly contact me at wsg@icomp.ufam.edu.br. I hope this content helps. Cheers.

bosilca commented 5 years ago

If the deterministic order of your application is dictated by the MPI message-ordering constraints (in-order matching), then the only way to break determinism is to violate the FIFO message ordering. This is possible, but highly unlikely.

Being able to see a reproducer would be extremely helpful. If you can share your code with me privately I'll be happy to take a look.
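
For reference, here is a minimal sketch of the kind of message/acknowledgment exchange described in the report, written so that determinism follows only from MPI's per-source, per-tag in-order matching. It is not the reporter's code: the neighbor set (every other rank), the tags, and the payloads are assumptions made purely for illustration.

```cpp
// Minimal sketch (illustrative only): each rank exchanges one message and
// one acknowledgment with every "neighbor" (here, every other rank) and
// counts how many messages it sent and received. Every receive names an
// explicit source and tag, so matching order is fixed by MPI's FIFO
// guarantee and the final counts are deterministic for a fixed -np.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int MSG_TAG = 1;
    const int ACK_TAG = 2;
    int sent = 0, received = 0;

    std::vector<MPI_Request> requests;
    std::vector<int> acks(size, rank); // one ack buffer per peer, kept alive until MPI_Waitall
    int payload = rank;

    // Send the "sensor" message to every neighbor.
    for (int peer = 0; peer < size; ++peer) {
        if (peer == rank) continue;
        MPI_Request req;
        MPI_Isend(&payload, 1, MPI_INT, peer, MSG_TAG, MPI_COMM_WORLD, &req);
        requests.push_back(req);
        ++sent;
    }

    // Receive each neighbor's message and answer with an acknowledgment.
    for (int peer = 0; peer < size; ++peer) {
        if (peer == rank) continue;
        int incoming = 0;
        MPI_Recv(&incoming, 1, MPI_INT, peer, MSG_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        ++received;
        MPI_Request req;
        MPI_Isend(&acks[peer], 1, MPI_INT, peer, ACK_TAG, MPI_COMM_WORLD, &req);
        requests.push_back(req);
        ++sent;
    }

    // Collect the acknowledgments for our own messages.
    for (int peer = 0; peer < size; ++peer) {
        if (peer == rank) continue;
        int ack = 0;
        MPI_Recv(&ack, 1, MPI_INT, peer, ACK_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        ++received;
    }

    MPI_Waitall(static_cast<int>(requests.size()), requests.data(), MPI_STATUSES_IGNORE);
    std::printf("rank %d: sent %d, received %d\n", rank, sent, received);
    MPI_Finalize();
    return 0;
}
```

If a small self-contained case along these lines, built and launched with the same mpicxx/mpirun commands shown above, reproduced the varying message counts on Linux, it would make an ideal reproducer to attach to this issue.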

Willian-Girao commented 5 years ago

No, the deterministic order refers simply to the total number of messages sent between the processes that "can see each other", which is stored in a variable (to my understanding, in a memory location that is only accessible by the owning process). Given that this number depends on the number of communicating sensors (simulated as communicating processes) and on which processes each of them can communicate with, and that this does not change between executions, the only explanation left seems to be something associated with the OS - or with the OOP structure in my case, given that rewriting the code procedurally resolved it and yielded the same results, and that the problem did not occur with both codes on Linux.

bosilca commented 5 years ago

If the counting is incorrect, then either we are dropping messages, or there are threads involved and the counting is not atomic. In both cases it is difficult to assess without a reproducer.
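
To make the second possibility concrete, here is a generic sketch (plain C++ threads, unrelated to MPI and not the reporter's code) of how a shared counter updated without synchronization can end up with a different value on every run, while an atomic counter stays exact:

```cpp
// Illustrative only: two threads each add 1,000,000 to a shared counter.
// The plain int is updated with a data race and frequently ends up below
// 2,000,000 (and differs from run to run); the std::atomic<int> is exact.
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    int plain = 0;
    std::atomic<int> atomic_count{0};

    auto work = [&]() {
        for (int i = 0; i < 1000000; ++i) {
            ++plain;                                              // racy: updates can be lost
            atomic_count.fetch_add(1, std::memory_order_relaxed); // never lost
        }
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();

    std::printf("plain = %d, atomic = %d (expected 2000000)\n",
                plain, atomic_count.load());
    return 0;
}
```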

Willian-Girao commented 5 years ago

Perfect. As I said, once the publication is out I can send the original files over.

ggouaillardet commented 5 years ago

This question was initially asked on Stack Overflow at https://stackoverflow.com/questions/58196889/mpi-program-behaves-differently-on-linux

Note that Ubuntu 18.04 ships Open MPI 2.1.1, not the latest 4.0.1.

You can also try another MPI library (and/or version) such as MPICH and see how things go. If your OOP code also fails on Linux with MPICH, the odds that there is indeed a bug in your code are pretty high.
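
For reference, such a test might look like the following (the package and wrapper names below are assumptions based on the Debian/Ubuntu MPICH packages, which install .mpich-suffixed compiler and launcher wrappers alongside the Open MPI ones):

sudo apt-get install mpich libmpich-dev && mpicxx.mpich main.cpp -lCGAL -lgmp -frounding-math -o out_executable && mpirun.mpich -np # out_executable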

Willian-Girao commented 5 years ago

I was actually the one who posted the question on Stack Overflow. But if you guys are still positive that there's a bug in my code, that's fine; I was just reporting what I observed.

ggouaillardet commented 5 years ago

Please do not misinterpret my previous comments.

I am not saying your conclusion (i.e. that there is a bug in Open MPI) is incorrect (you have not shared a way to evidence the issue, so no conclusion is possible at this stage); I am saying that the way you reached your conclusion is incorrect. I also gave you a way to better assess whether the issue is in your code or in (a given version of) Open MPI.

You are also more than welcome to share a simple piece of code that evidences the issue.

github-actions[bot] commented 7 months ago

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] commented 7 months ago

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!