ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

Add specific communicator for neighborhood communication #1588

Open MarcelKoch opened 8 months ago

MarcelKoch commented 8 months ago

This PR adds a communicator that only handles neighborhood all-to-all communication. It implements the new collective_communicator interface, which provides different implementations for a selected set of collective MPI routines. Currently, this only includes the non-blocking all-to-all.

The communication uses a fixed pattern, i.e. the send/recv sizes are fixed when the neighborhood communicator is constructed. I would have liked to decouple that, but this would require some knowledge of how the sizes are stored at the interface level. If someone has an idea for that, please let me know.
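The fixed-pattern idea could be sketched roughly as follows. This is a hypothetical illustration, not Ginkgo's actual API: the class name, members, and the way sizes are stored are all assumptions. The point is only that the send/recv sizes are bound at construction time, so every later exchange reuses that pattern.

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Illustrative sketch (names are hypothetical, not Ginkgo's API): the
// per-neighbor send/recv sizes are fixed when the communicator is
// constructed, and all later all-to-all calls reuse this stored pattern.
class fixed_pattern_communicator {
public:
    fixed_pattern_communicator(std::vector<int> send_sizes,
                               std::vector<int> recv_sizes)
        : send_sizes_(std::move(send_sizes)),
          recv_sizes_(std::move(recv_sizes))
    {}

    // Total number of elements this rank will receive, derived from the
    // pattern that was stored at construction time.
    int recv_size() const
    {
        return std::accumulate(recv_sizes_.begin(), recv_sizes_.end(), 0);
    }

private:
    std::vector<int> send_sizes_;
    std::vector<int> recv_sizes_;
};
```

Decoupling the pattern from the communicator would mean the interface needs to know how these size vectors are represented, which is the open question raised above.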

This is the first part of splitting up #1546.

The neighborhood all-to-all has a bug in OpenMPI < v4.1.0, which makes it necessary to disable the neighborhood communicator in this case. As a replacement, there is also a dense all-to-all communicator.
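The version gate could look something like the sketch below. This is illustrative only: the function name is made up, and obtaining the version triple (e.g. by parsing the output of MPI_Get_library_version, or via compile-time macros) is assumed to happen elsewhere.

```cpp
#include <tuple>

// Hypothetical helper: the neighborhood communicator is only enabled for
// OpenMPI >= 4.1.0; older versions fall back to the dense all-to-all.
// How the (major, minor, patch) triple is obtained is not shown here.
bool neighborhood_supported(int major, int minor, int patch)
{
    // Lexicographic tuple comparison handles e.g. 4.0.7 < 4.1.0 correctly.
    return std::make_tuple(major, minor, patch) >= std::make_tuple(4, 1, 0);
}
```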

Todo:

PR Stack:

MarcelKoch commented 4 months ago

"Looks good to me. But there is an issue with OpenMPI that needs to be resolved. It seems that OpenMPI versions < 5.0.x still have this bug."

I'm not sure what you mean here. I already have this workaround to use non-null pointers, which should fix the issue.
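A workaround of this kind could be sketched as below. This is a guess at the shape of the fix, not the PR's actual code: some MPI implementations validate buffer pointers even when every send/recv count is zero, so a dummy non-null address is substituted for a null one. The helper name is hypothetical.

```cpp
// Illustrative sketch of a "non-null pointers" workaround (hypothetical
// helper, not taken from the PR): if the buffer would be nullptr because
// there is nothing to send or receive, hand the MPI call a valid dummy
// address instead, so implementations that reject null buffers are happy.
template <typename T>
T* non_null(T* ptr)
{
    static T dummy{};
    return ptr ? ptr : &dummy;  // counts are zero, so dummy is never read
}
```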

pratikvn commented 4 months ago

The null test is still failing here. I thought that was because of the null arguments?

MarcelKoch commented 4 months ago

Thanks for bringing that up. I guess I haven't really looked at the tests in quite a while. Just a remark on the OpenMPI issue you linked: that one is for the persistent all-to-all-v call, which I'm not using (yet), so it shouldn't be relevant to the test failures.

MarcelKoch commented 4 months ago

So right now, I'm leaning towards just deactivating the neighborhood communicator if the OpenMPI version isn't sufficient. The CollectiveCommunicator interface is meant to provide different implementations of the all-to-all call, so using a dense all-to-all here feels odd. I would rather have a derived class that only does the dense all-to-all and use that instead.

pratikvn commented 4 months ago

Yes, but the fix is missing in older versions of 4.0.x as well. I am not entirely sure which exact version of OpenMPI our container uses, but it is 4.0.x, and 4.0.7 is still missing the fix. I think the fix was added only in 4.1.x.

But in general, I agree. I think we can just disable the neighborhood communicator for older versions of OpenMPI.

MarcelKoch commented 1 month ago

@upsj I think the moved-from object should be using MPI_COMM_NULL as its communicator. Then trying to launch a communication will fail quite badly.
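The moved-from behaviour described above can be sketched without MPI as follows. This is a stand-in illustration: an `int` handle plays the role of `MPI_Comm` and `-1` the role of `MPI_COMM_NULL`; the class name is made up. The idea is that a move resets the source to the null communicator, so a later communication through it fails immediately rather than touching a handle it no longer owns.

```cpp
#include <utility>

// Illustrative stand-in for a communicator wrapper (not Ginkgo's code):
// int replaces MPI_Comm, and null_handle replaces MPI_COMM_NULL. The move
// constructor leaves the source holding the null communicator, so using a
// moved-from object to communicate fails loudly instead of silently.
class comm_wrapper {
public:
    static constexpr int null_handle = -1;  // stand-in for MPI_COMM_NULL

    explicit comm_wrapper(int handle) : handle_(handle) {}

    comm_wrapper(comm_wrapper&& other) noexcept
        : handle_(std::exchange(other.handle_, null_handle))
    {}

    int get() const { return handle_; }

private:
    int handle_;
};
```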