Interesting. Well, not all versions of MPI support what SST needs to run as a data plane. In particular, the MPI implementation has to support threaded operation and MPI_Open_port(), so that separately launched MPI jobs can connect to each other. MPICH usually allows this; other MPI implementations may not. So you might try another version of MPI if possible.
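If it helps to check whether a particular MPI installation is a candidate, a minimal standalone probe of those two capabilities (just a sketch of mine, not SST code) might look roughly like this:

```cpp
// Sketch: probe the two MPI features SST's MPI data plane relies on --
// MPI_THREAD_MULTIPLE support and MPI_Open_port().
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        std::printf("No MPI_THREAD_MULTIPLE (got thread level %d)\n", provided);

    // MPI_Open_port is what lets independently launched MPI jobs connect.
    char port[MPI_MAX_PORT_NAME];
    if (MPI_Open_port(MPI_INFO_NULL, port) == MPI_SUCCESS)
    {
        std::printf("MPI_Open_port OK: %s\n", port);
        MPI_Close_port(port);
    }
    else
    {
        std::printf("MPI_Open_port failed\n");
    }

    MPI_Finalize();
    return 0;
}
```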
WRT libfabric: it looks like OmniPath has its own libfabric provider, which SST has not been configured to support. (Libfabric providers often support different and sometimes mutually exclusive sets of libfabric features, so in practice code that utilizes libfabric, like our RDMA data plane, has to be customized for each.) It looks like we're finding some provider it recognizes (maybe a partially functional verbs provider?), but then segfaulting. If this is happening, then almost certainly someone with knowledge of SST and libfabric needs to dive into the details, but that's a longer-term process that requires access to that or a similar machine. In the short term, your best bet is to try a different MPI implementation that might support what SST needs.
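To see which providers libfabric actually exposes on a given node (the `fi_info` utility that ships with libfabric reports similar information), a small enumeration program along these lines can help narrow down what SST is picking up; this is only an illustrative sketch, independent of SST:

```cpp
// Sketch: list the libfabric providers visible on this node, to see whether
// an OmniPath provider (e.g. psm2) or only a verbs provider is being found.
#include <rdma/fabric.h>
#include <cstdio>

int main()
{
    struct fi_info *info = nullptr;
    // No hints: ask libfabric for every provider/fabric it can offer here.
    int rc = fi_getinfo(FI_VERSION(1, 5), nullptr, nullptr, 0, nullptr, &info);
    if (rc != 0)
    {
        std::printf("fi_getinfo failed: %d\n", rc);
        return 1;
    }
    for (struct fi_info *cur = info; cur != nullptr; cur = cur->next)
        std::printf("provider: %s, fabric: %s\n",
                    cur->fabric_attr->prov_name, cur->fabric_attr->name);
    fi_freeinfo(info);
    return 0;
}
```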
If that's the case, it's best to make our runs inside a container. But just to confirm: using SST with the MPI transport should, in this case, leverage OmniPath?
I'm going to tentatively say yes, but ultimately it depends on MPI and the situation it's run in. Our MPI data plane ends up doing an MPI_Send/Recv pair to transfer bulk data (control messages go over TCP). We're counting on a properly configured MPI to use whatever RDMA network is available. Generally, on the large-scale HPC machines that ADIOS users care about, a lot of effort has been put into porting and tuning MPI, so that's a good bet. However, the mention of containers worries me. RDMA succeeds by direct access to low-level network hardware, and containers often severely limit access to hardware unless it is specifically allowed, properly configured, run with privileges, etc. So all I can say is that if MPI_Send ends up using OmniPath, yes, you're in good shape. But unfortunately that's beyond the control of ADIOS...
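To make that pattern concrete, here is a rough sketch (my own illustration, not the actual SST source) of the connect/accept plus MPI_Send/MPI_Recv flow described above between two separately launched jobs. Whether the transfer actually rides over OmniPath is decided entirely by the MPI library underneath, not by this code:

```cpp
// Sketch: two independent MPI jobs connect via a published port and move a
// bulk buffer with a plain MPI_Send/MPI_Recv pair, as SST's MPI data plane does.
#include <mpi.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main(int argc, char **argv)
{
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (argc < 2)
    {
        std::printf("usage: %s writer | reader <port>\n", argv[0]);
        MPI_Finalize();
        return 1;
    }

    std::vector<double> bulk(1 << 20, 1.0); // stand-in for an SST data block
    MPI_Comm inter;

    if (std::strcmp(argv[1], "writer") == 0)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Open_port(MPI_INFO_NULL, port);
        std::printf("start the reader with: reader %s\n", port); // SST exchanges this over TCP
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Send(bulk.data(), (int)bulk.size(), MPI_DOUBLE, 0, 0, inter);
        MPI_Close_port(port);
    }
    else // reader: argv[2] holds the writer's port string
    {
        MPI_Comm_connect(argv[2], MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Recv(bulk.data(), (int)bulk.size(), MPI_DOUBLE, 0, 0, inter,
                 MPI_STATUS_IGNORE);
    }

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```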
That is good to know, thanks. We will reinstall ADIOS without libfabric and try.
We are trying to use ADIOS2 SST for RDMA support on [Jean-Zay](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html), which does not have InfiniBand but an OmniPath interconnect used by MPI.
We tried to install ADIOS with libfabric, along with the MPI option set during the install. I am not fully familiar with libfabric (correct me if I'm wrong), but it seems libfabric does not support the OmniPath network and throws a segmentation fault while running a simple SST example. If we want to leverage OmniPath through MPI, do we use the
`DataTransport: MPI`
option? Although, when I did try a run with the MPI option, ADIOS threw an error that it was not able to find MPI. That seemed strange given that it was installed with MPI support.
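In case it clarifies what we mean, setting that parameter from C++ would look roughly like the sketch below (my sketch, assuming an MPI-enabled ADIOS2 build; the engine parameters can also be set through an XML/YAML config file):

```cpp
// Sketch: open an SST writer and ask it to use the MPI data plane
// instead of the libfabric/RDMA one.
#include <adios2.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    {
        adios2::ADIOS adios(MPI_COMM_WORLD);
        adios2::IO io = adios.DeclareIO("SSTWriter");
        io.SetEngine("SST");
        io.SetParameter("DataTransport", "MPI"); // select the MPI transport
        adios2::Engine writer = io.Open("stream.sst", adios2::Mode::Write);
        // ... BeginStep / Put / EndStep loop would go here ...
        writer.Close();
    }
    MPI_Finalize();
    return 0;
}
```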
Running an extended ADIOS2 SST example taken from your repo throws the following errors.