ornladios / ADIOS

The old ADIOS 1.x code repository. Look for ADIOS2 for new repo
https://csmd.ornl.gov/adios

Providing a clean, compatible NOMPI API #183

Open ax3l opened 6 years ago

ax3l commented 6 years ago

Hi ADIOS team,

Over the last few weeks I came across an MPI design issue in ADIOS1 that is quite troublesome downstream when using the serial (_nompi) API.

The problem is the separation into two libraries that define the same public API (and ABI) with the same symbols but different functionality. The two libraries are incompatible with each other and cannot be linked at the same time without running into trouble.

Currently, for the nompi interface, ADIOS just "mocks" MPI via mpidummy.h, mapping it to regular filesystem, memory access and copy operations. This is nice if one knows from the beginning that the final app will use either MPI or serial only. But that is not a compile-time decision; it is a user decision at runtime. As a result, a shipped "parallel" ADIOS library can still use the "serial" API (by passing MPI_COMM_NULL), but the final app must be started in an MPI_Init context. Urgh.

Why not just link the _nompi lib?

The _nompi library and the regular "parallel" library cannot be linked at the same time (short of manual handling with dlopen), since they (unnecessarily) define the same public API and ABI but with different implementations. This forces every downstream project to keep the "separation of libraries", so the choice always remains a compile-time decision, which is super inconvenient for package managers, high-level libraries, dependencies, etc.

There is no reason not to link an "MPI-powered" library (with an additional serial API) in a serial context as well. The library should simply never call "MPI_" functionality unless it is passed a valid communicator of some kind.
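To make the idea concrete, here is a minimal sketch of such a guard. It is not ADIOS code: demo_comm, DEMO_COMM_NULL, demo_io and demo_io_init are hypothetical names, and the function-pointer struct only stands in for a real MPI_Comm so the example compiles without MPI installed. In an actual library the parallel branch would call MPI_Comm_rank and MPI_Comm_size, and the null check would compare against MPI_COMM_NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a communicator handle (assumption, not ADIOS API).
 * A real library would take an MPI_Comm here instead. */
typedef struct {
    int (*rank)(void *state); /* would wrap MPI_Comm_rank in a real build */
    int (*size)(void *state); /* would wrap MPI_Comm_size in a real build */
    void *state;
} demo_comm;

/* NULL plays the role of MPI_COMM_NULL in this sketch. */
#define DEMO_COMM_NULL ((const demo_comm *)NULL)

/* Hypothetical per-handle library state. */
typedef struct {
    const demo_comm *comm;
    int rank;
    int size;
} demo_io;

/* Single entry point for both modes: the parallel backend is only
 * touched when the caller passes a valid communicator. */
static int demo_io_init(demo_io *io, const demo_comm *comm)
{
    assert(io != NULL);
    io->comm = comm;
    if (comm == DEMO_COMM_NULL) {
        /* Serial path: no call into the parallel backend at all. */
        io->rank = 0;
        io->size = 1;
        return 0;
    }
    /* Parallel path: safe to query the communicator. */
    io->rank = comm->rank(comm->state);
    io->size = comm->size(comm->state);
    return 0;
}

/* Counting fake backend, only here to show when backend calls happen. */
static int demo_backend_calls = 0;
static int fake_rank(void *s) { (void)s; demo_backend_calls++; return 3; }
static int fake_size(void *s) { (void)s; demo_backend_calls++; return 8; }
```

With this shape, one library exposes one set of symbols, and whether the parallel backend is used is decided per call at runtime rather than by which library was linked.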

What ADIOS is basically doing is shipping an MPI mock library. Mocking functionality is great for development, debugging and testing but causes the symbol issues explained above in production.

Proposed Refactoring

Could you maybe either:

How to reproduce the issue

cc @pnorbert @jychoi-hpc @isosc