ornladios / ADIOS2

Next generation of ADIOS developed in the Exascale Computing Program
https://adios2.readthedocs.io/en/latest/index.html
Apache License 2.0

BP4 does not implement the BP3 "async file open" #2020

Open eisenhauer opened 4 years ago

eisenhauer commented 4 years ago

The BP3 engine, like ADIOS1, had a feature where one could "pre-Open" a file, which helped protect applications from long delays on machines where a system-level file open might take a long time. Specifically, in BP3, the actual ADIOS2 Open() call for writers would spawn an asynchronous task for the system-level open and avoid using the system file for the duration of ADIOS Open, so that ADIOS Open could return before the system-level open() completed. The BP3 writer would join the async task before the first write() that actually used that system-level resource, but between the Open and the first write the application was free to do other things, essentially hiding the latency of the system-level open. However, the addition of index files in BP4 changed the nature of Open, and it no longer refrains from touching the system-level resource. This means that in BP4 it is not possible to pre-Open an ADIOS file and hide the latency of the system-level open() call. Instead, Open blocks internally and waits for the system open to succeed so that it can write to the index files. The pre-Open functionality should be implemented in BP4 as it is in BP3. (See discussion in PR #2007.)
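For readers unfamiliar with the pattern, here is a minimal sketch of a deferred open: the constructor launches the system-level open on another thread and returns immediately, and the first write joins that task. This is an illustration only, not the BP3 source. The class name `DeferredOpenWriter` and its members are hypothetical, and plain `std::fopen` stands in for whatever transport performs the real open.

```cpp
#include <cassert>
#include <cstdio>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical writer illustrating the deferred-open pattern only;
// this is NOT the ADIOS2 BP3 implementation.
class DeferredOpenWriter
{
public:
    // Returns as soon as the async task is launched; the (possibly slow)
    // system-level open runs on another thread.
    explicit DeferredOpenWriter(const std::string &path)
    : m_Pending(std::async(std::launch::async, [path]() {
          return std::fopen(path.c_str(), "wb");
      }))
    {
    }

    ~DeferredOpenWriter()
    {
        try
        {
            Join(); // make sure the open task has finished
        }
        catch (...)
        {
            // open failed; nothing to close
        }
        if (m_File != nullptr)
        {
            std::fclose(m_File);
        }
    }

    // First use of the system-level resource: join the open task here.
    std::size_t Write(const std::string &data)
    {
        Join();
        assert(m_File != nullptr);
        return std::fwrite(data.data(), 1, data.size(), m_File);
    }

private:
    // Waits for the open to complete the first time it is called;
    // later calls are no-ops because the future is no longer valid.
    void Join()
    {
        if (m_Pending.valid())
        {
            m_File = m_Pending.get();
            if (m_File == nullptr)
            {
                throw std::runtime_error("system-level open failed");
            }
        }
    }

    std::future<std::FILE *> m_Pending;
    std::FILE *m_File = nullptr;
};
```

Any work the application does between constructing the writer and the first `Write` overlaps with the open, which is the latency hiding described above. If the constructor itself had to write a header into the file, it would have to join the task immediately and the overlap would be lost.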

eisenhauer commented 4 years ago

@pnorbert The inability to open BP4 files asynchronously is effectively a feature loss. Should fixing this be considered before the release?

pnorbert commented 4 years ago

BP4 requires first creating a directory, on which every aggregator depends. Then each aggregator creates one data file (which is threaded). Rank 0, however, creates 3 files. These three files are created by three separate transport managers, so it does consume 3 threads/futures to create them.

The mkdir is implemented by KWSys and is called by the transport manager (in sync mode). Then the opens are implemented by the individual transports with futures. The transports don't know anything about directories and assume the file creations will succeed, so the directory must be there before calling them.
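A rough sketch of that ordering, assuming C++17: the directory is created synchronously, and only then is one asynchronous open launched per file. The helper name `OpenInDirectory` is hypothetical, and `std::filesystem` and `std::fopen` are stand-ins for KWSys and the ADIOS2 transports, which are not shown here.

```cpp
#include <cstdio>
#include <filesystem>
#include <future>
#include <string>
#include <vector>

// Hypothetical helper, not ADIOS2 code: create the directory synchronously,
// then launch one asynchronous open per file name inside it.
std::vector<std::future<std::FILE *>>
OpenInDirectory(const std::filesystem::path &dir,
                const std::vector<std::string> &names)
{
    // Blocking step: every file open below assumes the directory exists.
    // This overload throws std::filesystem::filesystem_error on failure.
    std::filesystem::create_directories(dir);

    std::vector<std::future<std::FILE *>> pending;
    pending.reserve(names.size());
    for (const auto &name : names)
    {
        const std::string full = (dir / name).string();
        // One task per file; the caller joins each future before first use.
        pending.push_back(std::async(std::launch::async, [full]() {
            return std::fopen(full.c_str(), "wb");
        }));
    }
    return pending;
}
```

The caller still pays for the blocking directory creation before any open can be issued, which is the dependency described above.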

@eisenhauer mentions in this issue that the BP4 engine writes a header into the md.idx file created by the rank 0 process, effectively forcing a wait on creating md.idx. The purpose of this is to allow readers to "connect" at open time instead of waiting until the first step is written, and therefore to behave like SST/SSC with regard to open timeouts and connection behavior.

This piece can be refactored by creating another, empty file to signal the active state instead of the header, but it is unclear to me how much benefit we would see once every process still has to create the directory itself and block on it.
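One way such a marker could look, as a sketch only: the writer creates an empty file when the output becomes active, and a reader checks for its existence. The file name `active.flag` and the three function names are hypothetical; nothing here reflects an existing ADIOS2 convention.

```cpp
#include <filesystem>
#include <fstream>

// Hypothetical marker file name, chosen for illustration only.
inline std::filesystem::path ActiveMarker(const std::filesystem::path &dir)
{
    return dir / "active.flag";
}

// Writer side: create an empty file to signal that the output is active.
// The directory must already exist, otherwise the open fails.
inline bool MarkActive(const std::filesystem::path &dir)
{
    std::ofstream marker(ActiveMarker(dir));
    return marker.good();
}

// Reader side: the output counts as active while the marker exists.
inline bool IsActive(const std::filesystem::path &dir)
{
    return std::filesystem::exists(ActiveMarker(dir));
}

// Writer side, at close: remove the marker.
inline void MarkInactive(const std::filesystem::path &dir)
{
    std::filesystem::remove(ActiveMarker(dir));
}
```

Note that `MarkActive` fails if the directory does not exist yet, so this variant still depends on the blocking directory creation.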

A larger refactoring of the transports/managers and the BP4 engine is needed to perform this opening act, which is orchestrated by many processes (the aggregators) and involves creating a directory that everyone needs. A further complication is that the BP4 engine supports multiple transports for one output, so every transport call passes a vector of names. It is already mentally difficult to follow the code as it is. Therefore, I am wary of doing such a refactoring for the next release.

eisenhauer commented 4 years ago

Ugh. I agree. This was a relatively simple feature to support with BP3, but BP4 looks to be an entirely different level of complexity... For now it is best left as a future exercise...