ornladios / ADIOS2

Next generation of ADIOS developed in the Exascale Computing Program
https://adios2.readthedocs.io/en/latest/index.html
Apache License 2.0

Make Engine::Flush collective and use it to allow dumping data to disk during a step #2649

Closed franzpoeschel closed 2 years ago

franzpoeschel commented 3 years ago

Related issues: #1891 and #2629.

Suggestion: The call Engine::Flush() should be made MPI-collective, and its effect should be to drain data currently held in ADIOS buffers to disk. In streaming engines, this call would either have no effect or throw an exception.

Why is this feature important? Multiple reasons:

  1. This is to some extent what the documentation for Engine::Flush promises to do anyway:

    void Flush(const int transportIndex = -1): Manually flush to underlying transport to guarantee data is moved

    (from here)

  2. As described in this post, the current behavior of Engine::Flush is somewhat defective. Its current effect is that all data written up to Engine::Flush in one step is thrown away and only the data between Engine::Flush and Engine::EndStep will appear in the written file. I don't know if this is somehow intended, but I can't think of a use for it.
  3. More fine-grained control over memory usage by ADIOS: Currently, the only ways to make ADIOS drain its buffers to disk are by calling Engine::EndStep or by Engine::Close. This makes it hard to control memory usage within one ADIOS step. Examples are
    1. PIConGPU: The IO routines of PIConGPU cannot present all of the data to be written in one step to ADIOS at a single point in time, but rather feed data to ADIOS bit by bit over several iterations of Engine::PerformPuts. If the correct InitialBufferSize is not specified, this leads to repeated resizing of ADIOS buffers, with negative effects on memory usage, as outlined in #2629. Also, the amount of data written in one step depends on user input and varies over steps, so the size of a step should be considered unbounded. By allowing intermediate Engine::Flush calls, we could put a bound on the amount of data held in ADIOS at any point (see the sketch after this list). The InitialBufferSize would become easier to specify and would not require predicting the largest step to come in a simulation.
    2. Simulations with complicated write patterns such as HiPACE (ping goes out to @ax3l). In this simulation, each parallel process will write a different output step at the same time, making it impossible to use ADIOS steps that need to be synchronized. In openPMD, we have ways to model steps independently of ADIOS, so that is not a problem in itself. But it is impossible for such writers to find an opportune point for flushing data to disk with ADIOS. Since Engine::Flush doesn't require the same level of synchrony, this would become a lot easier.
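To make the intended usage concrete, here is a minimal sketch of the write pattern this request targets (C++ API). The mid-step drain call is hypothetical at this point and only illustrates the requested semantics; selection handling is omitted for brevity.

#include <adios2.h>
#include <vector>

// Sketch only: feed data to ADIOS piece by piece within one step and bound
// the memory held internally.
void writeOneStep(adios2::Engine &engine, adios2::Variable<double> &var,
                  const std::vector<std::vector<double>> &pieces)
{
    engine.BeginStep();
    for (const auto &piece : pieces)
    {
        engine.Put(var, piece.data(), adios2::Mode::Deferred);
        engine.PerformPuts(); // today: copies into internal ADIOS buffers
        // engine.Flush();    // proposed: collectively drain buffers to disk
    }
    engine.EndStep();
}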

What is the potential impact of this feature in the community? Simulations with more diverse write patterns would be able to use ADIOS and have more controlled memory usage. ADIOS is optimized for writers with synchronous steps, where one step fits fully into memory (per process) and data can be presented in a clear two-phase workflow of (1) declare variables and enqueue Put() operations and (2) perform the enqueued operations. This feature request would make it easier to use ADIOS in applications that do not fulfill these assumptions.

Is your feature request related to a problem? Please describe. Already outlined above.

Describe the solution you'd like and potential required effort Solution already described above. For the required effort, I am not sufficiently aware of the BP format intrinsics. I imagine the challenging points would be (1) draining the buffer several times within one step and (2) appending to a written step in the BP file. Also, this would be slightly API-breaking, since it changes the behavior of Engine::Flush.

Describe alternatives you've considered and potential required effort This feature request achieves two goals:

  1. Use less memory than the step size in ADIOS: No alternative, currently impossible.
  2. Avoiding memory spikes by resizing: Specifying the InitialBufferSize correctly. Needs to be either done by the user (bad UX, bad performance by default) or programmatically (requires duplication of logic in PIConGPU, only feasible if writing a new ADIOS file per step, otherwise the size of the largest step to come needs to be computed).

Additionally, according to @pnorbert, the new serializer coming in late 2021 will address the inefficient buffer resizing of the current serializer(s), so that will be a more permanent solution. This feature request would not become obsolete at that point, since point (1) will remain relevant.

eisenhauer commented 3 years ago

Just wanted to give a textual thumbs-up to this feature request. We'll keep it in mind with the new serializer design...

eisenhauer commented 3 years ago

So, @franzpoeschel we have this basic functionality in the BP5 Writer engine now. That is, in the middle of a timestep BP5 has the ability to move to disk all the data that has already been subject to a Put(). This can happen multiple times in a single step, emptying internal data buffers and also capturing all the data in Deferred Put()s. But BP5 works differently than BP3/4, so we've been discussing how to best expose the functionality. There are really three choices: 1) make it triggered by some form of the existing Flush() functionality, 2) make PerformPuts() do it (which the name kind of implies), or 3) create a new interface like FlushData() that does it. I lean towards 2, for reasons I'll describe, but I think it's not a slam dunk, so we thought we'd ask your input.

First, the problem with 1, using the existing Flush(), is that the existing Flush() really is more to push whole timesteps to disk when they have been queued due to timestep aggregation. That is, it is best used between timesteps, not in the middle of one, so we'd really be changing how it is used as well as what it does.

I think PerformPuts() is a better bet, though. Currently in BP3/4, what it does is copy data from Deferred puts into internal ADIOS buffers. This isn't terrible, because BP3/4 eventually copy all data into internal buffers before moving it to disk. However, that isn't true of BP5. There we are trying to use vector I/O to do zero copy on a deferred Put() when moving its data to disk. So keeping the old semantics of PerformPuts(), where it just copies deferred data internally so you can reuse those buffers, is kind of silly. If you wanted to be able to reuse those buffers, you could have just done a Sync Put(). So I tend to think it makes more sense for PerformPuts() (in the BP5 writer only) to actually perform the puts, that is, move data to disk. This involves making it MPI-collective, which it currently is not, but the name really makes sense for what it actually is doing, at least to me.

Of course, the third alternative is to just create some new method in the interface and leave both of those older interfaces alone. FlushData() is one candidate name as reasonably descriptive, but I'm open to others as well. Anyhow, your thoughts?
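For reference, a minimal sketch of the two Put() flavors contrasted above (public C++ API; variable selection and data layout are omitted):

#include <adios2.h>
#include <vector>

void illustratePutModes(adios2::Engine &engine, adios2::Variable<double> &var,
                        std::vector<double> &buf)
{
    // Sync: the engine consumes the data before Put() returns, so buf can be
    // reused immediately.
    engine.Put(var, buf.data(), adios2::Mode::Sync);

    // Deferred: only the pointer is recorded; buf must stay valid until
    // PerformPuts() or EndStep(). In BP3/4, PerformPuts() copies the data into
    // internal buffers; the proposal above is that in BP5 it would instead
    // move all queued data to disk, which makes it collective.
    engine.Put(var, buf.data(), adios2::Mode::Deferred);
    engine.PerformPuts();
}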

franzpoeschel commented 3 years ago

Hello @eisenhauer, sorry for the late reply, I was on vacation last week. It's great to hear that this is coming around now :)

I wasn't really aware of these semantics for Engine::Flush(). Is this documented somewhere or is there an example for how to use this? But in this case, it might be a bad candidate for exposing this functionality, yes.

So far, our workflow in the ADIOS2 backend of openPMD is to use Deferred Putting for everything, and then to use either PerformPuts (within a step) or EndStep (at the end of a step) to run the operations and to free user buffers. I like the suggestion to include the new functionality within PerformPuts as it seems to fit well in our workflow and might lead to a solution that would not require lots of changes in openPMD and user code. What makes me a bit hesitant is the idea to make an API call collective that was formerly not collective.

I have some questions:

  1. Would PerformPuts still be available for the other backends, with the old behavior?
  2. When would PerformPuts be collective? Always? Only when opening a new step? Something else?
  3. Relating to the previous question, is it possible to give PerformPuts the semantics: "Process user buffers and give them back to the user. What happens with data afterwards is engine specific, some (BP4) may copy data to internal buffers, others (BP5) may write data to disk right away."

I think my preference really depends on the answer to question 2. If the answer is "always collective", I have a feeling that this would break compatibility with code using PerformPuts noncollectively. Maybe a flag PerformPuts(Mode::Collective)?

eisenhauer commented 3 years ago

RE: Documentation on Engine::Flush(): What, a comment in the header file isn't enough for you? Or the IO::FlushAll() and ADIOS::FlushAll()? Unfortunately they aren't documented per se, and I only came to know about conflicts of usage when I ran into usages of Flush in the tests...

WRT your questions:

  1. Would PerformPuts still be available for the other backends, with the old behavior? Yes, the semantics of PerformPuts wouldn't change for other engines. In BP3/4 it would always copy to internal buffers. (The current staging engines always copy to contiguous internal data buffers at the time of Put(), so PerformPuts is a no-op.)
  2. When would PerformPuts be collective? Always? Only when opening a new step? Something else? It would remain non-collective for other engines, but we can only write data to files in a collectively-coordinated way, so it would always be collective for BP5.
  3. Relating to the previous question, is it possible to give PerformPuts the semantics: "Process user buffers and give them back to the user. What happens with data afterwards is engine specific, some (BP4) may copy data to internal buffers, others (BP5) may write data to disk right away." Yes, that's essentially what I envision the semantics being. As a side effect, it may end up freeing some internal ADIOS buffers as well (those which are holding data that was provided to Put(Sync), for example). That would only be visible via memory-use monitoring, so maybe not worth documenting.

I think there are two downsides to making this functionality available via PerformPuts(). One is that some code that we know of always does PerformPuts() right before EndStep(). That is completely innocuous for BP3/4, and in BP5 has the side effect of slightly increasing the metadata size (unless I figure out a way to optimize out that extra data). The second is that yes, should anyone be doing PerformPuts in such a way that making it collective would break them, then they'd have to modify that before using BP5. Whether or not anyone might be doing that out there is unknown, and really we can't raise an exception or anything. They'd find out when they had some of their ranks hanging in PerformPuts() waiting for other calls. But maybe that's clear enough... If not, then we'd have to introduce some new call and only suitably modified code would benefit from this.

franzpoeschel commented 3 years ago

So, in conclusion, putting that functionality in PerformPuts makes sense semantically, at the cost of maybe breaking compatibility with old code. I'll put this point on the agenda for this evening's collaboration meeting so we can discuss how we should go forward.

franzpoeschel commented 3 years ago

Conclusion from the VC:

The chance of breaking compatibility is too large to put this functionality behind Engine::PerformPuts(). The openPMD-api currently uses this call noncollectively. If a user installs a new version of ADIOS2 together with an old version of openPMD, this may result in hanging applications due to the changed meaning of PerformPuts().

The current meaning of PerformPuts() is similar to what this feature request suggests, but the two differ in performance: the new feature is a collective IO operation. In an IO framework such as ADIOS2, it is justified to make such a difference visible in the API. Due to the semantic similarity, it is still justified to put this behavior into PerformPuts(), so an extra flag seems appropriate.

Mockup suggestions:

enum class PutMode{ NonCollective, Collective };
void PerformPuts( PutMode mode = PutMode::NonCollective );

ax3l commented 3 years ago

Thanks for the summary! :)

Just naming-wise, I am not sure we should call this "collective"/"non-collective". What we try to express is whether we flush the buffers to disk (at a performance cost).

This could be done as in your example, but describing the functionality instead:

enum class BufferDraining{ Lazy, Immediately };
void PerformPuts( BufferDraining mode = BufferDraining::Lazy );

or as separate API calls:

PerformPuts(); // Non-Collective
PerformPuts(); // Non-Collective
PerformPuts(); // Non-Collective
// ...
DrainBuffers(); // Collective

Personally, I slightly favor a separate API call, since that one does not change the MPI behavior of PerformPuts(). Also, a PerformPuts(BufferDraining::Immediately) would flush the buffers for itself and all previous puts, which is a bit obfuscated compared to a separate API call.

Either way, the same thing can be expressed, though.

germasch commented 3 years ago

To add my $.02:

I think it's important to know which API calls are collective vs. those that are not. Ideally, this should not depend on the underlying engines, since one of the strengths of ADIOS2 is the choice of engines that can be easily switched at run-time, but if that switching then leads to surprising hangs, it's not really worth that much. On the naming, I'd be kinda in favor of following the HDF5 terminology, i.e., independent vs. collective.

It is fundamentally fine to me to change the collective-ness of a given API call based upon an argument that declares whether it's called collectively or not.

What I think is not great is to mingle collectiveness with differing semantics. If PerformPuts is to ensure that the pointers/buffers previously passed to Put() can be reused after its return, that's one semantic issue, and I think it should keep this meaning.

That's different from the semantics of ensuring that data actually makes it to disk, so I think a different API call is in order for that. This latter call (e.g., DrainBuffers) may have its own collective-ness requirements, and ideally so consistently across engines, but that's secondary IMO.

franzpoeschel commented 2 years ago

If you see it justified to give this its own call, I'm also open to that, it's your API after all ;)

For us, the important part is not to break compatibility, which failing to distinguish collective from non-collective PerformPuts() calls would do.

williamfgc commented 2 years ago

FYI, EndStep calls Flush internally, see here. Flush does local data flushing, whereas metadata in BP3/BP4 is collective by default (which makes flushed data unusable until proper collective metadata writing). Only in BP3 can we skip collective metadata writing (it's expensive and done mostly for checkpoint-restart, with adios1.x bpmeta used for reconstruction); that's the trade-off of BP4, as metadata is only "appended" at the cost of EndStep being collective.

One of the reasons Flush is not fully advertised is that it's one of those internal functions exposed publicly because an application required it, and not all engines are compliant with it.

I'd just keep PerformPuts as-is (it's already being used in production with defined semantics across all engines), and reuse/rethink Flush to allow the option of making it collective without ending a step (the challenge would be to write partial step metadata, although the BP format is flexible in that sense). EndStep is well layered, so it shouldn't be difficult to reuse existing functionality; the main effort would be testing. Hope it helps.

ax3l commented 2 years ago

Thanks for the details. Yes, our problem can be summarized as we have workflows where we cannot call EndStep early enough to not run out of memory.

williamfgc commented 2 years ago

@ax3l sounds like your workflows gotta Flush :)

franzpoeschel commented 2 years ago

First results of this in PIConGPU

In this test run, this allowed saving ~15Gb of peak main memory (i.e. the entire InitialBufferSize) at the cost of increased runtime. BP5 still seems noticeably slower than BP4, even without this feature, but I have not really tuned things for BP5 yet.

Thanks a lot for implementing this! This will be a very important feature for us on Frontier because of that system's rather limited host memory (in comparison to its GPU memory).

eisenhauer commented 2 years ago

@franzpoeschel Just wanted to make sure you saw ADIOS2 Issue #3060 and PR #3078. Ended up not sticking with the Flush() name because of the conflict with prior functionality. So now it's PerformDataWrite(), and it is indeed a no-op on any engine except BP5 file writer.
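For readers following along, a minimal sketch of how the new call slots into a mid-step loop, assuming the argument-less PerformDataWrite() from that PR (selection handling omitted):

#include <adios2.h>
#include <vector>

// Sketch: bound memory within one step; PerformDataWrite() is a no-op for
// engines other than the BP5 file writer.
void writeStepBounded(adios2::Engine &engine, adios2::Variable<double> &var,
                      const std::vector<std::vector<double>> &pieces)
{
    engine.BeginStep();
    for (const auto &piece : pieces)
    {
        engine.Put(var, piece.data(), adios2::Mode::Deferred);
        engine.PerformDataWrite(); // collective: drain buffered data to disk
    }
    engine.EndStep();
}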

franzpoeschel commented 2 years ago

@franzpoeschel Just wanted to make sure you saw ADIOS2 Issue #3060 and PR #3078. Ended up not sticking with the Flush() name because of the conflict with prior functionality. So now it's PerformDataWrite(), and it is indeed a no-op on any engine except BP5 file writer.

Huh? I used Flush() for implementing this and it worked? But I can change it to PerformDataWrite() to be on the safe side, thanks for the note!

eisenhauer commented 2 years ago

The change to PerformDataWrite() just got merged into master days ago (in anticipation of today's pre-release feature freeze).

franzpoeschel commented 2 years ago

Ah, got it. Then I'll change it, thank you :)

pnorbert commented 2 years ago

@franzpoeschel If you use Sync Puts: the BP5 chunk size is 16MB by default, so increasing this with the parameter BufferChunkSize would decrease the number of writes necessary per process. Go up to 2GB if possible.
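For reference, a sketch of how that parameter would be set on the IO object; the IO name and the exact byte value are illustrative, not recommendations:

#include <adios2.h>

adios2::IO configureBP5(adios2::ADIOS &adios)
{
    adios2::IO io = adios.DeclareIO("fieldsIO"); // hypothetical IO name
    io.SetEngine("BP5");
    // Larger chunks mean fewer writes per process; the default is 16MB.
    io.SetParameter("BufferChunkSize", "2000000000"); // just under the 2GB limit
    return io;
}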

franzpoeschel commented 2 years ago

Thank you for this hint @pnorbert, this made a noticeable difference: https://github.com/ComputationalRadiationPhysics/picongpu/pull/4002#issuecomment-1055500122

pnorbert commented 2 years ago

I don't really understand those memory graphs, please explain them at our next meeting. I especially worry about setup 3, where the chunk list grows to 68GB, while it was kept in check in all other runs. I don't understand what is going on there.

franzpoeschel commented 2 years ago

I don't really understand those memory graphs, please explain them at our next meeting. I especially worry about setup 3, where the chunk list grows to 68GB, while it was kept in check in all other runs. I don't understand what is going on there.

Sure, we can talk about it in more detail. But in short, there's no need to worry about figure 3, as the behavior there is exactly the expected behavior, because I did not use the PerformDataWrite() feature in that run. So the memory builds up until the end of the step and then everything is flushed at once. This increases memory usage, but saves a little bit of time by only doing IO once.

Being able to go from (3) to (4) was exactly the point of this issue.

franzpoeschel commented 2 years ago

Ah, I see now that it grows extremely high. I assume that this might be the fault of setting the BufferChunkSize to 2GB? I believe that Heaptrack tracks memory by hijacking malloc, so this shows the virtual memory usage. I should check whether I can find out anything about real memory usage, which should be lower since most of the allocated memory is not actually used.

franzpoeschel commented 2 years ago

I confirmed the above now, see here

pnorbert commented 2 years ago

Thank you for confirming this. How many processes was this run with?

franzpoeschel commented 2 years ago

This was a single process for now; I have not tested this for scalability yet. Also, do I understand correctly from the Heaptrack output that a new chunk is created for every single Put() in BP5?

pnorbert commented 2 years ago

It should not do that, unless every Put passes data with a size between 1/2 and 1x the chunk size. It should pack multiple puts into one chunk. When it cannot pack any more puts into the chunk, it downsizes (realloc) that chunk to its final size and then allocates a new chunk. Flush empties the buffer, and the next Put starts again from scratch (allocating a new chunk).
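For clarity, a much-simplified sketch of the packing behavior described above; this is an illustrative model, not the actual BP5 serializer code:

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Simplified model: pack puts into fixed-size chunks, downsize (realloc) a
// full chunk to its final size, then start a new one. flush() empties the list.
struct ChunkBuffer
{
    std::vector<char *> chunks;
    std::size_t chunkSize;
    std::size_t usedInLast = 0;

    explicit ChunkBuffer(std::size_t size) : chunkSize(size) {}

    void pack(const char *data, std::size_t n)
    {
        if (chunks.empty() || usedInLast + n > chunkSize)
        {
            if (!chunks.empty())
            {
                // Downsize the previous chunk to its final size.
                chunks.back() = static_cast<char *>(
                    std::realloc(chunks.back(), usedInLast));
            }
            // Oversized puts get a chunk of their own size.
            chunks.push_back(
                static_cast<char *>(std::malloc(std::max(chunkSize, n))));
            usedInLast = 0;
        }
        std::memcpy(chunks.back() + usedInLast, data, n);
        usedInLast += n;
    }

    void flush()
    {
        for (char *c : chunks)
            std::free(c);
        chunks.clear();
        usedInLast = 0;
    }
};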

franzpoeschel commented 2 years ago

Then I'm a bit confused by the overallocation seen in this benchmark (left-hand side): [screenshot from 2022-03-01]

Could this be triggered by running PerformPuts() frequently (which we do in PIConGPU)?

pnorbert commented 2 years ago

Do you use deferred Put(), and then call PerformPuts()? Looking at the buffering code, I think it is buggy in this case: it allocates a brand new chunk for each deferred pointer (during PerformPuts) and never downsizes the previous buffer. So yes, this is a bug.

franzpoeschel commented 2 years ago

Do you use deferred Put(), and then call PerformPuts()?

Yes. So I guess that this is exactly the reason. Should we switch to Sync Put() in BP5? (Put modes were something that I wanted to discuss in the next meeting anyway) Is there any reason not to use Sync Put()?

pnorbert commented 2 years ago

The PR #3084 should fix this issue. Can you please test it when it is merged?

pnorbert commented 2 years ago

Actually, in BP5, PerformPuts() is kinda counterproductive. BP5 is the first engine that can take advantage of the deferred puts.

BP5 can write the Deferred user pointers directly to disk or to the aggregator and save memory. PerformPuts() forces the data into the internal buffer, i.e., it makes a copy anyway.

Sync puts perform this copy immediately to give back the user the pointer for reuse.

Since BP4 was always calling PerformPuts() in EndStep(), using deferred vs sync puts did not really change anything in memory consumption or performance. Calling PerformPuts() in BP5 equalizes the two cases as well, but then what's the point of using deferred puts?

The downside of using deferred puts can be too many small writes to disk from a process, instead of a single giant buffer write. This is not good. MinDeferredSize ensures buffering the smaller puts into internal buffers and only keeps the larger deferred puts in the list. The 4MB default value is not useful for an extreme app like PIConGPU unless you have only variables smaller than 4MB besides the giant variables. I would buffer the medium-size ones too. We need to learn in the future what value is good enough for SSD-based file systems and object stores (it could be smaller than for current disk-based systems).

I think the preferred order of put calls is this (a configuration sketch follows the list):

  1. use variable::SetMemorySpace(), pass a GPU memory pointer, and copy into the ADIOS buffer on the host using a sync put
  2. deferred put, but use a big MinDeferredSize to keep only the big chunks around.
  3. sync put for everything
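As a concrete example of option 2, a sketch of raising MinDeferredSize on the IO object; the IO name and the 64MB value are illustrative only:

#include <adios2.h>

adios2::IO configureBP5Deferred(adios2::ADIOS &adios)
{
    adios2::IO io = adios.DeclareIO("outputIO"); // hypothetical IO name
    io.SetEngine("BP5");
    // Deferred puts smaller than this are copied into internal buffers;
    // only larger ones stay in the zero-copy deferred list.
    io.SetParameter("MinDeferredSize", "67108864"); // 64MB instead of the 4MB default
    return io;
}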

pnorbert commented 2 years ago

Merged...

franzpoeschel commented 2 years ago

Hi Norbert, your PR did fix the large memory consumption (top row): [screenshot from 2022-03-03]

Thank you for the detailed description of the Put() modes in BP5! In principle, we use the Span API inside PIConGPU, but so far we only activated it in the BP4 engine. After reading your notes, I figured that I could try activating it for BP5 too, the result was a large runtime improvement (bottom row). In my tests, this is actually faster than BP4. I assume that's because BP5 does not initialize 20Gb of RAM with zeros, though I'm a bit surprised that the difference is so large. But maybe it's also skewed by running under Heaptrack.

I think with the Span approach we can probably also avoid having to deal with MinDeferredSize?
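For context, a minimal sketch of the Span-based path mentioned above, assuming the Span overload of Engine::Put(); the fill values are placeholders:

#include <adios2.h>
#include <cstddef>

void putViaSpan(adios2::Engine &engine, adios2::Variable<double> &var)
{
    // Reserve space directly in the engine's buffer and get a writable view.
    adios2::Variable<double>::Span span = engine.Put(var);
    double *buf = span.data();
    for (std::size_t i = 0; i < span.size(); ++i)
    {
        buf[i] = static_cast<double>(i); // fill in place; placeholder values
    }
    // No user-side staging copy and no PerformPuts() needed for this block;
    // the data already lives in the engine's buffer.
}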

pnorbert commented 2 years ago

Two big improvement steps! Thank you!

pnorbert commented 2 years ago

Using Span should eliminate the need to worry about MinDeferredSize. The only thing to make sure of is that BufferChunkSize is big. The new 128MB default is still tiny for your use case.

pnorbert commented 2 years ago

BTW, BP5 has BufferVType="malloc", with the same memory management as in BP4 in principle, but using malloc/realloc instead of vector.resize. That would be the closest way to see how much initializing memory with zeros slows down BP4.
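A sketch of that configuration, with illustrative parameter values only (the IO name is hypothetical):

#include <adios2.h>

adios2::IO configureBP5Malloc(adios2::ADIOS &adios)
{
    adios2::IO io = adios.DeclareIO("checkpointIO"); // hypothetical IO name
    io.SetEngine("BP5");
    // One growing buffer managed with malloc/realloc (no zero-initialization)
    // instead of the default chunked buffering.
    io.SetParameter("BufferVType", "malloc");
    io.SetParameter("InitialBufferSize", "16Gb"); // pre-size as one would with BP4
    return io;
}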

franzpoeschel commented 2 years ago

Yes, true. I think we should just make the maximum of 2Gb the default in PIConGPU.

Also, for users of openPMD outside of PIConGPU: we don't use PerformPuts in streaming mode (except if users explicitly call series.flush()), so we should hopefully be prepared to support efficient usage of BP5 for everyone.

BTW, BP5 has BufferVType="malloc", with the same memory management as in BP4 in principle, but using malloc/realloc instead of vector.resize. That would be the closest way to see how much initializing memory with zeros slows down BP4.

So I would use InitialBufferSize just like in BP4 then? (Would a single buffer greater than 2Gb even work with malloc?) Generally, I don't think that initializing the buffer has a great impact at scale, because it happens once at the beginning of the simulation (and IO effects are more important at higher scale anyway), but this might still be interesting for combining the advantages of BP4 and BP5.

pnorbert commented 2 years ago

Yes, same parameter and logic. I am not aware whether malloc has 31-bit limitations or whether std::vector has a better allocation routine by default. This logic (and BP3/4) assumes you can allocate a gigantic contiguous memory block. Chunking is an alternative for when that's not the case (it was not added to BP for performance).

franzpoeschel commented 2 years ago

I am not aware whether malloc has 31-bit limitations or whether std::vector has a better allocation routine by default.

I had remembered something wrong: malloc limitations are system dependent, but on my system it can easily go above 2Gb.