Closed: franzpoeschel closed this issue 2 years ago.
Just wanted to give a textual thumbs up to this feature request. We'll keep it in mind with the new serializer design...
So, @franzpoeschel we have this basic functionality in the BP5 Writer engine now. That is, in the middle of a timestep BP5 has the ability to move to disk all the data that has already been subject to a Put(). This can happen multiple times in a single step, emptying internal data buffers and also capturing all the data in Deferred Put()s. But BP5 works differently than BP3/4, so we've been discussing how to best expose the functionality. There are really three choices:
1) Make it triggered by some form of the existing Flush() functionality,
2) make PerformPuts() do it (which the name kind of implies), or
3) create a new interface like FlushData() that does it.
I lean towards 2, for reasons I'll describe, but I think it's not a slam dunk, so we thought we'd ask your input.
First, the problem with 1, using the existing Flush(), is that the existing Flush() really is more to push whole timesteps to disk when they have been queued due to timestep aggregation. That is, it is best used between timesteps, not in the middle of one, so we'd really be changing how it is used as well as what it does.
I think PerformPuts() is a better bet, though. Currently in BP3/4, what it does is copy data from Deferred puts into internal ADIOS buffers. This isn't terrible because BP3/4 eventually copy all data into internal buffers before moving it to disk. However, that isn't true of BP5. There we are trying to use vector I/O to do zero copy on a deferred Put() when moving its data to disk. So keeping the old semantics of PerformPuts(), which just copies deferred data internally so you can reuse those buffers, is kind of silly: if you wanted to be able to reuse those buffers, you could have just done a Sync Put(). So I tend to think it makes more sense for PerformPuts() (in the BP5 writer only) to actually perform the puts, that is, move data to disk. This involves making it MPI-collective, which it currently is not, but the name really makes sense for what it actually does, at least to me.
Of course, the third alternative is to just create some new method in the interface and leave both of those older interfaces alone. FlushData() is one candidate name that is reasonably descriptive, but I'm open to others as well. Anyhow, your thoughts?
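For concreteness, here is a minimal C++ sketch (not taken from the discussion itself) of the write pattern under debate: several batches of deferred Put()s inside one step, with a comment marking the point where the mid-step drain would go. All sizes and names are invented for illustration.

#include <adios2.h>
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    adios2::ADIOS adios(MPI_COMM_WORLD);
    adios2::IO io = adios.DeclareIO("output");
    io.SetEngine("BP5");

    const size_t n = 1 << 20; // elements per batch and rank, invented
    auto var = io.DefineVariable<double>(
        "data", {static_cast<size_t>(size) * n}, {static_cast<size_t>(rank) * n}, {n});
    std::vector<double> buffer(n, 1.0);

    adios2::Engine engine = io.Open("data.bp", adios2::Mode::Write);
    engine.BeginStep();
    for (int batch = 0; batch < 4; ++batch)
    {
        engine.Put(var, buffer.data(), adios2::Mode::Deferred);
        // <-- point under discussion: drain everything Put() so far to disk, via
        //     (1) some form of Flush(), (2) a collective PerformPuts(), or
        //     (3) a new call such as FlushData()
    }
    engine.EndStep();
    engine.Close();
    MPI_Finalize();
    return 0;
}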
Hello @eisenhauer, sorry for the late reply, I was on vacation last week. It's great to hear that this is coming around now :)
I wasn't really aware of these semantics for Engine::Flush(). Is this documented somewhere, or is there an example for how to use it? But in this case, it might be a bad candidate for exposing this functionality, yes.
So far, our workflow in the ADIOS2 backend of openPMD is to use Deferred Putting for everything, and then to use either PerformPuts (within a step) or EndStep (at the end of a step) to run the operations and free user buffers. I like the suggestion to include the new functionality within PerformPuts, as it seems to fit well into our workflow and might lead to a solution that does not require lots of changes in openPMD and user code. What makes me a bit hesitant is the idea of making an API call collective that was formerly not collective.
I have some questions:
1) Will PerformPuts still be available for the other backends, with the old behavior?
2) When would PerformPuts be collective? Always? Only when opening a new step? Something else?
3) How about giving PerformPuts the semantics: "Process user buffers and give them back to the user. What happens with the data afterwards is engine specific; some (BP4) may copy data to internal buffers, others (BP5) may write data to disk right away."
I think my preference really depends on the answer to question 2. If the answer is "always collective", I have a feeling that this would break compatibility with code using PerformPuts noncollectively. Maybe a flag PerformPuts(Mode::Collective)?
RE: Documentation on Engine::Flush(): What, a comment in the header file isn't enough for you? Or the IO::FlushAll() and ADIOS::FlushAll()? Unfortunately they aren't documented per se, and I only came to know about conflicts of usage when I ran into usages of Flush in the tests...
WRT your questions:
I think there are two downsides to making this functionality available via PerformPuts(). One is that some code that we know of always does PerformPuts() right before EndStep(). That is completely innocuous for BP3/4, and in BP5 has the side effect of slightly increasing the metadata size (unless I figure out a way to optimize out that extra data). The second is that yes, should anyone be doing PerformPuts in such a way that making it collective would break them, then they'd have to modify that before using BP5. Whether or not anyone might be doing that out there is unknown, and really we can't raise an exception or anything. They'd find out when they had some of their ranks hanging in PerformPuts() waiting for other calls. But maybe that's clear enough... If not, then we'd have to introduce some new call and only suitably modified code would benefit from this.
So, in conclusion, putting that functionality in PerformPuts makes sense semantically, at the cost of maybe breaking compatibility with old code. I'll put this point on the agenda for this evening's collaboration meeting so we can discuss how we should go forward.
Conclusion from the VC:
The chance of breaking compatibility is too large to put this functionality behind Engine::PerformPuts(). The openPMD-api currently uses this call noncollectively. If a user installs a new version of ADIOS2 together with an old version of openPMD, this may result in hanging applications due to the changed meaning of PerformPuts().
The current meaning of PerformPuts() is similar to what this feature request suggests, but the new feature differs from the current behavior in performance: it is a collective IO operation. In an IO framework such as ADIOS2, it is justified to make such a difference visible in the API. Due to the semantic similarity, it is still justified to put this behavior into PerformPuts(), so an extra flag seems appropriate.
Mockup suggestions:
enum class PutMode{ NonCollective, Collective };
void PerformPuts( PutMode mode = PutMode::NonCollective );
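A hypothetical usage sketch of that mockup; PutMode and the extra PerformPuts() overload exist only in the proposal above, and the variable and buffer names are invented:

// Hypothetical, following the mockup above; PutMode is not actual ADIOS2 API.
engine.Put(var, chunkA.data(), adios2::Mode::Deferred);
engine.PerformPuts();                    // old behavior: local, copies into internal buffers
engine.Put(var, chunkB.data(), adios2::Mode::Deferred);
engine.PerformPuts(PutMode::Collective); // proposed: all ranks drain buffered data to disk
engine.EndStep();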
Thanks for the summary! :)
Just naming-wise, I am not sure we should call this "collective"/"non-collective". What we try to express is whether we flush the buffers to disk (at a performance cost).
This could be done as in your example, but describing the functionality instead as:
enum class BufferDraining{ Lazy, Immediately };
void PerformPuts( BufferDraining mode = BufferDraining::Lazy );
or as separate API calls:
PerformPuts(); // Non-Collective
PerformPuts(); // Non-Collective
PerformPuts(); // Non-Collective
// ...
DrainBuffers(); // Collective
Personally, I slightly favor a separate API call, since that one does not change the MPI behavior of PerformPuts(). Also, a PerformPuts(BufferDraining::Immediately) would flush the buffer of itself and all previous puts, which is a bit obfuscated compared to a separate API call.
Either way, the same thing can be expressed, though.
To add my $.02:
I think it's important to know which API calls are collective vs. those that are not. Ideally, this should not depend on the underlying engines, since one of the strengths of ADIOS2 is the choice of engines that can be easily switched at run-time, but if that switching then leads to surprising hangs, it's not really worth that much. On the naming, I'd be kinda in favor of following the HDF5 terminology, i.e., independent vs. collective.
It is fundamentally fine to me to change the collectiveness of a given API call based upon an argument that declares whether it's called collectively or not.
What I think is not great is to mingle collectiveness with differing semantics. If PerformPuts is to ensure that the pointers/buffers previously passed to Put() can be reused after its return, that's one semantic issue, and I think it should keep this meaning.
That's different from the semantics of ensuring that data actually makes it to disk, so I think a different API call is in order for that. This latter call (e.g., DrainBuffers) may have its own collectiveness requirements, ideally consistent across engines, but that's secondary IMO.
If you see it justified to give this its own call, I'm also open to that, it's your API after all ;)
For us, the important part is to not break compatibility, which is what failing to distinguish collective from non-collective PerformPuts() calls would do.
FYI, EndStep calls Flush internally, see here. Flush does local data flushing, whereas metadata in BP3/BP4 is collective by default (which makes flushed data non-usable until proper collective metadata writing).
Only in BP3 can we skip collective metadata writing (it's expensive and done mostly for checkpoint-restart, with adios1.x bpmeta used for reconstruction); that's the trade-off of BP4, as metadata is only "appended" at the cost of EndStep being collective.
One of the reasons Flush is not fully advertised is that it's one of those internal functions required publicly by an application that not all engines are compliant with.
I'd just keep PerformPuts as-is (it's already being used in production with defined semantics across all engines), and reuse/rethink Flush to allow the option of making it collective without ending a step (the challenge would be to write partial step metadata, although the BP format is flexible in that sense). EndStep is well layered, so it shouldn't be difficult to reuse existing functionality, but it needs testing. Hope it helps.
Thanks for the details. Yes, our problem can be summarized as: we have workflows where we cannot call EndStep early enough to avoid running out of memory.
@ax3l sounds like your workflows gotta Flush :)
First results of this in PIConGPU
In this test run, this allowed saving ~15Gb of peak main memory (i.e. the entire InitialBufferSize) at the cost of an increased runtime. BP5 still seems noticeably slower than BP4, even without this feature, but I have not really tuned things for BP5 yet.
Thanks a lot for implementing this! This will be a very important feature for us on Frontier because of that system's rather limited memory (in comparison to its GPU memory).
@franzpoeschel Just wanted to make sure you saw ADIOS2 Issue #3060 and PR #3078. Ended up not sticking with the Flush() name because of the conflict with prior functionality. So now it's PerformDataWrite(), and it is indeed a no-op on any engine except BP5 file writer.
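For reference, a minimal sketch of how the merged call could slot into a write loop, assuming an ADIOS2 version that includes PR #3078; the batch container and function name are invented for illustration.

#include <adios2.h>
#include <vector>

// One step: enqueue each batch without copying, then collectively move everything
// buffered so far to disk; PerformDataWrite() is a no-op outside the BP5 file writer.
void writeStep(adios2::Engine &engine, adios2::Variable<double> var,
               const std::vector<std::vector<double>> &batches)
{
    engine.BeginStep();
    for (const auto &batch : batches)
    {
        engine.Put(var, batch.data(), adios2::Mode::Deferred);
        engine.PerformDataWrite(); // frees internal buffers and deferred data mid-step
    }
    engine.EndStep();
}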
Huh? I used Flush() for implementing this and it worked? But I can change it to PerformDataWrite() to be on the safe side, thanks for the note!
The change to PerformDataWrite() just got merged into master days ago (in anticipation of today's pre-release feature freeze).
Ah, got it. Then I'll change it, thank you :)
@franzpoeschel If you use Sync Puts, the BP5 chunk size is 16MB by default, so increasing it with the parameter BufferChunkSize would decrease the number of writes necessary per process. Go up to 2GB if possible.
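A small sketch of setting that parameter; the value is only an example (1 GB here, with the note above suggesting going up to roughly 2GB if possible), and it should be set before opening the engine.

#include <adios2.h>

// Sketch: fewer, larger writes per process by enlarging the BP5 buffer chunks.
void configureChunkSize(adios2::IO &io)
{
    io.SetEngine("BP5");
    io.SetParameter("BufferChunkSize", "1073741824"); // 1 GB, given in bytes
}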
Thank you for this hint @pnorbert, this made a noticeable difference: https://github.com/ComputationalRadiationPhysics/picongpu/pull/4002#issuecomment-1055500122
I don't really understand those memory graphs, please explain them at our next meeting. I especially worry about setup 3, where the chunk list grows to 68GB, while it was kept in check in all the other runs. I don't understand what is going on there.
Sure, we can talk about it in more detail. But in short, there's no need to worry about figure 3, as the behavior there is exactly the expected behavior, because I did not use the PerformDataWrite() feature in that run. So the memory builds up until the end of the step and then everything is flushed at once. This increases memory usage, but saves a little bit of time by only doing IO once.
Being able to go from (3) to (4) was exactly the point of this issue.
Ah, I see now that it grows extremely high. I assume that this might be the fault of setting the BufferChunkSize to 2GB? I believe that Heaptrack tracks memory by hijacking malloc, so this shows the virtual memory usage. I should check whether I can find out anything about real memory usage, which should be lower since most of the allocated memory is not used.
I confirmed the above now, see here
Thank you for confirming this. How many processes did this run use?
This was a single process for now, I have not tested this for scalability yet. Also, do I understand it correctly from the Heaptrack output that a new chunk is created for every single Put() in BP5?
It should not do that, unless every Put passes data with a size between 0.5 and 1 times the chunk size. It should pack multiple puts into one chunk. When it cannot pack any more puts into the chunk, it downsizes (reallocs) that chunk to its final size and then allocates a new chunk. Flush empties the buffer and the next Put starts again from scratch (allocates a new chunk).
Then I'm a bit confused by the overallocation seen in this benchmark (left-hand side).
Could this be triggered by running PerformPuts() frequently (which we do in PIConGPU)?
Do you use deferred Put(), and then call PerformPuts()? Looking at the buffering code, I think it is buggy in this case and allocates a brand new chunk for each deferred pointer (during PerformPuts) and never downsizes the previous buffer. So this is a bug.
Do you use deferred Put(), and then call PerformPuts()?
Yes. So I guess that this is exactly the reason. Should we switch to Sync Put() in BP5? (Put modes were something that I wanted to discuss in the next meeting anyway) Is there any reason not to use Sync Put()?
The PR #3084 should fix this issue. Can you please test it when it is merged?
Actually, in BP5, PerformPuts() is kinda counterproductive. BP5 is the first engine that can take advantage of the deferred puts.
BP5 can write the Deferred user pointers directly to disk or to the aggregator, and save memory. PerformPuts() forces the data into the internal buffer and thus makes a copy.
Sync puts perform this copy immediately to give the pointer back to the user for reuse.
Since BP4 was always calling PerformPuts() in EndStep(), using deferred vs sync puts did not really change anything in memory consumption and performance. Calling PerformPuts() in BP5 equalizes the two cases as well, but then what's the point of using deferred puts?
The downside of using deferred puts can be too many small writes to disk from a process, instead of a single giant buffer write. This is not good. MinDeferredSize ensures buffering the smaller puts into internal buffers and keeps only the larger deferred puts in the list. The 4MB default value is not useful for an extreme app like PIConGPU unless you have only variables smaller than 4MB besides the giant variables. I would buffer the medium-size ones too. We need to learn in the future what value is good enough for SSD-based file systems and object stores (it could be smaller than for current disk-based systems).
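To make the distinction concrete, a small sketch of the two Put() modes and the MinDeferredSize knob described above; variable and buffer names are invented, and the parameter should be set before opening the engine.

#include <adios2.h>
#include <vector>

// Deferred puts below this size are copied into internal buffers anyway, so only
// the large ones remain in the zero-copy list (4194304 bytes = the 4MB default).
void configureMinDeferredSize(adios2::IO &io)
{
    io.SetParameter("MinDeferredSize", "4194304");
}

void writeBuffers(adios2::Engine &engine,
                  adios2::Variable<double> varSmall, adios2::Variable<double> varLarge,
                  std::vector<double> &smallBuf, std::vector<double> &largeBuf)
{
    // Sync: data is copied immediately, smallBuf can be reused right after the call.
    engine.Put(varSmall, smallBuf.data(), adios2::Mode::Sync);
    // Deferred: only the pointer is recorded; BP5 can write it without an extra copy,
    // but largeBuf must stay valid until PerformPuts()/PerformDataWrite()/EndStep().
    engine.Put(varLarge, largeBuf.data(), adios2::Mode::Deferred);
}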
I think the preferred order of put calls is this: use the Span API where possible, otherwise deferred puts with a suitable MinDeferredSize to keep big chunks around.
Merged...
Hi Norbert, your PR did fix the large memory consumption (top row):
Thank you for the detailed description of the Put() modes in BP5! In principle, we use the Span API inside PIConGPU, but so far we have only activated it in the BP4 engine. After reading your notes, I figured that I could try activating it for BP5 too; the result was a large runtime improvement (bottom row). In my tests, this is actually faster than BP4. I assume that's because BP5 does not initialize 20Gb of RAM with zeros, though I'm a bit surprised that the difference is so large. But maybe it's also skewed by running under Heaptrack.
I think that with the Span approach we can probably also avoid having to deal with MinDeferredSize?
Two big improvement steps! Thank you!
Using Span should eliminate the need to worry about MinDeferredSize. The only thing to make sure of is that BufferChunkSize is big. The new 128MB default is still tiny for your use case.
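For readers unfamiliar with it, a minimal sketch of the Span-based pattern referred to here: the engine hands out a view into its own buffer, so no separate user-side staging buffer is needed. The variable name and fill values are illustrative.

#include <adios2.h>

// Request space inside ADIOS' own buffer instead of Put()-ing an external pointer.
void writeWithSpan(adios2::Engine &engine, adios2::Variable<double> var)
{
    engine.BeginStep();
    auto span = engine.Put(var);      // view into the engine's internal buffer
    double *data = span.data();
    for (size_t i = 0; i < span.size(); ++i)
    {
        data[i] = static_cast<double>(i); // fill in place; real code would compute here
    }
    engine.EndStep(); // the filled span is written out here
}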
BTW, BP5 has BufferVType="malloc", with the same memory management as in BP4 in principle, but using malloc/realloc instead of vector.resize. That would be the closest way to see how much initializing memory with zeros slows down BP4.
Yes, true. I think we should just make the maximum of 2Gb the default in PIConGPU.
Also, for users of openPMD outside of PIConGPU, we don't use PerformPuts in Streaming mode (except if users explicitly call series.flush()), so we should be prepared to support efficient usage of BP5, hopefully for everyone.
BTW, BP5 has BufferVType="malloc", with the same memory management as in BP4 in principle, but using malloc/realloc instead of vector.resize. That would be the closest way to see how much initializing memory with zeros slows down BP4.
So I would use InitialBufferSize just like in BP4 then? (Would a single buffer greater than 2Gb even work with malloc?)
Generally, I don't think that initializing the buffer has a great impact at scale, because it happens once at the beginning of the simulation (and IO effects are more important at higher scale anyway), but this might still be interesting for combining the advantages of BP4 and BP5.
Yes, same parameter and logic. I am not aware whether malloc has 31-bit limitations or whether std::vector has a better allocation routine by default. This logic (and BP3/4) assumes you can allocate a gigantic contiguous memory block. Chunking is an alternative for when that's not the case (it was not added to BP for performance).
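A sketch of the parameter combination this suggests for a BP4-like, single-buffer setup; the parameter names come from the discussion above, while the size value and its "Gb" spelling follow the BP4 convention and are only an example.

#include <adios2.h>

// BP4-like buffering in BP5: one contiguous buffer grown with realloc instead of
// vector.resize, pre-sized like BP4's InitialBufferSize (value is just an example).
void configureLikeBP4(adios2::IO &io)
{
    io.SetEngine("BP5");
    io.SetParameter("BufferVType", "malloc");
    io.SetParameter("InitialBufferSize", "20Gb");
}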
I am not aware whether malloc has 31-bit limitations or whether std::vector has a better allocation routine by default.
I remembered that wrong; malloc limitations are system dependent, but on my system it can easily go above 2Gb.
Related issues: #1891 and #2629.
Suggestion: The call Engine::Flush() should be made MPI-collective and its effect should be to drain data currently held in ADIOS buffers to disk. In streaming engines, this call would either have no effect or throw an exception.
Why is this feature important? Multiple reasons:
1) This is to some extent what the documentation for Engine::Flush promises to do anyway (from here).
2) The current Engine::Flush is somewhat defective. Its current effect is that all data written up to Engine::Flush in one step is thrown away and only the data between Engine::Flush and Engine::EndStep will appear in the written file. I don't know if this is somehow intended, but I can't think of a use for it.
3) Currently, data is drained to disk only by Engine::EndStep or by Engine::Close. This makes it hard to control memory usage within one ADIOS step. Examples are workflows that repeatedly call Engine::PerformPuts, feeding data to ADIOS bit by bit. If the correct InitialBufferSize is not specified, this will lead to repeated resizing of ADIOS buffers, with negative effects on memory usage, as outlined in #2629. Also, the amount of data written in one step depends on user input and varies over steps, so the size of a step should be considered unbounded. By allowing intermediate Engine::Flushes, we could put a bound on the amount of data held at any point in ADIOS. The InitialBufferSize would become easier to specify and would not require predicting the largest step to come in a simulation.
4) If Engine::Flush doesn't require the same level of synchrony, this would become a lot easier.
What is the potential impact of this feature in the community? Simulations with more diverse write patterns would be able to use ADIOS and have more controlled memory usage. ADIOS optimizes towards writers with synchronous steps where one step fits fully into memory (per process) and data can be presented in a clear two-phase workflow of (1) declare variables and enqueue Put() operations and (2) perform the enqueued operations. This feature request would make it easier to use ADIOS in applications that do not fulfill these assumptions.
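A sketch of the write pattern this request has in mind, with Flush() carrying the proposed collective drain-to-disk semantics rather than the behavior described in point 2 above; the data-producing callbacks are hypothetical stand-ins for the application.

#include <adios2.h>
#include <vector>

// Proposed pattern: feed data to ADIOS bit by bit within one step and keep memory
// bounded by draining buffers to disk in between. moreDataAvailable()/nextBatch()
// are hypothetical stand-ins for the application; selection handling is omitted.
void writeOneStep(adios2::Engine &engine, adios2::Variable<double> var,
                  bool (*moreDataAvailable)(), std::vector<double> (*nextBatch)())
{
    engine.BeginStep();
    while (moreDataAvailable())
    {
        std::vector<double> batch = nextBatch();
        engine.Put(var, batch.data(), adios2::Mode::Deferred);
        engine.PerformPuts(); // run the enqueued operation so 'batch' may be freed
        engine.Flush();       // proposed: collectively drain ADIOS buffers to disk
    }
    engine.EndStep();
}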
Is your feature request related to a problem? Please describe.
Already outlined above.
Describe the solution you'd like and potential required effort
The solution is already described above. For the required effort, I am not sufficiently aware of the BP format intrinsics. I imagine the challenging points would be (1) draining the buffer several times within one step and (2) appending to an already written step in the BP file. Also, it is slightly API-breaking by changing the behavior of Engine::Flush.
Describe alternatives you've considered and potential required effort
This feature request achieves two goals:
(1) putting a bound on the amount of data held in ADIOS buffers at any time, and (2) avoiding the need to specify InitialBufferSize correctly. The alternative for the second goal, specifying InitialBufferSize correctly, needs to be done either by the user (bad UX, bad performance by default) or programmatically (requires duplication of logic in PIConGPU, only feasible if writing a new ADIOS file per step, otherwise the size of the largest step to come needs to be computed).
Additionally, according to @pnorbert, the new serializer to come in late 2021 will address the inefficient buffer resizing of the current serializer(s), so that will be a more permanent solution. This feature request would not cease to be useful at that point, since point (1) will remain relevant.