eschnett opened this issue 3 years ago
The complete code is here https://gist.github.com/eschnett/d891cebca607f178aa0a1344c615dac0 . Since it is part of the Einstein Toolkit, it won't compile on its own.
Hi @eschnett,
Thanks for the detailed example!
It is my impression that closing the iteration should already write all data to the file.
I think that is reasonable, with a few details to know about small, high-frequency data outputs (see below). Let's check this again; `Iteration::close()` is relatively new and was initially made for data streaming.
It is also my impression that it is more efficient to keep the Series object around. (I think this is implied from the ADIOS2 documentation.)
Yes, keeping the `Series` around is generally recommended and will save ADIOS2 sub-file open/close operations as well as buffer reallocations.
We implement quite a few options for outputting iteration data with respect to series of data over iterations. We can encode iterations:
- in individual files (`fileBased`: `"file_%T.bp"`),
- inside a single file, in groups or a directory (`groupBased`: `"file.bp"`), or
- for ADIOS2, in a new mode for the latter that encodes iterations within ADIOS2 variables (`variableBased`: `"file.bp"` plus `Series::setIterationEncoding( openPMD::IterationEncoding::variableBased );`).
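For illustration, a hedged sketch (C++; not from this thread) of how each encoding would be selected. The constructor and `setIterationEncoding` calls follow the openPMD-api interface quoted above; file names are examples:

```cpp
#include <openPMD/openPMD.hpp>

void chooseEncodings()
{
    using namespace openPMD;

    // fileBased: one file per iteration; "%T" expands to the iteration index
    Series perIteration("file_%T.bp", Access::CREATE);

    // groupBased: all iterations inside one file, stored in groups
    Series grouped("file.bp", Access::CREATE);

    // variableBased (ADIOS2 only): iterations encoded within ADIOS2 variables
    Series variable("file.bp", Access::CREATE);
    variable.setIterationEncoding(IterationEncoding::variableBased);
}
```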
We document them in the openPMD-standard, but not yet well enough in the manual of openPMD-api; that should be fixed.
The `flush()` call itself guarantees only that you can use the passed "user-side" buffers again (similar to async MPI contracts). Its naming is a bit unlucky.
The output mode you used encodes the iterations in groups within the file itself (`groupBased` iteration encoding). If the data is small, ADIOS might indeed keep it in its buffers until we finalize it (at the latest in `~Series`). ADIOS does that because it might combine small writes from many time steps into larger, more efficient writes.
As mentioned, we just added the `Iteration::close()` call, specifically for the newer modes, i.e. for data streaming. We could check whether we can force a write at this point for the "traditional" output modes. Semantically, this call just signals the backend that there won't be further modifications from the user, and the extra argument `flush = true` again makes sure that user-passed buffers can be re-used. If a user writes truly little data per iteration, it is still more efficient (for ADIOS) to keep the data in a buffer and not write it to disk just yet.
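To make the buffer-reuse guarantee concrete, a hedged sketch (openPMD-api C++; the record name and shapes are illustrative):

```cpp
#include <openPMD/openPMD.hpp>
#include <vector>

void writeOne(openPMD::Series &series, int step, std::vector<double> &data)
{
    auto it = series.iterations[step];
    auto rho = it.meshes["rho"][openPMD::MeshRecordComponent::SCALAR];
    rho.resetDataset({openPMD::Datatype::DOUBLE, {data.size()}});
    rho.storeChunk(openPMD::shareRaw(data), {0}, {data.size()});

    // flush() only guarantees that `data` may be reused afterwards;
    // the backend may still hold the bytes in its own buffers.
    series.flush();

    // close(flush = true) additionally signals that this iteration will
    // not be modified further; disk writes may still be deferred.
    it.close(true);
}
```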
If you would like to try the latest `dev` branch that will go into `0.14.0` soon, you could experiment with the `openPMD::IterationEncoding::variableBased` encoding in ADIOS2, which is what the ADIOS2 developers generally recommend we use to encode our iterations. The writing API is identical to the streaming example, but use `.bp` as the file ending and set `Series::setIterationEncoding( openPMD::IterationEncoding::variableBased );`. At the end of an `Iteration::close()`, we end an ADIOS "Step", which triggers a write in this mode (cc @franzpoeschel).
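A hedged sketch of that variable-based usage, mirroring the streaming example (the exact `dev`-branch API may differ):

```cpp
#include <openPMD/openPMD.hpp>

void writeVariableBased(int nSteps)
{
    openPMD::Series series("file.bp", openPMD::Access::CREATE);
    series.setIterationEncoding(openPMD::IterationEncoding::variableBased);

    for (int step = 0; step < nSteps; ++step)
    {
        auto it = series.writeIterations()[step];
        // ... storeChunk() calls ...
        it.close(); // ends an ADIOS2 "Step" and triggers the write
    }
}
```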
cc @franzpoeschel: given the latest changes to the `flush()` guarantees, we should probably call `Engine::Flush()` in ADIOS at those points now. What do you think? :)
https://openpmd-api.readthedocs.io/en/latest/usage/workflow.html
In the WarpX case, I call `flush()` a couple of times for temporary particle data concatenation operations before closing an iteration; a true call to `Engine::Flush()` would actually only be needed on `Iteration::close(flush=true)`.
The flush() call itself guarantees only that you can use the passed "user-side" buffers again (similar to Async MPI contracts).
I need a stronger guarantee. I want to delete old checkpoint files when new ones are on disk. I also want to ensure files are written safely in case the job aborts for some reason (out of queue time, writing data just before a known crash, etc.).
I'm using `IterationEncoding::variableBased` now. This does flush the content to file.
The flush() call itself guarantees only that you can use the passed "user-side" buffers again (similar to Async MPI contracts).
I need a stronger guarantee. I want to delete old checkpoint files when new ones are on disk. I also want to ensure files are written safely in case the job aborts for some reason (out of queue time, writing data just before a known crash, etc.).
Yes, that makes sense, and I actually have a similar need. We should expose a function for that, potentially on `Iteration::close(flush=true)`. We could make this an option (and default to doing so on `Iteration::close()`).
Most of this has already been discussed, just a few notes:
We implement quite a few options for outputting iteration data with respect to series of data over iterations. We can encode iterations:
- in individual files (`fileBased`: `"file_%T.bp"`),
- inside a single file, in groups or a directory (`groupBased`: `"file.bp"`), or
- for ADIOS2, in a new mode for the latter that encodes iterations within ADIOS2 variables (`variableBased`: `"file.bp"` plus `Series::setIterationEncoding( openPMD::IterationEncoding::variableBased );`).
Theoretically, you can also (1) use the normal (group-based) layout, (2) use the streaming API, and (3) set `{"adios2": {"engine": {"usesteps": true}}}` as a JSON option. This is opt-in, though, since there may be edge cases, and variable-based encoding is preferred as soon as it is stable.
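For reference, passing that option via the JSON argument of the `Series` constructor could look roughly like this (a sketch; the JSON keys are as quoted above):

```cpp
#include <openPMD/openPMD.hpp>

openPMD::Series makeSeries()
{
    return openPMD::Series(
        "file.bp",
        openPMD::Access::CREATE,
        R"({"adios2": {"engine": {"usesteps": true}}})");
}
```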
We document them in openPMD-standard but not yet well enough in the manual of openPMD-api, which should be fixed.
Noted.
The `flush()` call itself guarantees only that you can use the passed "user-side" buffers again (similar to async MPI contracts). Its naming is a bit unlucky. The output mode you used encodes the iterations in groups within the file itself (`groupBased` iteration encoding). If the data is small, ADIOS might indeed keep it in the buffers until we finalize it (at the latest in `~Series`). ADIOS does that because it might combine small writes of many time steps into larger, more efficient writes.
To put this more strongly: ADIOS will keep the data in its buffer until either a step or the whole file is closed. There is currently no way around that.
As mentioned, we just added the `Iteration::close()` call, specifically for the newer modes, i.e. for data streaming. We could check if we can force a write at this point for the "traditional" output.
cc @franzpoeschel: given the latest changes to the `flush()` guarantees, we should probably call `Engine::Flush()` in ADIOS at those points now. What do you think? :)
See the issue above. TL;DR: `Engine::Flush` has no clearly-defined semantics and does unexpected things at the moment.
I need a stronger guarantee. I want to delete old checkpoint files when new ones are on disk. I also want to ensure files are written safely in case the job aborts for some reason (out of queue time, writing data just before a known crash, etc.).
You might want to think about simply using file-based iteration layout for that by opening the filename `"file%T.bp"` instead. openPMD will then create a distinct file `file0.bp`, `file1.bp`, `file2.bp`, … per snapshot, allowing you to delete single checkpoints without affecting others.
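A hedged sketch of the resulting checkpoint rotation (the deletion step uses `std::filesystem` and is illustrative, not openPMD API; `.bp` output can be a directory, hence `remove_all`):

```cpp
#include <filesystem>
#include <string>
#include <openPMD/openPMD.hpp>

void writeCheckpoint(int step)
{
    // file-based encoding: "%T" creates file0.bp, file1.bp, ... per snapshot
    openPMD::Series series("file%T.bp", openPMD::Access::CREATE);
    auto it = series.iterations[step];
    // ... write checkpoint data ...
    it.close();
} // ~Series finalizes this snapshot's file

void deleteOldCheckpoint(int step)
{
    // snapshots are independent files, so older ones can be removed safely
    std::filesystem::remove_all("file" + std::to_string(step) + ".bp");
}
```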
We should expose a function for that. Potentially on `Iteration::close(flush=true)`.
`Iteration::close` already does this everywhere it can. For the simple group-based layout, we are limited by the functionality that ADIOS2 exposes. Note our documentation, which recommends against using group-based iteration encoding in ADIOS2 for that reason.
I am now using `series->setIterationEncoding(openPMD::IterationEncoding::variableBased);`, and the file is indeed written on `iter.close()`.
However, memory is not released! I need to call `series.reset()` to ensure that memory is released. The amount of memory retained is significant (about 20 GByte per process, 4 processes per node) and drives the simulation into swap otherwise.
This is the current development version of openPMD and the current release of ADIOS2.
Oh, good point, that is another ADIOS2 feature: buffering as much as it can in order to perform larger write chunks.
We expose control of that buffer via JSON options in the `Series` constructor (last argument); can you try whether that helps?
https://openpmd-api.readthedocs.io/en/0.13.3/details/backendconfig.html#configuration-structure-per-backend
The parameter should be e.g. `MaxBufferSize: 500Mb`:
https://adios2.readthedocs.io/en/latest/engines/engines.html#bp4
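Such a cap would be passed through the per-backend JSON configuration (a sketch; the nesting follows the openPMD-api backend-config page linked above):

```json
{
  "adios2": {
    "engine": {
      "parameters": {
        "MaxBufferSize": "500Mb"
      }
    }
  }
}
```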
We also have some ideas for future improvements with the ADIOS2 team about the dynamic size of this buffer: https://github.com/ornladios/ADIOS2/issues/1814 https://github.com/ornladios/ADIOS2/issues/2629
This doesn't work: ADIOS2 aborts with

```
terminate called after throwing an instance of 'std::runtime_error'
  what(): ERROR: data size: 11115.284180 Mb is too large for adios2 bp MaxBufferSize=500.000000Mb, try increasing MaxBufferSize in call to IO SetParameters in call to PerformPuts
```
I have no single variable that is that large. This must be the aggregate size of all variables, likely even over several processes combined. (I am using 4 processes, and Silo stores all data from a single process in 5 GByte.)
I have no single variable that is that large. This must be the aggregate size of all variables, likely even over several processes combined. (I am using 4 processes, and Silo stores all data from a single process in 5 GByte.)
That's right, the buffer usage should be as large as the accumulated size of the variables you write in a step from one rank, and it should not grow after `Iteration::close()` unless the next iteration you write contains more data than the previous one.
Can you fit a copy of your data in RAM?
If you remove the upper cap given by `MaxBufferSize` again: if you close the whole `Series` (`~Series`) instead of using `Iteration::close()`, does this work? In that case, we might have to check whether we forget to trigger something that drains the buffer.
Yes, if I destroy the `Series`, the memory gets freed. I can then create a new `Series` for the next iteration.
(Since ADIOS2 cannot append to files, this also means that I need to create a separate file for each iteration.)
That destroying the `Series` frees the memory while `Iteration::close()` does not sounds like a bug in our streaming API or in ADIOS2.
Can you share a small reproducer with us?
Update: we should be able to reproduce this with our own example by replacing `.sst` with `.bp`:
https://openpmd-api.readthedocs.io/en/0.13.3/usage/streaming.html#id1
I don't have a small reproducer. This is with the Einstein Toolkit outputting the initial conditions for a binary black hole simulation. If you can reproduce this with your example, that would be convenient.
@franzpoeschel let us not forget to get back to this; this memory issue might stem from a regression in the streaming API.
@eschnett do you call `flush()` before the `close()`? Otherwise, that might be the origin of the leak, because we do not yet support what you reported in #994
@eschnett Can you paste the output of `bpls -D <your dataset>`? You should not be seeing such behavior when ADIOS2 steps are used, so I assume that steps are somehow not being used. The output of `bpls` would confirm that.
I assume that you call `series.setIterationEncoding()` before the first flush? Once data has been flushed, the encoding cannot be changed any more. (We should probably document that, or throw an error if something like this is observed.)
I also recently ran into a somewhat similar confusion. I feel this doc page is not fully clear in explaining what `flush()` actually does. Especially "In short: operations requested by storeChunk() and loadChunk() must happen exactly at flush points." may be understood as saying that the file read/write actually happens there, rather than just that the user can proceed from there.
We don't have full control over what exactly the backend does upon `flush()`. Some backends (HDF5) may immediately write the data to disk; others (ADIOS2) may stage it inside their own memory. In the future, there might be an additional backend-specific `flush()` control, because the new BP5 engine in ADIOS2 actually lets us choose.
The only thing that we can really guarantee is that buffers are released after flushing; the rest is backend-specific.
Yes, I understand that. My point was more that the documentation could be more explicit about it: with writing, it is not obvious that calling `flush()` does not necessarily mean the file operation is actually performed at that moment. E.g., the last three PIConGPU-related people I discussed it with all thought `flush() == file write`, and that if we are past that point on the PIConGPU side, the file must already have been written.
Ah, ok. I've added it as a Todo to this PR because it fits well within it.
What do you want to achieve, please describe.
I want to use openPMD to output data in the Einstein Toolkit. We are using AMReX and its mesh refinement. I am currently experimenting with a first prototype of an openPMD-based I/O module for the Einstein Toolkit.
I looked at the examples and the API. I am puzzled by one thing: files are created, but they remain empty (0 bytes) until the `Series` object is destroyed. The code is approximately:
I need to call `series.reset()` to have the data written to file. It is my impression that closing the iteration should already write all data to the file. It is also my impression that it is more efficient to keep the `Series` object around. (I think this is implied from the ADIOS2 documentation.)
I installed openPMD-api via Spack a few days ago. I am using macOS. I am using ADIOS2 as the back end.
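The code block referenced by "The code is approximately:" did not survive this transcription; the full version is in the gist linked at the top of the issue. A minimal hedged sketch of the described pattern, assuming the openPMD-api C++ interface (names illustrative):

```cpp
#include <memory>
#include <openPMD/openPMD.hpp>

void run(int nSteps)
{
    auto series = std::make_unique<openPMD::Series>(
        "file.bp", openPMD::Access::CREATE);

    for (int step = 0; step < nSteps; ++step)
    {
        openPMD::Iteration it = series->iterations[step];
        // ... resetDataset() / storeChunk() calls for the mesh data ...
        it.close(); // expected to write this iteration's data
    }

    // Observed behavior: the file stays at 0 bytes until the Series
    // is destroyed, e.g. via series.reset().
    series.reset();
}
```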