openPMD / openPMD-api

:floppy_disk: C++ & Python API for Scientific I/O
https://openpmd-api.readthedocs.io
GNU Lesser General Public License v3.0

Should Iteration::close(flush=true) write files? #986

Open eschnett opened 3 years ago

eschnett commented 3 years ago

What do you want to achieve, please describe.

I want to use openPMD to output data in the Einstein Toolkit. We are using AMReX and its mesh refinement. I am currently experimenting with a first prototype of an openPMD-based I/O module for the Einstein Toolkit.

I looked at the examples and the API. I am puzzled by one thing: Files are created, but they remain empty (0 bytes) until the Series object is destroyed.

The code is approximately:

series = make_unique<openPMD::Series>("file.bp", openPMD::Access::CREATE, MPI_COMM_WORLD);
write_iters = make_unique<openPMD::WriteIterations>(series->writeIterations());
openPMD::Iteration iter = (*write_iters)[cctk_iteration];
{
  openPMD::Mesh mesh = iter.meshes[groupname];
  components.at(vi).resetDataset(dataset);
  components.at(vi).storeChunk(openPMD::shareRaw(ptr), offset, extent);
}
iter.close();
write_iters.reset();
series->flush();

I need to call series.reset() to have the data written to file. It is my impression that closing the iteration should already write all data to the file.

It is also my impression that it is more efficient to keep the Series object around. (I think this is implied from the ADIOS2 documentation.)

I installed openPMD-api via Spack a few days ago. I am using macOS. I am using ADIOS2 as back end.

eschnett commented 3 years ago

The complete code is here https://gist.github.com/eschnett/d891cebca607f178aa0a1344c615dac0 . Since it is part of the Einstein Toolkit, it won't compile on its own.

ax3l commented 3 years ago

Hi @eschnett,

Thanks for the detailed example!

It is my impression that closing the iteration should already write all data to the file.

I think that is a reasonable expectation, with a few details to know about small, high-frequency data outputs (see below). Let's check this again: Iteration::close() is relatively new and was initially made for data streaming.

It is also my impression that it is more efficient to keep the Series object around. (I think this is implied from the ADIOS2 documentation.)

Yes, keeping the Series around is generally recommended and will save ADIOS2 sub-file open/close as well as buffer reallocation.

Background & Mumbling

We implement quite a few options for encoding a series of data over iterations: in individual files (fileBased: "file_%T.bp"), inside a single file in groups (groupBased: "file.bp"), and, newly for ADIOS2, encoded within ADIOS2 variables (variableBased: "file.bp" plus Series::setIterationEncoding( openPMD::IterationEncoding::variableBased );).
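For orientation, the three encodings could be selected roughly like this. This is a minimal sketch, not compiled against a specific openPMD-api version; the filenames are placeholders:

```cpp
#include <openPMD/openPMD.hpp>

void open_series_examples()
{
    // fileBased: "%T" expands to the iteration number, one file per iteration
    openPMD::Series perFile("file_%T.bp", openPMD::Access::CREATE);

    // groupBased: a plain filename stores all iterations in one file, in groups
    openPMD::Series grouped("file.bp", openPMD::Access::CREATE);

    // variableBased: one file, iterations encoded in ADIOS2 steps/variables
    openPMD::Series varBased("file_v.bp", openPMD::Access::CREATE);
    varBased.setIterationEncoding(openPMD::IterationEncoding::variableBased);
}
```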

We document them in the openPMD-standard but not yet well enough in the openPMD-api manual, which should be fixed.

The flush() call itself guarantees only that you can use the passed "user-side" buffers again (similar to Async MPI contracts). Its naming is a bit unlucky.

The output mode you used encodes the iterations in groups within the file itself (groupBased iteration encoding). If the data is small, ADIOS might indeed keep it in its buffers until we finalize (at the latest: in ~Series) - ADIOS does that because it might combine small writes of many time steps into larger, more efficient writes.

As mentioned, we just added the Iteration::close() call, specifically for the newer modes, i.e. for data streaming. We could check if we can force a write at this point for the "traditional" output. Semantically, this call just signals the backend that there won't be further modifications from the user, and the extra argument flush = true again makes sure user-passed buffers can be re-used. If a user writes very little data per iteration, it is still more efficient (for ADIOS) to keep the data in a buffer and not write it to disk just yet.

If you would like to try the latest dev branch that will go into 0.14.0 soon, you could experiment with the openPMD::IterationEncoding::variableBased encoding in ADIOS2, which is what the ADIOS2 developers generally recommend to us for encoding our iterations. The writing API is identical to the streaming example, but use .bp as the file ending and set Series::setIterationEncoding( openPMD::IterationEncoding::variableBased );. At the end of an Iteration::close(), we end an ADIOS "Step", which triggers a write in this mode (cc @franzpoeschel).
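For reference, the write loop with variableBased encoding might look as follows. This is a sketch assuming a dev/0.14-era openPMD-api with the ADIOS2 backend, not compiled here; the step count and filename are placeholders:

```cpp
#include <cstdint>
#include <openPMD/openPMD.hpp>

void write_variable_based()
{
    openPMD::Series series("file.bp", openPMD::Access::CREATE);
    series.setIterationEncoding(openPMD::IterationEncoding::variableBased);

    for (std::uint64_t step = 0; step < 10; ++step)
    {
        openPMD::Iteration iter = series.writeIterations()[step];
        // ... declare meshes, resetDataset(), storeChunk() as usual ...
        iter.close(); // ends an ADIOS2 step, which triggers the write
    }
} // ~Series finalizes the ADIOS2 engine
```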

ax3l commented 3 years ago

cc @franzpoeschel regarding the latest changes to the flush() guarantees: we should probably call Engine::Flush() in ADIOS at those points now. What do you think? :) https://openpmd-api.readthedocs.io/en/latest/usage/workflow.html

In the WarpX case, I call flush() a couple of times for temporary particle data concatenation operations before closing an iteration - a true call to Engine::Flush() would actually only be needed on Iteration::close(flush=true).

eschnett commented 3 years ago

The flush() call itself guarantees only that you can use the passed "user-side" buffers again (similar to Async MPI contracts).

I need a stronger guarantee. I want to delete old checkpoint files when new ones are on disk. I also want to ensure files are written safely in case the job aborts for some reason (out of queue time, writing data just before a known crash, etc.).

eschnett commented 3 years ago

I'm using IterationEncoding::variableBased now. This does flush the content to file.

ax3l commented 3 years ago

The flush() call itself guarantees only that you can use the passed "user-side" buffers again (similar to Async MPI contracts).

I need a stronger guarantee. I want to delete old checkpoint files when new ones are on disk. I also want to ensure files are written safely in case the job aborts for some reason (out of queue time, writing data just before a known crash, etc.).

Yes, that makes sense and I actually have a similar need. We should expose a function for that. Potentially on Iteration::close(flush=true). We could make this an option (and default to doing so on Iteration::close()).

franzpoeschel commented 3 years ago

Most of this has already been discussed, just a few notes:

We implement quite a few options for encoding a series of data over iterations: in individual files (fileBased: "file_%T.bp"), inside a single file in groups (groupBased: "file.bp"), and, newly for ADIOS2, encoded within ADIOS2 variables (variableBased: "file.bp" plus Series::setIterationEncoding( openPMD::IterationEncoding::variableBased );).

Theoretically, you can also (1) use the normal (groupbased) layout, (2) use the streaming API and (3) set {"adios2": {"engine":{"usesteps": true}}} as a JSON option. This is opt-in though since there may be edge cases and variable-based encoding is preferred as soon as it is stable.
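Spelled out, that JSON option (passed as the last argument of the Series constructor) would look like the fragment below; whether further engine parameters are needed depends on your setup:

```json
{
  "adios2": {
    "engine": {
      "usesteps": true
    }
  }
}
```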

We document them in openPMD-standard but not yet well enough in the manual of openPMD-api, which should be fixed.

Noted.

The flush() call itself guarantees only that you can use the passed "user-side" buffers again (similar to Async MPI contracts). Its naming is a bit unlucky.

The output mode you used encodes the iterations in groups within the file itself (groupBased iteration encoding). If the data is small, ADIOS might indeed keep it in its buffers until we finalize (at the latest: in ~Series) - ADIOS does that because it might combine small writes of many time steps into larger, more efficient writes.

To put this more strongly: ADIOS will keep it in the buffer until either a step or the whole file is closed. There is currently no way around that.

As mentioned, we just added the Iteration::close() call, specifically for the newer modes, i.e. for data streaming. We could check if we can force a write at this point for the "traditional" output.

Relevant issue

cc @franzpoeschel regarding the latest changes to the flush() guarantees: we should probably call Engine::Flush() in ADIOS at those points now. What do you think? :)

See issue above. TLDR: Engine::Flush has no clearly-defined semantics and does unexpected things at the moment.

I need a stronger guarantee. I want to delete old checkpoint files when new ones are on disk. I also want to ensure files are written safely in case the job aborts for some reason (out of queue time, writing data just before a known crash, etc.).

You might want to think about simply using filebased iteration layout for that by opening the filename "file%T.bp" instead. openPMD will then create a distinct file file0.bp, file1.bp, file2.bp, … per snapshot, allowing you to delete single checkpoints without affecting others.
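With fileBased encoding, rotating checkpoints then becomes plain filesystem housekeeping. A hypothetical helper (prune_checkpoints is not part of openPMD-api; the "file<T>.bp" naming pattern is an assumption matching the example above) might look like:

```python
# Hypothetical helper (not part of openPMD-api): prune old fileBased
# checkpoints once newer ones are safely on disk.
import re
import shutil
from pathlib import Path


def prune_checkpoints(directory, pattern=r"file(\d+)\.bp", keep=2):
    """Delete all but the `keep` newest checkpoints matching `pattern`."""
    rx = re.compile(pattern)
    found = []
    for entry in Path(directory).iterdir():
        match = rx.fullmatch(entry.name)
        if match:
            found.append((int(match.group(1)), entry))
    found.sort()  # ascending iteration number
    to_delete = found[:-keep] if keep > 0 else found
    for _, entry in to_delete:
        if entry.is_dir():  # ADIOS2 ".bp" outputs are directories on disk
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return [entry.name for _, entry in found[len(to_delete):]]
```

Deleting old checkpoints only after the newer checkpoint's Iteration::close() (or Series destruction) has returned keeps at least one complete checkpoint on disk at all times.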

We should expose a function for that. Potentially on Iteration::close(flush=true).

Iteration::close already does this everywhere it can. For simple group-based layout, we are limited by what functionality ADIOS2 exposes. Note our documentation that recommends against using groupbased iteration encoding in ADIOS2 for that reason.

eschnett commented 3 years ago

I am now using

    series->setIterationEncoding(openPMD::IterationEncoding::variableBased);

and the file is indeed written on iter.close().

However, memory is not released! I need to call series.reset() to ensure that memory is released. The amount of memory retained is significant (about 20 GByte per process, 4 processes per node) and otherwise drives the simulation into swap.

This is the current development version of openPMD, and the current release of ADIOS2.

ax3l commented 3 years ago

Oh, good point, that is another ADIOS2 feature - buffering as much as it can for larger write chunks. We expose control of that buffer via JSON options in the Series constructor (last argument); can you try whether that helps? https://openpmd-api.readthedocs.io/en/0.13.3/details/backendconfig.html#configuration-structure-per-backend

The parameter should be e.g. MaxBufferSize: 500Mb https://adios2.readthedocs.io/en/latest/engines/engines.html#bp4
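If I read the linked backend-config docs right, that engine parameter would be passed through the Series JSON options roughly like this (the nesting of the keys is my reading of those docs):

```json
{
  "adios2": {
    "engine": {
      "parameters": {
        "MaxBufferSize": "500Mb"
      }
    }
  }
}
```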

We also have some ideas for future improvements with the ADIOS2 team about the dynamic size of this buffer: https://github.com/ornladios/ADIOS2/issues/1814 https://github.com/ornladios/ADIOS2/issues/2629

eschnett commented 3 years ago

This doesn't work: ADIOS2 aborts with

terminate called after throwing an instance of 'std::runtime_error'
  what():  ERROR: data size: 11115.284180 Mb is too large for adios2 bp MaxBufferSize=500.000000Mb, try increasing MaxBufferSize in call to IO SetParameters in call to PerformPuts

I have no single variable that is that large. This must be the aggregate size of all variables, likely even over several processes combined. (I am using 4 processes, and Silo stores all data from a single process in 5 GByte.)

ax3l commented 3 years ago

I have no single variable that is that large. This must be the aggregate size of all variables, likely even over several processes combined. (I am using 4 processes, and Silo stores all data from a single process in 5 GByte.)

That's right: the buffer usage should be as large as the accumulated size of the variables a rank writes in one step, and it should not grow after Iteration::close() unless the next iteration you write contains more data than the previous one.

Can you fit a copy of your data in RAM?

If you remove the upper cap given by MaxBufferSize again: does it work if you close the whole Series (~Series) instead of using Iteration::close()? In that case we might have to check whether we forget to trigger something that drains the buffer.

eschnett commented 3 years ago

Yes, if I destroy the Series, the memory gets freed. I can then create a new Series for the next iteration.

(Since ADIOS2 cannot append to files, this also means that I need to create a separate file for each iteration.)

ax3l commented 3 years ago

That destroying the Series frees the memory, while Iteration::close() does not, sounds like a bug in our streaming API or in ADIOS2. Can you share a small reproducer with us?

Update: We should be able to reproduce this probably with our own example and replacing .sst with .bp: https://openpmd-api.readthedocs.io/en/0.13.3/usage/streaming.html#id1

eschnett commented 3 years ago

I don't have a small reproducer. This is with the Einstein Toolkit outputting the initial conditions for a binary black hole simulation. If you can reproduce this with your example, that would be convenient.

ax3l commented 3 years ago

@franzpoeschel let us not forget to get back to this, this memory issue might be from a regression in the streaming API.

@eschnett do you call flush() before the close()? Otherwise that might be the origin of the leak, because we do not yet support what you reported in #994

franzpoeschel commented 3 years ago

@eschnett Can you paste the output of bpls -D <your dataset>? You should not be seeing such behavior if ADIOS2 steps are used, so I assume that steps are somehow not being used. The output of bpls would confirm that.

I assume that you call series.setIterationEncoding() before the first flush? Once data has been flushed, the encoding cannot be changed any more. (We should probably document that, or throw an error if something like this is observed.)

sbastrakov commented 2 years ago

I also recently faced a somewhat similar confusion. I feel this doc page is not fully clear in explaining what flush() actually does. In particular, "In short: operations requested by storeChunk() and loadChunk() must happen exactly at flush points." may be understood as meaning that the file read/write actually happens there, not just that the user can proceed from there.

franzpoeschel commented 2 years ago

We don't have full control over what exactly the backend does upon flush(). Some backends (HDF5) may immediately write the data to disk; others (ADIOS2) may stage it inside their own memory. There might in the future be an added backend-specific flush() control, because the new BP5 engine in ADIOS2 actually lets us choose. The only thing that we can really guarantee is that buffers are released after flushing; the rest is backend-specific.
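For completeness: selecting a particular ADIOS2 engine (where the ADIOS2 build provides it) should go through the same JSON options mechanism, roughly as below; the exact key names follow my reading of the backend-config docs:

```json
{
  "adios2": {
    "engine": {
      "type": "bp5"
    }
  }
}
```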

sbastrakov commented 2 years ago

Yes, I understand that. My point was more that the documentation could be more explicit about it, since with writing it's not obvious that calling flush() does not necessarily mean the file operation is actually performed at that moment. E.g., the last three PIConGPU-related people I discussed it with all thought flush() == file write, and that if we are past that point on the PIConGPU side, the file must have been written already.

franzpoeschel commented 2 years ago

Ah, ok. I've added it as a TODO to this PR because it fits well within it.