Closed: aveksler1 closed this issue 4 months ago.
Thank you for the details!
We chatted further on Slack and using HDF5 with group-based encoding should be way faster. I suspect that we accidentally close and open the same file all the time in openPMD-viewer - let's debug this!
In the meantime, you might be faster if you read with openPMD-api for now until we fix this here: https://openpmd-api.readthedocs.io/en/0.15.1/usage/firstread.html
I have the same problem. Also, if I'm iterating over a lot of files with `ts.get_field`, it says `python errno = 24: Too many open files`.
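The `errno = 24` here is the OS-level `EMFILE` limit on simultaneously open file descriptors, not an openPMD-specific error. A minimal sketch (plain Python, no openPMD involved) of how leaked handles accumulate toward that limit:

```python
import os
import tempfile

# Each file opened and never closed consumes one descriptor; once the
# process hits the soft RLIMIT_NOFILE limit, open() fails with
# errno 24 (EMFILE, "Too many open files").
leaked = []
with tempfile.TemporaryDirectory() as d:
    for i in range(10):
        path = os.path.join(d, f"iter_{i}.txt")
        with open(path, "w") as f:
            f.write("data")
        leaked.append(open(path))  # handle kept alive, like an unclosed iteration
    print(len(leaked))  # → 10 descriptors still held
    for f in leaked:
        f.close()  # releasing handles is the fix, cf. Iteration::close()
```

On Linux, `resource.getrlimit(resource.RLIMIT_NOFILE)` shows the soft/hard limits a run would hit.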
Does this still happen when you close Iterations with `Iteration::close()`? If you don't, you will at some point run into that problem.
Note that we (currently) don't support re-opening an Iteration once closed, but it's on the agenda.
EDIT: Aah, just noticed that this is the openPMD-viewer. Yep, for that one we will probably need the functionality described above.
Will look at the rest of the issue tomorrow
I created a small dataset on Frontier to check things out. The total is 100 GB for 500 steps: /lustre/orion/csc303/scratch/junmin/oct2023/Test/issue380/. With the same input, I stored it in file-based mode and in group-based mode, all in BP5.
Using Remi's script, output reading file-based mode:

```
Processing file f/defaultBP5-node-ews_diags/diag1/
Time to get fields: 0.3335 s
Time to get fields: 0.3615 s
Time to get fields: 0.4046 s
Time to get fields: 0.3177 s
Time to get fields: 0.2831 s
Time to get fields: 0.3511 s
Time to get fields: 0.3737 s
Time to get fields: 0.3670 s
Time to get fields: 0.3218 s
Time to get fields: 0.3098 s
Time to get fields: 0.3574 s
Time to get fields: 0.3913 s
Time to get fields: 0.3470 s
Time to get fields: 0.2941 s
Time to get fields: 0.4090 s
Time to get fields: 0.3446 s
Time to get fields: 0.3648 s
Time to get fields: 0.3623 s
Time to get fields: 0.2748 s
Time to get fields: 0.3575 s
```

Output with group-based mode:

```
Processing file g/defaultBP5-node-ews_diags/diag1/
Time to get fields: 0.0535 s
Time to get fields: 0.0020 s
Time to get fields: 0.0018 s
Time to get fields: 0.0017 s
Time to get fields: 0.0009 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
Time to get fields: 0.0007 s
```
So I don't observe the pattern Remi observed. Let me try BP4 next
I think Remi used BP4 with file encoding. Group encoding is stable.
```
Processing file bp4-f/defaultBP4-node_diags/diag1/
Time to get fields: 0.0303 s
Time to get fields: 0.0286 s
Time to get fields: 0.0307 s
Time to get fields: 0.0293 s
Time to get fields: 0.0281 s
Time to get fields: 0.0281 s
Time to get fields: 0.0323 s
Time to get fields: 0.0302 s
Time to get fields: 0.1664 s
Time to get fields: 0.0298 s
Time to get fields: 0.3176 s
Time to get fields: 0.0483 s
Time to get fields: 0.1922 s
Time to get fields: 0.0339 s
Time to get fields: 0.1047 s
Time to get fields: 0.0860 s
Time to get fields: 0.0996 s
Time to get fields: 0.0559 s
Time to get fields: 0.3539 s
Time to get fields: 0.2703 s
```

```
Processing file bp4-g/defaultBP4-node_diags/diag1/
Time to get fields: 6.3459 s
Time to get fields: 0.3049 s
Time to get fields: 0.3861 s
Time to get fields: 0.3260 s
Time to get fields: 0.2982 s
Time to get fields: 0.3007 s
Time to get fields: 0.3196 s
Time to get fields: 0.3372 s
Time to get fields: 0.3312 s
Time to get fields: 0.3315 s
Time to get fields: 0.3551 s
Time to get fields: 0.3273 s
Time to get fields: 0.3156 s
Time to get fields: 0.3091 s
Time to get fields: 0.3020 s
Time to get fields: 0.3402 s
Time to get fields: 0.4501 s
Time to get fields: 0.4256 s
Time to get fields: 0.3838 s
Time to get fields: 0.3506 s
```
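To compare runs like the ones above without eyeballing them, a tiny helper can average the reported times (a sketch; only the `Time to get fields: ... s` line format is taken from the logs shown):

```python
import re
import statistics

def mean_get_fields_time(log: str) -> float:
    """Average the 'Time to get fields' entries in a benchmark log."""
    times = [float(t) for t in re.findall(r"Time to get fields: ([0-9.]+) s", log)]
    return statistics.mean(times)

# Demo on the first three entries of the bp4-f log above:
sample = """Time to get fields: 0.0303 s
Time to get fields: 0.0286 s
Time to get fields: 0.0307 s"""
print(round(mean_get_fields_time(sample), 4))  # → 0.0299
```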
So for the time being, switching to BP5 would be the easiest workaround.
For fast performance, group-based encoding would be better than file-based.
It would be good to add read support for variable-based encoding in the openPMD-viewer. I think the performance should be similar to the group-based one.
@guj I'm not sure if you're really measuring a slowdown and not just slight performance fluctuations:
@aveksler1 This is obviously difficult to debug from afar, but let's try. Please do only steps 1-3 of this so far, since those might already confirm the cause that I think most likely:

1. (`simData_0.bp` and `simData_2000.bp`) and show the output of `bpls -D simData_<timestep>.bp` as well as `ls simData_<timestep>.bp`?
2. `export OPENPMD_VERBOSE=1`? If the output gets too large, maybe filter especially for `OPEN_FILE` and `CLOSE_FILE`, but ideally just redirect the verbose output to some file and upload it. If you don't use openPMD-api 0.15.2, maybe upgrade and then do that.
3. `series.flush()` was done inside Python. If you really want to go all the way, a performance analysis of the C++ side would be interesting. The most useful results are normally produced by Google Perftools. They don't require recompilation of the profiled code, but the installation of the package might be troublesome, so skip this if it's too much work.

There are two possible reasons that I currently see for this slowdown: `close()`ing iterations and this somewhat badly affects the ADIOS2 backend for whatever reason.

@franzpoeschel the plot time you have is for group-based encodings. The times from BP4 and BP5 with both file/group encodings are in the plot below.
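For the `OPENPMD_VERBOSE` step, the filtering could look like this (a sketch; `sim.py`, `verbose.log`, and the exact log line layout are placeholders, only the `OPEN_FILE`/`CLOSE_FILE` keywords come from the suggestion above):

```shell
# Real usage (commented out; needs an actual simulation/reader script):
# OPENPMD_VERBOSE=1 python sim.py 2>&1 | grep -E 'OPEN_FILE|CLOSE_FILE' > verbose.log

# The grep filter itself, demonstrated on sample lines:
printf 'OPEN_FILE simData_0.bp\nsome other verbose line\nCLOSE_FILE simData_0.bp\n' \
  | grep -E 'OPEN_FILE|CLOSE_FILE'
```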
Here are the major ways people iterate over data in openPMD-viewer:

Time Series Explicit Access

```python
from openpmd_viewer import OpenPMDTimeSeries

ts = OpenPMDTimeSeries('./example-2d/hdf5/')
N_iterations = len(ts.iterations)
for i in range(N_iterations):
    rho, info_rho = ts.get_field(iteration=ts.iterations[i], field='rho')
    # or more fields or particles
```
Same used with explicit iterations in the GUI https://openpmd-viewer.readthedocs.io/en/latest/tutorials/3_Introduction-to-the-GUI.html
Iterate
See #400 https://openpmd-viewer.readthedocs.io/en/latest/tutorials/4_Particle_selection.html#Reconstructing-particle-trajectories https://openpmd-viewer.readthedocs.io/en/latest/api_reference/generic_interface.html#openpmd_viewer.OpenPMDTimeSeries.iterate
Note: after a first iterate, a user might want to do this again on the same time series. There are currently some limitations in `series.close()` about this, @franzpoeschel.

Let's discuss if we can generalize this and potentially also add an iteration filter mode. For large processing tasks, this mode would allow for way more optimizations (e.g., visiting iterations once, freeing resources) than the explicit access in the example above.
Given that the openPMD-viewer is, in this context, a wrapper around the openPMD-api, some optimizations are probably possible today already, like closing an iteration when the next one is accessed. The downside is that an iteration, once closed, currently cannot be reopened, which can be internally worked around by closing and reopening the Series (ideally with `{"defer_iteration_parsing": true}` to avoid an overhead from that).
The access patterns themselves are probably fine, it's more a question of how they are implemented in the openPMD-viewer (or in future: in the openPMD-api).
As discussed offline, one important next step for the openPMD-api is better support for typical access patterns, with such workarounds ideally being moved into the openPMD-api. My idea is that, depending on the access configuration, you get an iterator from `Series::readIterations()` that has different feature sets. When using `Access::READ_ONLY` (= random access), it would mostly be equivalent to using `Series::iterations`. Using `READ_LINEAR` would give you an iterator that accesses "one iteration at a time", i.e. jumping from ADIOS step to ADIOS step, jumping from file to file. In that mode, optimizations like closing files before opening the next, or even a cache of open files with some pre-loading, might be thinkable.
In the end, this would ideally close the gap we have today between the random-access API (which is too permissive) and the streaming API (which is too restrictive).
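The `READ_LINEAR` behavior sketched above (close the previous iteration before handing out the next) can be mimicked with a plain generator. `FakeIteration` below is a stand-in for an openPMD `Iteration`, not real API:

```python
class FakeIteration:
    """Stand-in for an openPMD Iteration; only tracks open/closed state."""
    def __init__(self, index):
        self.index = index
        self.open = True

    def close(self):
        self.open = False

def linear_read(iterations):
    # Yield iterations one at a time, closing the previous one before
    # handing out the next -- the essence of READ_LINEAR access.
    prev = None
    for it in iterations:
        if prev is not None:
            prev.close()
        yield it
        prev = it
    if prev is not None:
        prev.close()

its = [FakeIteration(i) for i in range(3)]
for it in linear_read(its):
    pass  # process fields here; at most one iteration is open at a time
print(all(not it.open for it in its))  # → True: every iteration was closed
```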
While we generally think the solution will be to go to BP5, we currently see that file based encoding in BP5 is slower than BP4 in the benchmarks above.
@guj and @pnorbert are investigating further.
Issue #400 is a smaller reproducer of this issue, which contains HDF5 and ADIOS2 files.
@aveksler1 if you still have the issue, can you profile your code like this? https://github.com/openPMD/openPMD-viewer/issues/400#issuecomment-1821318604
The general solution is to use `groupBased` (g) or even `variableBased` (v) encoding with BP4 and BP5.

https://warpx.readthedocs.io/en/latest/usage/parameters.html#diagnostics-and-output

```
<diag_name>.openpmd_backend = bp
<diag_name>.openpmd_encoding = g  # soon: v
```
We plan for openPMD-api 0.16+ and corresponding WarpX releases to go from BP4 to BP5 and from groupBased to variableBased encoding.
@ax3l I'll run test cases with h5 and bp backends and different encodings, 'f' and 'g'. I'll compare the timing results from consecutive calls to `ts.get_field()`. If I still have the issue, I'll profile the code.
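A small harness for that comparison could look like this (a sketch; `read_fn` stands in for a call like `ts.get_field(iteration=..., field='rho')`):

```python
import time

def benchmark_reads(read_fn, iterations):
    """Time consecutive calls to a reader and return per-call durations.

    A steadily growing trend in the returned list is the symptom
    discussed in this issue; roughly flat times are the expected behavior.
    """
    durations = []
    for it in iterations:
        t0 = time.perf_counter()
        read_fn(it)
        durations.append(time.perf_counter() - t0)
    return durations

# Demo with a dummy reader in place of ts.get_field:
times = benchmark_reads(lambda it: sum(range(1000)), range(5))
print(len(times), all(t >= 0 for t in times))  # → 5 True
```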
I have found the main cause of this slowdown. In `ADIOS2IOHandler::flush()`, a call to `m_dirty.clear()` is missing, meaning that every ADIOS2 file is flushed again each time.
I have added a fix as part of https://github.com/openPMD/openPMD-api/pull/1598.
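In Python terms, the bug corresponds to a flush that never empties its dirty set, so every earlier file is re-flushed on each call; total work grows quadratically with the number of files instead of linearly (a simplified toy model, not the actual C++ code):

```python
class ModelIOHandler:
    """Toy model of a dirty-file tracker, illustrating the missing clear()."""
    def __init__(self, clear_after_flush):
        self.clear_after_flush = clear_after_flush
        self.m_dirty = set()
        self.flush_count = 0  # total per-file flush operations performed

    def touch(self, filename):
        self.m_dirty.add(filename)

    def flush(self):
        self.flush_count += len(self.m_dirty)  # work grows with the dirty set
        if self.clear_after_flush:
            self.m_dirty.clear()  # the call that was missing

buggy, fixed = ModelIOHandler(False), ModelIOHandler(True)
for handler in (buggy, fixed):
    for i in range(4):
        handler.touch(f"simData_{i}.bp")
        handler.flush()
print(buggy.flush_count, fixed.flush_count)  # → 10 4 (quadratic vs. linear)
```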
openPMD-api is currently not really well-optimized for Series with many iterations (my personal focus has been more on Series with few very big iterations), so researching this bug uncovered more than one performance issue. The linked PR is hence not yet ready, as it contains other (less trivial) fixes, and additionally there are some weird findings that I still need to look at.
Until that PR is merged, a workaround might be adding `ts.data_reader.series.iterations[it].close()` in the reader code once you don't need the iteration any more. Note that you won't be able to access the iteration any more, as there is no re-opening logic (yet).
Thank you so much! I was able to avoid the problem by only using h5 output, but having the ability to rapidly post-process ADIOS2 files will be a great help when we start running simulations that benefit from ADIOS2 output.
Performance analysis:
Action items:

- openPMD-api: fix internal regressions https://github.com/openPMD/openPMD-api/pull/1598
- openPMD-api: enable `.close()` and reopening of an iteration (in both random access and `.readIteration`/linear read) https://github.com/openPMD/openPMD-api/issues/1606
- openPMD-viewer: close out steps (e.g., in `ts.iterate()` or after we opened N further steps in random access mode)

Per https://github.com/openPMD/openPMD-viewer/issues/380#issuecomment-1992081339
The specific issue is closed and will be released with openPMD-api 0.16.0.
Generally, the overhead of opening large series with many iterations should be linear. Please continue to report issues if you see otherwise :pray:
Hi all,
I'm using openPMD-viewer to loop through the outputs of a WarpX simulation and perform a calculation on fields at every iteration. I noticed that as I loop through `ts.iterations`, the `get_fields()` call takes longer and longer. I'm running this script on a local cluster, and the `ts.backend` is `openpmd-api`.

The script where I am seeing the slowdown looks like this. I am not able to recreate the problem using the openPMD example datasets. The dataset I am using (which I unfortunately cannot share) is 4000 iterations of E and B fields on a 72x72x144 grid. Every .bp directory is ~35 MB, so around 140 GB in total.
This outputs:
and this scales worse and worse as I increase the length of `iterations`.

I tracked down the offending function to be `get_data` in io_reader/utilities, specifically the `series.flush()` call repeated for every chunk. Timing it shows it takes up the majority of the `ts.get_fields()` call.

I don't see this behavior when the openPMD outputs are .h5 files (but still the openpmd-api backend). Any help would be much appreciated!