ornladios / ADIOS

The old ADIOS 1.x code repository. See ADIOS2 for the new repo.
https://csmd.ornl.gov/adios

Slow read times with multi-timestep file using Python bindings #168

Open rmchurch opened 6 years ago

rmchurch commented 6 years ago

I have a bp file that has multiple (hundreds of) 1D arrays written every timestep, with a total of ~7000 timesteps. Using the Python bindings, reading a single variable is pretty slow:

f = ad.file(file)
key = f.var.keys()[0]
print key,f[key]
e_radial_mom_flux_ExB_df_avg AdiosVar (varid=107, dtype=dtype('float64'), ndim=1, dims=(167L,), nsteps=6961)
%time data = f[key][...]
CPU times: user 394 ms, sys: 942 ms, total: 1.34 s                                                          
Wall time: 1min 3s

If I convert using bp2h5, the conversion takes a long time (~30min), but the reading is much faster:

f = h5py.File(file)
print f[key]
<HDF5 dataset "e_radial_mom_flux_ExB_df_avg": shape (6961, 167), type "<f8">
%time data = f[key][...]
CPU times: user 7.65 ms, sys: 1.12 ms, total: 8.77 ms
Wall time: 8.78 ms

I assume this is because the h5 file has the data in a 2D array format, whereas in the original bp file the data for a single variable may not be contiguous due to the timestepping. Is there any way to improve this situation, either by changing the way the file is written or by changing how I read the data with Python? I often want to read in all of the data from the file, but this takes a long time, even though it's not much data.
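One possible workaround (an editor's sketch, not from the thread): since the slow part is the first sequential pass through all ~7000 steps, that cost can be paid once and every variable cached into a NumPy .npz archive, after which loads are fast regardless of the bp layout. This uses only the adios calls shown above plus numpy; the cache file name is an illustrative choice.

import adios as ad
import numpy as np

f = ad.file('xgc.oneddiag.bp')
# Slow first pass: pull every variable (all steps) out of the bp file once.
data = dict((key, f[key][...]) for key in f.var.keys())
f.close()

# One-time cache; keys are the original variable names (assumed to be
# acceptable npz keys).
np.savez_compressed('oneddiag_cache.npz', **data)

# Later sessions load from the cache in milliseconds:
cached = np.load('oneddiag_cache.npz')
print(cached['e_radial_mom_flux_ExB_df_avg'].shape)  # (6961, 167)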

pnorbert commented 6 years ago

Michael,

Can you make your file available for us at OLCF or NERSC? Is this a single bp file or a directory with many subfiles? This is the diagnostics written by a single process, right?

I just made a test file on my VM of 200 variables and 7000 steps (each variable is a 5x5 2D array) and the read time is fast.

AdiosVar (varid=7, name='v001', dtype=dtype('int32'), ndim=2, dims=(5L, 5L), nsteps=7000, attrs=[]) 0.37046790123

This is my Python test reader:

#!/usr/bin/python

import numpy
import adios
from timeit import default_timer as timer

f = adios.file('many_vars.bp')
v = f.var['v001']
print v
s = timer()
data = v.read()
e = timer()
print(e - s)
f.close()

Thanks,
Norbert


rmchurch commented 6 years ago

The data is on Edison, /scratch2/scratchdirs/rchurchi/xgca_rmc_mira043redo/xgc.oneddiag.bp It's a single file (yes, written by a single process), no directories with subfiles (total size is only about 1.5 GB). I found that on the second read of the same data, the read time drops to 0.5 s; I'm not sure if this is caching done by Edison or by ADIOS.

pnorbert commented 6 years ago

Okay, I see. ADIOS does not cache it. It is the system that caches data. With the current file format and read implementation, there are 7000 consecutive seeks and reads to get the array with all steps, and this is slow for remote disks. The next time it's reading from cache and it's much faster.
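A minimal sketch of how to observe that effect, assuming the same adios reader API used in the test script above (nothing beyond the adios.file/read() calls already shown in this thread; absolute timings depend on the filesystem):

import adios
from timeit import default_timer as timer

f = adios.file('xgc.oneddiag.bp')
v = f.var['e_radial_mom_flux_ExB_df_avg']

# Cold read: ~7000 consecutive seek+read pairs against the remote disk.
s = timer()
data = v.read()
print('first read:  %.3f s' % (timer() - s))

# Warm read: the OS page cache now holds the blocks, so the same
# seeks are served from memory.
s = timer()
data = v.read()
print('second read: %.3f s' % (timer() - s))

f.close()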

I wonder where the hdf5 file was when you got the data in a few milliseconds.


rmchurch commented 6 years ago

HDF5 is in the same location, you can try it there also (just h5 suffix instead of bp).

pnorbert commented 6 years ago

I meant, was it in cache or not?


rmchurch commented 6 years ago

I don't think so. I tried both the bp and h5 files today, after having last accessed them last week (so I assume both were out of cache by now). Both had the same read timings as before, and both showed the same characteristic that the second read of the same data takes much less time (suggesting it was then cached). The HDF5 data took about 100 ms on the first read, whereas the bp file took 1 minute on the first read.