Support saving data in HDF5 files

lnls-fac / apsuite

Accelerator Physics suite

MIT License


Support saving data in HDF5 files #243

Closed ericonr closed 1 year ago

ericonr commented 1 year ago

There are reasons to prefer saving as pickle, but when processing multiple big data files (FOFB SysId ones, in our case), simply loading them can take a long time and use a lot of memory. Something like h5py can handle the data more efficiently while still keeping most of the "dict of Python values" aspect of pickled data.

This might make sense as a general feature or specifically for SysId data. We can definitely discuss it :)
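To make the "dict of Python values" idea concrete, here is a minimal sketch of a round trip through h5py. The helper names (`save_dict_to_hdf5`, `load_dict_from_hdf5`) and the dict keys are hypothetical and are not part of apsuite; the sketch assumes the dict only holds nested dicts, NumPy arrays and plain numbers.

```python
import os
import tempfile

import h5py
import numpy as np


def save_dict_to_hdf5(data, path):
    """Write a (possibly nested) dict of arrays and numbers to an HDF5 file."""
    def _write(group, dic):
        for key, value in dic.items():
            if isinstance(value, dict):
                # Nested dicts map naturally onto HDF5 groups.
                _write(group.create_group(key), value)
            else:
                # Arrays and numbers become datasets.
                group.create_dataset(key, data=value)

    with h5py.File(path, "w") as fil:
        _write(fil, data)


def load_dict_from_hdf5(path):
    """Read the whole file back into a nested dict (loads everything)."""
    def _read(group):
        out = {}
        for key, item in group.items():
            if isinstance(item, h5py.Group):
                out[key] = _read(item)
            else:
                # item[()] reads the full dataset into memory.
                out[key] = item[()]
        return out

    with h5py.File(path, "r") as fil:
        return _read(fil)


# Hypothetical example data; key names and shapes are made up.
example = {
    "timestamp": 1.5,
    "data": {"orbx": np.arange(12.0).reshape(3, 4)},
}
example_path = os.path.join(tempfile.mkdtemp(), "example.h5")
save_dict_to_hdf5(example, example_path)
restored = load_dict_from_hdf5(example_path)
```

Values that pickle handles transparently (arbitrary objects, `None`, mixed-type lists) have no direct HDF5 equivalent and would need explicit conversion, and strings may come back as `bytes` depending on the h5py version, which is one of the reasons to keep pickle as an option.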

fernandohds564 commented 1 year ago

Nice suggestion, @ericonr! I think it's easy to generalize our saving and loading methods to handle both formats, depending on the user's preference.

Just for the record, could you post an example here comparing both data formats in terms of saving and loading times and memory usage?

ericonr commented 1 year ago

Loading and processing 10 files, with 4 processes (multiprocessing):

HDF5: 47 s, about 2.6 GB of RAM per process
pickle: 1 min 48 s, about 3.5 GB of RAM per process

The RAM difference is likely because we only use part of the arrays, so the whole file doesn't have to be loaded into RAM (which is an advantage of HDF5).
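As a rough illustration of that partial loading, the sketch below writes a dummy array and then reads back only a slice of it. The file name, the dataset name `orbx` and the shapes are assumptions made up for the example, not the real SysId layout.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file with one large 2D dataset.
path = os.path.join(tempfile.mkdtemp(), "sysid_example.h5")
full = np.random.default_rng(0).normal(size=(10_000, 160))

with h5py.File(path, "w") as fil:
    fil.create_dataset("orbx", data=full)

with h5py.File(path, "r") as fil:
    dset = fil["orbx"]      # a handle to the dataset; no array data read yet
    shape = dset.shape      # shape and dtype come from the file metadata
    part = dset[:1000, :8]  # only the requested slice becomes a NumPy array

print(shape, part.shape, part.nbytes / full.nbytes)
```

With pickle, the equivalent code has to unpickle the whole object before any slicing can happen, so the full array sits in memory at least temporarily.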

I haven't measured saving times yet, though.
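For the saving side, something along these lines could be used to time both formats. This is only a sketch using the standard library: the helpers `timed` and `peak_memory` are hypothetical, the data is a small stand-in, and it is not the script that produced the numbers above. An h5py-based save or load function could be passed in the same way as the pickle ones.

```python
import os
import pickle
import tempfile
import time
import tracemalloc


def timed(func, *args):
    """Return (result, elapsed wall-clock seconds) for one call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def peak_memory(func, *args):
    """Return (result, peak bytes allocated through Python during the call).

    tracemalloc slows the call down, so timing is done separately in timed().
    """
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def save_pickle(data, path):
    with open(path, "wb") as fil:
        pickle.dump(data, fil)


def load_pickle(path):
    with open(path, "rb") as fil:
        return pickle.load(fil)


# Stand-in data; real SysId files would be much larger.
data = {"orbx": [float(i) for i in range(100_000)], "timestamp": 0.0}
path = os.path.join(tempfile.mkdtemp(), "example.pickle")

_, save_time = timed(save_pickle, data, path)
loaded, load_time = timed(load_pickle, path)
_, load_peak = peak_memory(load_pickle, path)

print(f"save: {save_time:.3f} s")
print(f"load: {load_time:.3f} s, peak {load_peak / 1e6:.1f} MB")
```

Note that tracemalloc only sees allocations made through Python's allocators, so buffers allocated internally by a C library such as HDF5 may not be counted; for a per-process figure like the ones above, the resident memory reported by the OS is a better reference.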