gwastro / pycbc

Core package to analyze gravitational-wave data, find signals, and study their parameters. This package was used in the first direct detection of gravitational waves (GW150914), and is used in the ongoing analysis of LIGO/Virgo data.
http://pycbc.org
GNU General Public License v3.0

Check to see if our HDF5 files have checksums turned on #1525

Closed: duncan-brown closed this 2 months ago

spxiwh commented 7 years ago

This came up at the "HDF formats" discussion.

HDF5 does not have a top-level checksum a la frame files, but it does have a per-dataset checksum feature (fletcher32):

http://docs.h5py.org/en/latest/high/dataset.html#fletcher32-filter

We should probably just turn this on.
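For reference, a minimal sketch of what turning this on looks like in h5py (the file and dataset names here are just illustrative):

```python
import numpy as np
import h5py

# Enabling the fletcher32 checksum filter on a single dataset.
# The checksum is computed per chunk on write and verified
# automatically on every read; a corrupted chunk raises an OSError
# instead of silently returning bad data.
with h5py.File("example.hdf", "w") as f:
    f.create_dataset("snr", data=np.random.rand(1024), fletcher32=True)

with h5py.File("example.hdf", "r") as f:
    data = f["snr"][:]  # reading triggers the checksum verification
```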

ahnitz commented 7 years ago

BTW, if someone wants to turn this on globally, the best way is to add a method to this class here (a wrapper of the h5py.File class):

https://github.com/ligo-cbc/pycbc/blob/master/pycbc/io/hdf.py#L24

and simply modify __setitem__ / __getitem__. We've done a similar thing in the statmap code, which could be moved here to also enable a certain compression level by default. Then everything just uses the interface as normal, without having to set a lot of these things manually (which would be a worse user interface). We can then transition the file-open calls to this class as we find the time.
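A rough sketch of that idea (the actual class lives in pycbc/io/hdf.py; the default filter settings and the scalar guard below are assumptions, not the implemented behaviour):

```python
import numpy as np
import h5py

class HFile(h5py.File):
    """Wrapper that applies default filters on item assignment.

    Simplified: the real h5py __setitem__ also handles links and
    references, which this sketch ignores.
    """

    def __setitem__(self, name, data):
        data = np.asarray(data)
        if data.shape == ():
            # Filters require chunked storage, which scalar datasets
            # cannot use, so fall back to a plain write.
            super().__setitem__(name, data)
        else:
            self.create_dataset(name, data=data, fletcher32=True,
                                compression="gzip", compression_opts=6)

# Callers then use the normal h5py interface:
with HFile("out.hdf", "w") as f:
    f["snr"] = np.random.rand(1000)  # checksummed and compressed
```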


stuartthebruce commented 5 years ago

Did anyone give this a try, and if so did it "just work"?

stuartthebruce commented 2 years ago

@ahnitz @spxiwh @titodalcanton did PyCBC adopt HDF5 checksums for its data files? If so, how well (or not) is that working? Thanks.

spxiwh commented 2 years ago

We did not implement the "per dataset checking feature" within PyCBC.

However, Pegasus implemented checksum testing on all data files (similar to our own checks on frame files). After we added an option to stop it testing symlinked files, this has been working nicely. We often disable it for development runs, so must remember to re-enable it for production runs!

This doesn't help when using the files outside of Pegasus, though, unless one also extracts the checksums for all files (which is possible). We might consider enabling the HDF5 feature, but it would not be ideal to have to enable it in every HDF call, as there are a lot of these throughout PyCBC. Some sort of environment variable to make this the default would be much easier.

@GarethCabournDavies a point of discussion for the face-to-face tomorrow(?)
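If the internal checksums were enabled, verifying a file outside of Pegasus could be as simple as reading everything back, since h5py checks fletcher32 checksums transparently on read. A hypothetical helper (the name and structure are assumptions):

```python
import h5py

def verify_hdf_checksums(path):
    """Read every dataset in the file; h5py raises an OSError when a
    fletcher32-checksummed chunk fails verification. Datasets without
    the filter pass trivially."""
    with h5py.File(path, "r") as f:
        def _read(name, obj):
            if isinstance(obj, h5py.Dataset):
                obj[()]  # the read itself performs the checksum test
        f.visititems(_read)

verify_hdf_checksums("example.hdf")
```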

stuartthebruce commented 2 years ago

Are the Pegasus checksums maintained in a database or calculated on the fly before/after each file transfer? Note that in either case I think it would be worthwhile to also add internal checksums to the critical files. Thanks.

GarethCabournDavies commented 1 year ago

For enabling the h5py dataset checksum:

Could we insist on all calls to h5py.File going through HFile instead (this solves the 'every call' issue) and add the __setitem__/__getitem__ changes as Alex suggests? I don't see quite how to do that yet, though.

This could add the environment variable check into the HFile __init__ function, so that it is used but hidden from the user.
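Something along these lines, assuming a hypothetical PYCBC_HDF_CHECKSUM environment variable (neither the variable name nor the behaviour is settled):

```python
import os
import h5py

class HFile(h5py.File):
    def __init__(self, *args, **kwargs):
        # Hidden from the user: an environment variable decides whether
        # datasets get fletcher32 checksums by default.
        self._use_checksum = os.environ.get("PYCBC_HDF_CHECKSUM", "0") == "1"
        super().__init__(*args, **kwargs)

    def create_dataset(self, name, *args, **kwargs):
        if self._use_checksum:
            kwargs.setdefault("fletcher32", True)
        return super().create_dataset(name, *args, **kwargs)
```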

GarethCabournDavies commented 1 year ago

I would also need to check that e.g. file_object['new_dataset_name'] = data applies the checksum, as we don't always go through the create_dataset method.
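A quick check in plain h5py suggests item assignment does not pick up the filter, which is why __setitem__ would need overriding as in the sketch above:

```python
import numpy as np
import h5py

with h5py.File("check.hdf", "w") as f:
    f["via_setitem"] = np.arange(10)  # plain assignment: no filter
    f.create_dataset("via_create", data=np.arange(10), fletcher32=True)
    print(f["via_setitem"].fletcher32)  # False
    print(f["via_create"].fletcher32)   # True
```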

GarethCabournDavies commented 2 months ago

Can be closed by #4831