Closed duncan-brown closed 2 months ago
BTW, if someone wants to turn this on globally, the best way is to add a method to this class here (a wrapper of the h5py.File class):
https://github.com/ligo-cbc/pycbc/blob/master/pycbc/io/hdf.py#L24
and simply modify __setitem__ / __getitem__. We've done a similar thing in the statmap code, which could be moved here to also enable a certain compression level by default. Then everything just uses the interface as normal, without having to worry about manually setting a lot of these things (with a worse user interface). We can then transition the file-open calls to this class as we find the time.
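A minimal sketch of that wrapper, assuming we default both fletcher32 and a gzip level inside create_dataset (the HFile name matches pycbc/io/hdf.py, but these defaults and the scalar guard are illustrative, not existing PyCBC behaviour):

```python
import os
import tempfile

import h5py
import numpy as np

class HFile(h5py.File):
    """Wrapper of h5py.File that turns on fletcher32 checksums and a
    default gzip level for every dataset created through it."""

    def create_dataset(self, name, shape=None, dtype=None, data=None, **kwds):
        # Filters need chunked storage; h5py enables chunking
        # automatically when a filter is requested, but scalar datasets
        # cannot be chunked, so only apply the defaults to non-scalars.
        arr = np.asarray(data) if data is not None else None
        if shape not in (None, ()) or (arr is not None and arr.shape != ()):
            kwds.setdefault('fletcher32', True)
            kwds.setdefault('compression', 'gzip')
            if kwds.get('compression') == 'gzip':
                kwds.setdefault('compression_opts', 6)
        return super().create_dataset(name, shape=shape, dtype=dtype,
                                      data=data, **kwds)

# Datasets made through the wrapper carry the filters without the
# caller asking for them.
with tempfile.TemporaryDirectory() as tmp:
    with HFile(os.path.join(tmp, 'demo.hdf'), 'w') as f:
        dset = f.create_dataset('snr', data=np.arange(1000.0))
        has_checksum = dset.fletcher32
        gzip_on = dset.compression
```

Callers that explicitly pass fletcher32=False or another compression still win, since the wrapper only fills in defaults.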
On Thu, Mar 16, 2017 at 10:22 PM, Ian Harry notifications@github.com wrote:
This came up at the "HDF formats" discussion.
HDF does not have a top-level checksum a la frame files, but it does have a per-dataset checksum feature (fletcher32):
http://docs.h5py.org/en/latest/high/dataset.html#fletcher32-filter
We should probably just turn this on.
-- Dr. Alexander Nitz, Max Planck Institute for Gravitational Physics (Albert Einstein Institute), Callinstrasse 38, D-30167 Hannover, Germany. Tel: +49 511 762-17097
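For reference, the fletcher32 filter from the linked page is a per-dataset keyword: each chunk is checksummed on write and verified on read, so reading a corrupted chunk raises an error instead of silently returning bad data. A minimal usage sketch:

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'strain.hdf')
    data = np.random.normal(size=4096)

    # Enable the Fletcher-32 checksum filter on a single dataset.
    with h5py.File(path, 'w') as f:
        f.create_dataset('strain', data=data, fletcher32=True)

    # The filter is recorded in the file and checked on every read.
    with h5py.File(path, 'r') as f:
        dset = f['strain']
        checksummed = dset.fletcher32
        roundtrip_ok = np.array_equal(dset[:], data)
```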
Did anyone give this a try, and if so did it "just work"?
@ahnitz @spxiwh @titodalcanton did PyCBC adopt HDF5 checksums for its data files? If so, how well (or not) is that working? Thanks.
We did not implement the "per dataset checking feature" within PyCBC.
However, Pegasus implements checksum testing on all data files (similar to our own checks on frame files). After we added an option to stop it testing symlinked files, this has been working nicely. We often disable it for development runs, so we must remember to re-enable it for production runs! It doesn't help when the files are used outside of Pegasus, though, unless one also extracts the checksums for all files (which is possible). We might consider enabling the per-dataset feature, but it would not be ideal to have to enable it in every HDF call, as there are a lot of these throughout PyCBC. Some sort of environment variable to make it the default would be much easier. @GarethCabournDavies A point of discussion for the face-to-face tomorrow(?)
Are the Pegasus checksums maintained in a database or calculated on the fly before/after each file transfer? Note, in either case I think it would be worthwhile to also add internal checksums to the critical files. Thanks.
For enabling the h5py dataset checksum: could we insist that all calls to h5py.File go through HFile instead (this solves the 'every call' issue), and add the __setitem__ / __getitem__ changes as Alex suggests? I don't quite see how to do that yet, though.
This could put the environment-variable check into the HFile __init__ function, so that it is used but hidden from the user.
I would also need to check that e.g. file_object['new_dataset_name'] = data takes the same path, as we don't always go through the create_dataset method.
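A sketch of both points, assuming a hypothetical PYCBC_HDF_CHECKSUM environment variable (the name is illustrative, not an existing PyCBC option). In current h5py, Group.__setitem__ creates array datasets through self.create_dataset(), so overriding create_dataset on the file object also covers file_object['name'] = data at the root group; datasets created inside subgroups (plain h5py.Group objects) would still bypass the override:

```python
import os
import tempfile

import h5py
import numpy as np

def _checksum_default():
    # Hypothetical switch: set PYCBC_HDF_CHECKSUM=0 to opt out.
    return os.environ.get('PYCBC_HDF_CHECKSUM', '1') != '0'

class HFile(h5py.File):
    def create_dataset(self, name, shape=None, dtype=None, data=None, **kwds):
        # Filters need chunked storage and only non-scalar data can be
        # chunked, so skip the default for scalars.
        if _checksum_default() and data is not None and np.ndim(data) > 0:
            kwds.setdefault('fletcher32', True)
        return super().create_dataset(name, shape=shape, dtype=dtype,
                                      data=data, **kwds)

# Item assignment on the root group routes through create_dataset in
# h5py, so it picks up the checksum default as well.
with tempfile.TemporaryDirectory() as tmp:
    with HFile(os.path.join(tmp, 'out.hdf'), 'w') as f:
        f['new_dataset_name'] = np.arange(10)
        via_setitem = f['new_dataset_name'].fletcher32
```

That Group.__setitem__ routes through create_dataset is an h5py implementation detail, so a unit test like the check above would be worth keeping to catch any change in future h5py releases.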
Can be closed by #4831