kiyo-masui / bitshuffle

Filter for improving compression of typed binary data.

Debugging corrupted bitshuffle data #127

Open telegraphic opened 1 year ago

telegraphic commented 1 year ago

Hi @kiyo-masui, we have some SETI data stored with bitshuffle compression, and a small number of files appear to have become corrupted. (Here is one, FYI: https://bldata.berkeley.edu/blpd30_datax2/blc03_guppi_59132_36704_HIP111595_0078.rawspec.0002.h5)

h5py is happy to open the file, but barfs if you try and access the bitshuffled dataset:

In [3]: a = h5py.File('blc03_guppi_59132_36704_HIP111595_0078.rawspec.0002.h5', 'r')
In [4]: a['data']
Out[4]: <HDF5 dataset "data": shape (279, 1, 65536), type "<f4">

In [5]: d = a['data'][:]
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-5-fee15ce54759> in <module>
----> 1 d = a['data'][:]

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/opt/anaconda3/lib/python3.8/site-packages/h5py/_hl/dataset.py in __getitem__(self, args)
    571         mspace = h5s.create_simple(mshape)
    572         fspace = selection.id
--> 573         self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
    574
    575         # Patch up the output for NumPy

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5d.pyx in h5py.h5d.DatasetID.read()

h5py/_proxy.pyx in h5py._proxy.dset_rw()

h5py/_proxy.pyx in h5py._proxy.H5PY_H5Dread()

OSError: Can't read data (filter returned failure during read)

Do you think this file is recoverable (or partly recoverable)? Is there any way to turn on extra debug info in bitshuffle to help diagnose why it fails, and/or can bitshuffle skip over 'bad' chunks?

kiyo-masui commented 1 year ago

With a bit of hacking, I think you should be able to recover most of the data. First, I would just add print statements in bshuf_h5filter.c to figure out exactly which function is returning an error code and the value of that code (the core functions of bitshuffle have some specific error codes with meanings).
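
Before patching the C code, it may help to narrow down from the Python side which parts of the dataset fail. The sketch below is not part of bitshuffle or h5py: `find_unreadable_rows` is a hypothetical helper that reads one index along the first axis at a time and records the ones that raise the same OSError as in the traceback above. It assumes a filter failure only affects reads that touch a damaged chunk. If a chunk spans several rows, all of those rows will be reported together.

```python
def find_unreadable_rows(dset):
    """Return the indices along axis 0 whose read raises OSError.

    `dset` is assumed to behave like an h5py dataset: it has a `.shape`
    attribute and supports integer indexing along the first axis.
    """
    bad = []
    for i in range(dset.shape[0]):
        try:
            dset[i]  # forces the chunk(s) covering row i through the filter
        except OSError:
            bad.append(i)
    return bad


# Possible usage with the file from this issue (requires h5py and the
# bitshuffle HDF5 plugin to be installed and discoverable):
#
#   import h5py
#   with h5py.File('blc03_guppi_59132_36704_HIP111595_0078.rawspec.0002.h5', 'r') as f:
#       print(find_unreadable_rows(f['data']))
```

Rows that are not in the returned list can then be copied into a fresh file, which gives a partial recovery even if the damaged chunks themselves turn out to be unrecoverable.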

telegraphic commented 1 year ago

Thanks @kiyo-masui, I'll take a look following that strategy.

As it's an issue with decompression, looks like here is a good place to start: https://github.com/kiyo-masui/bitshuffle/blob/fdfcd404ac8dcb828857a90c559d36d8ac4c2968/src/bshuf_h5filter.c#L183

Which calls: https://github.com/kiyo-masui/bitshuffle/blob/ac791b73d164068661566bbe4335fc7158372c49/src/bitshuffle.c#L238

And then each block is done with: https://github.com/kiyo-masui/bitshuffle/blob/fdfcd404ac8dcb828857a90c559d36d8ac4c2968/src/bitshuffle.c#L78
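
To see what those functions are being handed, one option is to inspect the raw bytes of a failing chunk directly. The sketch below rests on my understanding of the chunk layout when the filter is used with its LZ4 option, which should be checked against bshuf_h5filter.c: an 8-byte big-endian uncompressed size, a 4-byte big-endian block size in bytes, then each block prefixed by a 4-byte big-endian compressed length. `parse_bshuf_lz4_chunk_header` is a hypothetical helper, not a bitshuffle function.

```python
import struct


def parse_bshuf_lz4_chunk_header(raw):
    """Decode the leading fields of a raw bitshuffle+LZ4 chunk.

    ASSUMED layout (verify against bshuf_h5filter.c):
      bytes 0-7   uncompressed chunk size in bytes, big-endian uint64
      bytes 8-11  block size in bytes, big-endian uint32
      bytes 12-15 compressed length of the first block, big-endian uint32
    """
    if len(raw) < 16:
        raise ValueError("chunk is too short to hold the assumed header")
    uncompressed_nbytes, block_nbytes = struct.unpack(">QI", raw[:12])
    (first_block_nbytes,) = struct.unpack(">I", raw[12:16])
    return uncompressed_nbytes, block_nbytes, first_block_nbytes


# Possible usage: h5py can return a chunk's stored bytes without running
# the filter pipeline, so this works even when the filter fails.
#
#   filter_mask, raw = a['data'].id.read_direct_chunk((0, 0, 0))
#   print(parse_bshuf_lz4_chunk_header(raw))
```

If the decoded uncompressed size does not match the chunk shape times the element size, or the first block length is larger than the chunk itself, the corruption is probably in or before the header rather than in a later block.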