silx-kit / hdf5plugin

Set of compression filters for h5py
http://www.silx.org/doc/hdf5plugin/latest/

Update hdf5-blosc2 to support b2nd #282

Closed ivilata closed 8 months ago

ivilata commented 8 months ago

This provides a few updates to support Blosc2 NDim (2-level partitioning or b2nd, as introduced in Blosc2 2.7.0, see https://www.blosc.org/posts/blosc2-ndim-intro/) for arrays with rank >= 2. The code comes from PyTables master commit bb02c88 (where it passes all tests); part of it was already released in PyTables v3.9.0 and the rest will be in the next version (though it is completely compatible with the released one and not expected to change until then). A minor documentation fix and an updated unit test are also included.

For such b2nd-enabled arrays, the filter stores each HDF5 chunk as a Blosc2 multidimensional array with a fixed shape (even for margin chunks, where data may not cover the whole chunk) that itself consists of a single inner chunk. It also adds the HDF5 chunk rank and shape to the stored filter parameters (so that further chunks added with the filter have consistent chunk and block sizes).
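
As a rough illustration of the fixed-shape point above (not the filter's actual code), the following pure-Python sketch lists, for a dataset split into HDF5 chunks, the shape each chunk would be stored with and the part of it that actually holds data. The function name `chunk_layout` and the example shapes are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import ceil


def chunk_layout(dataset_shape, chunk_shape):
    """Yield (chunk_index, stored_shape, valid_shape) for every HDF5 chunk.

    Hypothetical helper: it only models the shapes involved, it does not
    compress anything or call Blosc2.
    """
    if len(dataset_shape) != len(chunk_shape):
        raise ValueError("dataset and chunk must have the same rank")
    # Number of chunks along each dimension, rounded up to cover the margins.
    grid = [ceil(d / c) for d, c in zip(dataset_shape, chunk_shape)]
    for index in product(*(range(n) for n in grid)):
        # Extent of real data in this chunk; smaller than the chunk on margins.
        valid = tuple(
            min(c, d - i * c)
            for i, c, d in zip(index, chunk_shape, dataset_shape)
        )
        # The stored array always has the full chunk shape.
        yield index, tuple(chunk_shape), valid


if __name__ == "__main__":
    for index, stored, valid in chunk_layout((5, 7), (2, 4)):
        print(index, stored, valid)
```

In this example the last chunk, at index (2, 1), holds only a 1x3 region of data but would still be stored with the full 2x4 chunk shape.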

To enable reading datasets generated by other code, the filter is more lax regarding the inner structure of Blosc2 arrays (they may consist of several inner chunks or use different block sizes). It may also work without the extra filter parameters (though it has to skip some data validation, so it warns about this).

See PyTables/PyTables#1056 and PyTables/PyTables#1072 for more information.

t20100 commented 8 months ago

Thanks for the PR!

Could you also update the information about which version or commit of PyTables the filter comes from: https://github.com/silx-kit/hdf5plugin/blob/bd82712ba6146f6f973663c64262e81bf77bfd94/doc/information.rst?plain=1#L60

We try to keep the list of versions embedded in hdf5plugin up-to-date.

ivilata commented 8 months ago

Sure! Updated to PyTables v3.9.2.dev0 in commit 7831797d.

ivilata commented 8 months ago

Thanks @t20100! I was going to add that I'll be doing some extra testing (via PyTables) over the next few days to check for more cases. Should I need to fix anything, I'll ping you ASAP. I hope to be able to complete this testing before a new hdf5plugin release happens!

t20100 commented 8 months ago

Hi, sure! I made a release of hdf5plugin recently, so there is no urgent need for a release. Let me know when you are done with your testing and I'll make the release.

ivilata commented 8 months ago

Hey @t20100, unfortunately I did find some issues with certain (unusual, I'd say) ways of chunking b2nd arrays which aren't read properly (see this action for instance). It looks like the solution will require depending on a yet-to-be-released version of Blosc2 (which we would publish soon). Would it be ok with you if a new PR updated the version of C-Blosc2? I could post that PR first, then another one to fix hdf5-blosc2 code. Thanks!

t20100 commented 8 months ago

Hi,

> Would it be ok with you if a new PR updated the version of C-Blosc2?

We try to keep up-to-date with upstream projects, so a PR with an update to the next release of c-blosc2 is welcome!

> I could post that PR first, then another one to fix hdf5-blosc2 code.

Sounds good to me.

ivilata commented 8 months ago

Hi @t20100, I found out that the problem is confined to PyTables-specific code for optimized array slicing and the filter code is actually not affected, so the code in this PR should be ok as is AFAIK, and hdf5plugin doesn't need to update C-Blosc2 for it to work.

Thanks again, and sorry for the noise!

t20100 commented 8 months ago

Thanks for the information! I'm updating to the latest release of c-blosc2 anyway (#283) to enable AVX512.