usnistgov / h5wasm

A WebAssembly HDF5 reader/writer library

Compression filters missing from virtual datasets #75

Closed axelboc closed 3 months ago

axelboc commented 4 months ago

When a virtual dataset points to a compressed dataset, the filter information is missing from the virtual dataset's metadata. This prevents H5Web/h5wasm from knowing that it needs to load the compression plugins before attempting to read the data. See https://github.com/silx-kit/vscode-h5web/issues/43

bmaranville commented 4 months ago

As far as I can tell, a VDS can be built from source datasets with heterogeneous metadata (different or no chunking, different compression, dtype, etc.) - so I'm not sure that a one-to-one mirroring is possible. We could probably add a way to get a list of all the source datasets, and then you could use that to decide on loading plugins?

axelboc commented 4 months ago

That seems reasonable. Do you think this could be directly included in the object returned by get_dataset_metadata?

interface Metadata {
  ...
  sourceDatasets?: { fileId: string; path: string }[]
}
bmaranville commented 4 months ago

Do you think there are ever virtual datasets with enough source datasets that this would become a serious performance bottleneck for reading metadata? In h5py, for instance, there is a method for retrieving source dataset information that is separate from reading the metadata of the virtual dataset: Dataset.virtual_sources()

axelboc commented 4 months ago

Maybe at least a count to hint at whether the dataset has virtual sources? Not sure of the performance implications...

bmaranville commented 4 months ago

After making a preliminary implementation, it looks like it only reads info from the dcpl (dataset creation property list) of the virtual dataset, so it should be really fast: it doesn't have to resolve the source datasets. I'm going to go with your first suggestion, but use file_name and dset_name to be more similar to what h5py puts out.

Note that sources within the same file seem to have file_name: "."

> f.get("data_compressed_via_vds").metadata
{
  signed: false,
  type: 1,
  vlen: false,
  littleEndian: true,
  size: 8,
  total_size: 50,
  shape: [ 50 ],
  maxshape: [ 50 ],
  chunks: null,
  virtual_sources: [ { file_name: '.', dset_name: '/data_compressed' } ]
}
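
For anyone consuming this on the viewer side, here is a rough sketch of how the virtual_sources list could be used to decide which plugins to load. The types are declared locally to mirror the fields in the output above (they are not imported from h5wasm), and resolveVirtualSources, collectFilterIds and the getFilterIds callback are hypothetical names made up for illustration. It assumes that file_name: "." always means "the file containing the virtual dataset", as noted above.

```typescript
// Local types mirroring the fields shown in the output above.
// Assumption: these are illustrative declarations, not h5wasm's own exports.
interface VirtualSource {
  file_name: string;
  dset_name: string;
}

interface DatasetMetadataLike {
  chunks: number[] | null;
  virtual_sources?: VirtualSource[];
}

interface ResolvedSource {
  file: string;
  path: string;
}

// Hypothetical helper: replace "." with the name of the file that holds the
// virtual dataset, and drop repeated (file, path) pairs so that each source
// dataset is only inspected once.
function resolveVirtualSources(
  metadata: DatasetMetadataLike,
  currentFile: string
): ResolvedSource[] {
  const seen = new Set<string>();
  const resolved: ResolvedSource[] = [];
  for (const { file_name, dset_name } of metadata.virtual_sources ?? []) {
    const file = file_name === "." ? currentFile : file_name;
    const key = `${file}\u0000${dset_name}`;
    if (!seen.has(key)) {
      seen.add(key);
      resolved.push({ file, path: dset_name });
    }
  }
  return resolved;
}

// Hypothetical helper: gather the distinct filter ids used by the sources.
// `getFilterIds` is supplied by the caller (for example, by opening each
// source dataset and reading its filter list); it is an assumption here.
function collectFilterIds(
  sources: ResolvedSource[],
  getFilterIds: (source: ResolvedSource) => number[]
): number[] {
  const ids = new Set<number>();
  for (const source of sources) {
    for (const id of getFilterIds(source)) {
      ids.add(id);
    }
  }
  return Array.from(ids).sort((a, b) => a - b);
}

// Usage with the metadata shown above, assuming the file is named "main.h5".
const exampleSources = resolveVirtualSources(
  {
    chunks: null,
    virtual_sources: [{ file_name: ".", dset_name: "/data_compressed" }],
  },
  "main.h5"
);
console.log(exampleSources);
```

Since the sources can have heterogeneous compression, the plugin decision has to be made per source dataset rather than from the virtual dataset's own metadata.
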
bmaranville commented 4 months ago

I think this is closed by eb296cb0db9e8cf066f2b2e6ff4429f286da112b (published just now as v0.7.5). Let me know if there are any issues!

axelboc commented 3 months ago

Looks good to me! https://github.com/silx-kit/h5web/pull/1662

Thanks for the quick turnaround, as always 😁