Support viewing h5 files that are written with Pandas

silx-kit / vscode-h5web

VSCode extension to explore and visualize HDF5 files

https://marketplace.visualstudio.com/items?itemName=h5web.vscode-h5web

MIT License


Support viewing h5 files that are written with Pandas #30

Closed dflatow closed 1 year ago

dflatow commented 1 year ago

Is your feature request related to a problem?

I'm writing h5 files via pandas (version 2.1.0). Something simple like this: df.to_hdf(key=key, path_or_buf=path, append=True)

When I go to view the data I get the following:

Requested solution or feature

Would be great to be able to visualize the data. I'm not sure if this is a bug or a feature request.

Alternatives you've considered

I don't see any alternative VS Code plugins to view h5 files.

axelboc commented 1 year ago

Hi @dflatow, thanks for the report.

The Matrix visualisation is capable of displaying datasets with a Compound dtype, as long as every field has a "printable" dtype, which H5Web defines as the following: integer, unsigned integer, float, string, boolean, complex. For a demonstration, you can take a look at dataset /nD_datasets/oneD_compound on H5Web's demo site.
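The "printable" rule above can be approximated in Python with NumPy dtype kinds. This is only a sketch of the idea, not H5Web's implementation: the `PRINTABLE_KINDS` set and the `non_printable_fields` helper are hypothetical names, and mapping H5Web's dtype categories onto NumPy kind codes is an assumption.

```python
import numpy as np

# Assumption: NumPy kind codes standing in for H5Web's printable dtypes
# (integer, unsigned integer, float, byte/unicode string, boolean, complex).
PRINTABLE_KINDS = {"i", "u", "f", "S", "U", "b", "c"}


def non_printable_fields(dtype):
    """Return the names of the compound fields whose dtype is not printable."""
    if dtype.names is None:
        raise ValueError("expected a compound (structured) dtype")
    return [name for name in dtype.names if dtype[name].kind not in PRINTABLE_KINDS]


# Every field is a plain scalar type, so nothing is flagged.
scalar_fields = np.dtype([("name", "S10"), ("age", "i8"), ("weight", "f4")])

# An array-valued field has kind "V" (void) in NumPy and is flagged.
array_field = np.dtype([("index", "i8"), ("values", "f8", (2,))])

print(non_printable_fields(scalar_fields))
print(non_printable_fields(array_field))
```

Under this approximation, a single array-valued, nested or object field is enough to make the whole compound dataset non-printable.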

If any of the fields is not printable (which seems to be the case here), H5Web falls back to the "Raw" visualisation, which attempts to serialize the dataset to JSON. Since the dataset seems to contain big integers, JSON.stringify() throws an error:

Could you please share the raw type of the dataset (click on "Inspect" on the row labelled "Raw")? Or even better, could you share an example file?

peku33 commented 1 year ago

I found the same issue. It generally fails if format='table' is passed to pandas.HDFStore.put()

loichuder commented 1 year ago

@axelboc The issue derives from https://github.com/silx-kit/vscode-h5web/issues/15: parts of the compound dataset's value are BigInt, which cannot be serialized to JSON.

We solved #15 by converting BigInt to regular integers when encountering datasets with integer dtypes but forgot that these BigInt can show up in Compound datasets, such as the ones generated by pandas.

It can be reproduced with a file holding a single compound dataset with a field storing int64 (in this case age):

import numpy as np
import h5py

with h5py.File(...) as h5file:
    # From https://numpy.org/doc/stable/user/basics.rec.html
    h5file.create_dataset(
        "dogs",
        data=np.array(
            [("Rex", 9, 81.0), ("Fido", 3, 27.0)],
            dtype=[("name", "S10"), ("age", "i8"), ("weight", "f4")],
        ),
    )
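H5Web itself is JavaScript, but a similar failure can be shown in Python as an analogy: `json.dumps` rejects NumPy 64-bit integers much like `JSON.stringify()` rejects BigInt. The sketch below is not the H5Web fix. It uses the same compound array as the reproduction and converts each field to a built-in type before serializing, and `to_builtin` is a hypothetical helper name.

```python
import json

import numpy as np


def to_builtin(obj):
    """Fallback for json.dumps: convert NumPy values to JSON-friendly types."""
    if isinstance(obj, np.void) and obj.dtype.names:
        # One compound record: expand it into a dict keyed by field name.
        return {name: obj[name] for name in obj.dtype.names}
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, bytes):
        return obj.decode("utf-8")
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Cannot serialize object of type {type(obj).__name__}")


dogs = np.array(
    [("Rex", 9, 81.0), ("Fido", 3, 27.0)],
    dtype=[("name", "S10"), ("age", "i8"), ("weight", "f4")],
)

# Without a fallback, the int64 field cannot be serialized.
try:
    json.dumps(dogs[0]["age"])
    raised = False
except TypeError:
    raised = True

# With the fallback, every field of every record is converted recursively.
serialized = json.dumps(list(dogs), default=to_builtin)
print(raised, serialized)
```

The point of the analogy is that the conversion has to reach inside each compound record, not only datasets whose top-level dtype is an integer.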

loichuder commented 1 year ago

In the meantime, it is possible to circumvent the issue when saving with pandas: removing append=True from the call to to_hdf saves the columns as separate datasets rather than in a single compound dataset.

H5Web should not have issues viewing these separate datasets.
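The difference between the two layouts can be checked with a short script. This is a minimal sketch, assuming pandas (with PyTables installed) and h5py are available. The file names, the `"dogs"` key and the `compound_datasets` helper are made up for illustration.

```python
import os
import tempfile

import h5py
import pandas as pd


def compound_datasets(path):
    """Return the names of all datasets in the file that have a compound dtype."""
    found = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset) and obj.dtype.names is not None:
            found.append(name)

    with h5py.File(path, "r") as h5file:
        h5file.visititems(visit)
    return found


df = pd.DataFrame({"age": [9, 3], "weight": [81.0, 27.0]})

with tempfile.TemporaryDirectory() as tmpdir:
    table_path = os.path.join(tmpdir, "table.h5")
    fixed_path = os.path.join(tmpdir, "fixed.h5")

    # append=True makes pandas use the "table" format (one compound dataset).
    df.to_hdf(table_path, key="dogs", append=True)
    # Without append=True, pandas uses the default "fixed" format.
    df.to_hdf(fixed_path, key="dogs")

    table_compound = compound_datasets(table_path)
    fixed_compound = compound_datasets(fixed_path)

print("table format:", table_compound)
print("fixed format:", fixed_compound)
```

Note that a file written without `append=True` cannot be appended to later, so the workaround only suits data that is written in one go.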

axelboc commented 1 year ago

Should be fixed in the next release. I'll try to get it out asap.