Experimental format for storing HDF5 data as JSON, with or without array chunk metadata
h5tojson provides the option to leave out the array chunk metadata for faster caching and processing of the non-array metadata.
h5tojson is like kerchunk but with human-readable metadata (names, attributes, links, etc.) separated from array data so that the array chunk references do not need to be downloaded if not needed.
In addition:

1. In the kerchunk format, the metadata and chunk references for each dataset are stored under keys such as `"acquisition/ElectricalSeries/data/.zarray"`, `"acquisition/ElectricalSeries/data/.zattrs"`, and `"acquisition/ElectricalSeries/data/0.0"`. The `.zarray` and `.zattrs` values are JSON-encoded strings that must be parsed to get the array shape, dtype, and attributes. This format makes those values difficult to parse or query without custom code to decode the JSON. In the h5tojson format, those values are stored as JSON.
2. In the kerchunk format, all keys are stored flat under a single `"refs"` key, despite the keys holding a hierarchical data format. This makes it difficult to query the data without custom code to parse the keys and find one key based on a different key that matches a query. For example, finding the key that represents a dataset whose parent group has a particular attribute requires querying the parent group name and then building the key from the parent group name and the dataset name. In the h5tojson format, the keys are stored as a hierarchical structure, so that there is a common parent group between the attribute and the dataset (see the illustrative sketch below).

Related tools and approaches:
- HDMF can map HDF5 files into builders.
- HDF5Zarr can also translate HDF5 files into Zarr stores.
- Another, very verbose/detailed way to represent the HDF5 file as JSON is hdf5-json: https://hdf5-json.readthedocs.io/en/latest/examples/tall.html
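To make the difference concrete, here is a minimal, illustrative sketch of the query described above (find a dataset whose parent group has a particular attribute) in each layout. The JSON fragments are simplified stand-ins, not the exact kerchunk or h5tojson schemas.

```python
import json

# Simplified stand-in for a kerchunk-style reference file: flat keys under "refs",
# with .zattrs metadata stored as a JSON-encoded string.
kerchunk_like = {
    "refs": {
        "acquisition/ElectricalSeries/.zattrs": json.dumps({"neurodata_type": "ElectricalSeries"}),
        "acquisition/ElectricalSeries/data/.zarray": json.dumps({"shape": [1000, 32], "dtype": "<f4"}),
        "acquisition/ElectricalSeries/data/0.0": ["s3://example-bucket/file.nwb", 1024, 4096],
    }
}

# The flat keys must be parsed as strings and the JSON-encoded values decoded
# to find the dataset whose parent group has the attribute of interest.
matches = []
for key, value in kerchunk_like["refs"].items():
    if key.endswith("/.zattrs") and json.loads(value).get("neurodata_type") == "ElectricalSeries":
        group_path = key[: -len("/.zattrs")]
        matches.append(group_path + "/data")
print(matches)  # ['acquisition/ElectricalSeries/data']

# Simplified stand-in for a hierarchical h5tojson-style file: groups, attributes,
# and datasets are nested, so the attribute and the dataset share a parent object.
h5tojson_like = {
    "groups": {
        "acquisition": {
            "groups": {
                "ElectricalSeries": {
                    "attributes": {"neurodata_type": "ElectricalSeries"},
                    "datasets": {"data": {"shape": [1000, 32], "dtype": "float32"}},
                }
            }
        }
    }
}

# The same query is a walk over nested dicts; no key-string parsing or double JSON decoding.
series = h5tojson_like["groups"]["acquisition"]["groups"]["ElectricalSeries"]
if series["attributes"].get("neurodata_type") == "ElectricalSeries":
    print(series["datasets"]["data"])  # {'shape': [1000, 32], 'dtype': 'float32'}
```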
Install:

```bash
git clone https://github.com/rly/h5tojson
cd h5tojson
mamba create -n h5tojson python=3.11 --yes
mamba activate h5tojson
pip install -e ".[dev]"
```
Optional: Install `pre-commit`, which runs several basic checks as well as `ruff`, `isort`, `black`, `interrogate`, and `codespell`.

```bash
pip install pre-commit
pre-commit install
pre-commit run
```
Run tests and other dev checks individually:

```bash
pytest
black .
ruff .
codespell .
interrogate .
mypy .
isort .
```
To use notebooks, install `jupyterlab`. Also `pip install dandi` for an API to access the S3 URLs of NWB HDF5 files.

```python
from dandi.dandiapi import DandiAPIClient
from h5tojson import H5ToJson
import os

# Get the S3 URL of a particular NWB HDF5 file from Dandiset 000049
dandiset_id = "000049"  # ephys dataset from the Svoboda Lab
subject_id = "sub-661968859"
file_name = "sub-661968859_ses-681698752_behavior+ophys.nwb"
with DandiAPIClient() as client:
    path = f"{subject_id}/{file_name}"
    asset = client.get_dandiset(dandiset_id).get_asset_by_path(path)
    s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)

# Create an output directory and set the output JSON path
output_dir = f"test_output/{dandiset_id}/{subject_id}"
os.makedirs(output_dir, exist_ok=True)
json_path = f"{output_dir}/sub-661968859_ses-681698752_behavior+ophys.nwb.json"

# Create the H5ToJson translator object and run it
translator = H5ToJson(s3_url, json_path)
translator.translate()

# Translate the same file, but save the DfOverF/data dataset as an individual HDF5 file
translator = H5ToJson(
    s3_url, json_path, datasets_as_hdf5=["/processing/brain_observatory_pipeline/Fluorescence/DfOverF/data"]
)
translator.translate()
```
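The translated output is plain JSON, so it can be inspected with the standard library. A minimal sketch that reuses `json_path` from the example above and makes no assumptions about the internal layout beyond it being a JSON object:

```python
import json

# Load the translated JSON and list its top-level keys.
with open(json_path) as f:
    translated = json.load(f)
print(list(translated.keys()))
```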
This package supports taking an existing HDF5 file and translating it into:

1. a JSON file without any dataset data (`skip_all_dataset_data == True`)
2. a JSON file with the array chunk references stored in a separate JSON file (`chunk_refs_file_path`)
3. a JSON file with small datasets stored inline as JSON (`dataset_inline_max_bytes`, `object_dataset_inline_max_bytes`, `compound_dtype_dataset_inline_max_bytes`)
4. a JSON file with selected datasets saved as individual HDF5 files (`datasets_as_hdf5`)
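Assuming these options are keyword arguments to `H5ToJson` (as `datasets_as_hdf5` is in the example above), a rough sketch that reuses `s3_url`, `json_path`, and `output_dir` from that example; the separate chunk-reference path and the byte limits below are arbitrary example values, not defaults:

```python
from h5tojson import H5ToJson

# Keep the array chunk references, but write them to a separate JSON file,
# and inline small datasets up to roughly 10 kB each.
translator = H5ToJson(
    s3_url,
    json_path,
    chunk_refs_file_path=f"{output_dir}/chunk_refs.json",
    dataset_inline_max_bytes=10_000,
    object_dataset_inline_max_bytes=10_000,
    compound_dtype_dataset_inline_max_bytes=10_000,
)
translator.translate()
```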
Answering queries across many dandisets:
- Run `scrape_dandi.py` to generate one JSON file for one NWB file from each dandiset.
- Then see `queries.ipynb` for example code on how to run some of the above queries using those JSON files.
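`queries.ipynb` contains the real examples; the sketch below only illustrates the general pattern of loading many translated JSON files and filtering them. The output directory, the `"attributes"` key, and the attribute searched for are assumptions made for illustration, not the actual h5tojson schema or the layout produced by `scrape_dandi.py`.

```python
import glob
import json

def iter_objects(node):
    """Recursively yield every dict in a parsed JSON tree that has an "attributes" key."""
    if isinstance(node, dict):
        if "attributes" in node:
            yield node
        for value in node.values():
            yield from iter_objects(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_objects(value)

# Count how many scraped files contain an object with a particular attribute value.
# The glob pattern and the attribute name/value are placeholder assumptions.
count = 0
for path in glob.glob("scraped/**/*.json", recursive=True):
    with open(path) as f:
        data = json.load(f)
    if any(
        isinstance(obj["attributes"], dict)
        and obj["attributes"].get("neurodata_type") == "ElectricalSeries"
        for obj in iter_objects(data)
    ):
        count += 1
print(f"{count} files contain an ElectricalSeries")
```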