hdmf-dev / hdmf

The Hierarchical Data Modeling Framework
http://hdmf.readthedocs.io

improve html representation of datasets #1100

Open h-mayorquin opened 5 months ago

h-mayorquin commented 5 months ago

Motivation

Improve the display of the data in the HTML representation of containers. Note that this PR is focused on datasets that were already written. For the in-memory representation, the question of what to do with data wrapped in an iterator or a DataIO subtype can be addressed in another PR, I think.

How to test the behavior?

HDF5

I have been using this script

from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.file import mock_NWBFile
from hdmf.backends.hdf5.h5_utils import H5DataIO
from pynwb.testing.mock.ophys import mock_ImagingPlane, mock_TwoPhotonSeries

import numpy as np

data = np.random.rand(500_000, 384)
timestamps = np.arange(500_000)
data = H5DataIO(data=data, compression=True, chunks=True)

nwbfile = mock_NWBFile()
electrical_series = mock_ElectricalSeries(data=data, nwbfile=nwbfile, rate=None, timestamps=timestamps)

imaging_plane = mock_ImagingPlane(grid_spacing=[1.0, 1.0], nwbfile=nwbfile)

data = H5DataIO(data=np.random.rand(2, 2, 2), compression=True, chunks=True)
two_photon_series = mock_TwoPhotonSeries(name="TwoPhotonSeries", imaging_plane=imaging_plane, data=data, nwbfile=nwbfile)

# Write to file
from pynwb import NWBHDF5IO
with NWBHDF5IO('ecephys_tutorial.nwb', 'w') as io:
    io.write(nwbfile)


io = NWBHDF5IO('ecephys_tutorial.nwb', 'r')
nwbfile = io.read()
nwbfile
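One thing the table-style repr needs is a human-readable size for each dataset. A minimal sketch of such a formatter (the helper name and exact unit breakpoints are my own, not necessarily what the PR implements):

```python
def human_readable_size(num_bytes: float) -> str:
    """Format a byte count with binary (1024-based) units."""
    for unit in ("bytes", "KiB", "MiB", "GiB"):
        if num_bytes < 1024:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.2f} TiB"

# e.g. the ElectricalSeries above: 500_000 x 384 float64 values
print(human_readable_size(500_000 * 384 * 8))
```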

[screenshot: HTML representation of the HDF5-backed file]

Zarr

import os

import numpy as np
from numcodecs import Blosc, Delta

from hdmf_zarr import ZarrDataIO
from hdmf_zarr.nwb import NWBZarrIO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.ecephys import mock_ElectricalSeries

filters = [Delta(dtype="i4")]

data_with_zarr_data_io = ZarrDataIO(
    data=np.arange(100000000, dtype='i4').reshape(10000, 10000),
    chunks=(1000, 1000),
    compressor=Blosc(cname='zstd', clevel=3, shuffle=Blosc.SHUFFLE),
    # filters=filters,
)

timestamps = np.arange(10000)

data = data_with_zarr_data_io

nwbfile = mock_NWBFile()
electrical_series_name = "ElectricalSeries"
electrical_series = mock_ElectricalSeries(name=electrical_series_name, data=data, nwbfile=nwbfile, timestamps=timestamps, rate=None)

path = "zarr_test.nwb.zarr"
absolute_path = os.path.abspath(path)
with NWBZarrIO(path=path, mode="w") as io:
    io.write(nwbfile)

io = NWBZarrIO(path=path, mode="r")
nwbfile = io.read()
nwbfile
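For written Zarr datasets, the underlying zarr arrays expose both the uncompressed size (`nbytes`) and the on-disk size (`nbytes_stored`), so the repr can also report a compression ratio. A sketch of such a helper (the name and the None-on-zero convention are mine), with a guard for the stored-size-is-zero case:

```python
def compression_ratio(nbytes, nbytes_stored):
    """Ratio of uncompressed to stored bytes.

    Returns None instead of dividing by zero, since the stored size
    can legitimately be 0 (e.g. data living in external files).
    """
    if nbytes_stored == 0:
        return None
    return nbytes / nbytes_stored

# usage sketch: z = zarr.open(path)["acquisition/ElectricalSeries/data"]
# compression_ratio(z.nbytes, z.nbytes_stored)
```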

[screenshot: HTML representation of the Zarr-backed file]

Checklist

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 70.37037% with 16 lines in your changes missing coverage. Please review.

Project coverage is 88.96%. Comparing base (b78625b) to head (3813723).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/hdmf/container.py | 70.37% | 10 Missing and 6 partials :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              dev    #1100      +/-   ##
==========================================
- Coverage   89.03%   88.96%   -0.07%
==========================================
  Files          45       45
  Lines        9883     9932      +49
  Branches     2813     2824      +11
==========================================
+ Hits         8799     8836      +37
- Misses        767      774       +7
- Partials      317      322       +5
```


h-mayorquin commented 5 months ago

OK, I added table formatting for HDF5:

[screenshot: table-formatted HDF5 dataset repr]

h-mayorquin commented 5 months ago

@stephprince Concerning the test: yes, I can do it, but can you help me create a container that contains array data? I just don't have experience with the bare-bones object. This is my attempt:

import numpy as np
from hdmf.container import Container

container = Container(name="Container")
container.__fields__ = {
    "name": "data",
    "description": "test data",
}

test_data = np.array([1, 2, 3, 4, 5])
setattr(container, "data", test_data)
container.fields

But the data is not added as a field. How can I move forward?

h-mayorquin commented 5 months ago

Related:

https://github.com/hdmf-dev/hdmf-zarr/issues/186

h-mayorquin commented 5 months ago

I added handling for division by zero. Check out what happens with external files (like Video):

[screenshot: repr of a dataset backed by external files]

From this example:

import remfile
import h5py

# `dandiset` is assumed to have been obtained earlier via the DANDI API
# client (dandi.dandiapi); the dandiset ID is not shown in the original snippet.
asset_path = "sub-CSHL049/sub-CSHL049_ses-c99d53e6-c317-4c53-99ba-070b26673ac4_behavior+ecephys+image.nwb"
recording_asset = dandiset.get_asset_by_path(path=asset_path)
url = recording_asset.get_content_url(follow_redirects=True, strip_query=True)
file_path = url

rfile = remfile.File(file_path)
file = h5py.File(rfile, 'r')

from pynwb import NWBHDF5IO

io = NWBHDF5IO(file=file, mode='r')

nwbfile = io.read()
nwbfile

stephprince commented 5 months ago

There are still some failing tests for different Python versions. It looks like one reason is that h5py only added the .nbytes attribute in version 3.0, and we still have h5py==2.10 as a minimum version.

>       array_size_in_bytes = array.nbytes
E       AttributeError: 'Dataset' object has no attribute 'nbytes'

I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.

oruebel commented 5 months ago

I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.

Checking hasattr(data, "nbytes") to optionally display it seems reasonable to me. That way you also avoid custom behavior depending on library versions.
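The version-independent check could look like this (a sketch; the helper name is illustrative, not the PR's actual code):

```python
def get_size_in_bytes(dataset):
    """Return the dataset's in-memory size when the backend exposes it.

    h5py only gained ``Dataset.nbytes`` in version 3.0, so with the
    h5py==2.10 minimum this attribute must be treated as optional
    rather than assumed.
    """
    if hasattr(dataset, "nbytes"):
        return dataset.nbytes
    return None  # caller can then skip the size row in the HTML table
```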

h-mayorquin commented 5 months ago

@stephprince

I'm not sure if there's another way to access that information or if we would just want to optionally display it if available.

It can be estimated from the dtype and the number of elements. I will do that when the attribute does not exist.
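The fallback estimate is just element count times item size. A sketch using numpy (the function name is my own):

```python
import math

import numpy as np

def estimate_size_in_bytes(dataset):
    """Prefer the backend-provided nbytes; otherwise estimate it as
    element count * item size. Both .shape and .dtype are exposed by
    h5py 2.x datasets, numpy arrays, and zarr arrays alike."""
    if hasattr(dataset, "nbytes"):
        return dataset.nbytes
    return math.prod(dataset.shape) * np.dtype(dataset.dtype).itemsize
```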

rly commented 17 hours ago

@stephprince when you have time, can you review this?