silx-kit / jupyterlab-h5web

A JupyterLab extension to explore and visualize HDF5 file contents. Based on https://github.com/silx-kit/h5web.
MIT License

Crashes reading a large file #71

Closed jonwright closed 2 years ago

jonwright commented 3 years ago

I am assuming this is the project behind the wonderful thing I found yesterday that lets me browse hdf5 files in jupyterlab? It looks fantastic. I wish I could figure out how to select x and y axes for a plot? I always see data versus point number. The rest of the message is a bug report for how I seem to have broken something already (sorry!):

Describe the bug

jupyterlab crashes when reading a large dataset, perhaps because of an out-of-memory error?

To Reproduce

1 - Log into jupyter-slurm.esrf.fr with one single core and the lab interface
2 - Navigate to open : /data/id11/nanoscope/blc12407/id11/CeO2_38keV/CeO2_38keV_CeO2_rotation/CeO2_38keV_CeO2_rotation.h5
3 - open dataset /1.1/measurement/eiger : it displays
4 - open dataset /1.1/measurement/fpico6 : it displays
5 - go back to /1.1/measurement/eiger : jupyterlab stops running
6 - all the other tabs and kernels appear to exit when jupyterlab fails

Expected behaviour

In the worst case, a plugin would crash without taking down all of the other kernels. Ideally it would not crash.

Is there a way to use hdf5 slice operations (maybe combined with fast histograms) so you only hold in memory what is going to be displayed on the screen (e.g. maximum data is a 2D image)? Then libhdf5 should manage the memory cache in some sensible way.

Context

Extension lists

This is based on a bit of guesswork as to what is actually running when I use jupyter-slurm:
jupyter-slurm:~ % /scisoft/users/jupyter/jupy38ubuntu/bin/jupyter labextension list
JupyterLab v2.3.1
Known labextensions:
   app dir: /home/esrf/jupyter/jupy38ubuntu/share/jupyter/lab
        @jupyter-widgets/jupyterlab-manager v2.0.0  enabled  OK
        jupyter-matplotlib v0.7.4  enabled  OK
        jupyter-threejs v2.2.0  enabled  OK
        jupyterlab-datawidgets v6.3.0  enabled  OK
        jupyterlab-h5web v0.0.10  enabled  OK
        k3d v2.9.3  enabled  OK
jupyter-slurm:~ % /scisoft/users/jupyter/jupy38ubuntu/bin/jupyter serverextension list
config dir: /home/esrf/jupyter/jupy38ubuntu/etc/jupyter
    jupyterlab_h5web  enabled 
    - Validating...
      jupyterlab_h5web  OK
    jupyterlab  enabled 
    - Validating...
      jupyterlab 2.3.1 OK
    jupyterlab_hdf  enabled 
    - Validating...
      jupyterlab_hdf 0.5.1 OK
    jupyter_nbextensions_configurator  enabled 
    - Validating...
      jupyter_nbextensions_configurator 0.4.1 OK

loichuder commented 3 years ago

Hello Jon, thanks for trying the extension and for the feedback!

Axis selection

I wish I could figure out how to select x and y axes for a plot?

Well, h5web is a "dumb" viewer: it will only display visualizations corresponding to the content of the file. It is not meant to be a visualization tool. The only way to select x and y axes for a plot would be to use an NXData group with an axes attribute, as the NeXus standard is supported by h5web.
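As an illustration of the NXData route, here is a minimal sketch that writes y versus x into a group carrying the NeXus attributes (NX_class, signal, axes), using h5py. The function name, the group name "plot" and the dataset names "x" and "y" are assumptions made for the example and are not part of h5web or of any existing helper.

```python
import h5py
import numpy as np


def write_nxdata(path, x, y, group_name="plot"):
    """Write y versus x as an NXdata group.

    The group and dataset names are illustrative choices, not a convention
    required by h5web.
    """
    with h5py.File(path, "w") as f:
        g = f.create_group(group_name)
        # Attributes defined by the NeXus NXdata convention
        g.attrs["NX_class"] = "NXdata"
        g.attrs["signal"] = "y"  # name of the dataset to plot
        # One axis name per dimension of the signal, stored as HDF5 strings
        g.attrs["axes"] = np.array(["x"], dtype=h5py.string_dtype())
        g.create_dataset("x", data=x)
        g.create_dataset("y", data=y)
```

Calling write_nxdata("example.h5", x, y) with two 1D arrays of equal length and opening the resulting file should then give a viewer the information it needs to plot y against x rather than against the point number.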

Reason for the crash when reading a large dataset

This is due to a limitation in the Line visualisation: we have a feature (auto-scale off) where the axis limits are set to the limits of the full dataset. As a consequence, when using the Line, h5web fetches the full dataset. In this case, I believe this is around 256 GB (:scream:) making the whole Jupyter server crash. I still need to investigate the exact reason.
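To give a feeling for the orders of magnitude involved, the sketch below compares the uncompressed size of a full image stack with the size of a single frame. The shape and the item size are hypothetical values chosen for the example; they are not the actual dimensions of the dataset in this report.

```python
from math import prod


def nbytes(shape, itemsize):
    """Uncompressed in-memory size, in bytes, of an array with this shape."""
    return prod(shape) * itemsize


# Hypothetical stack of 10 000 frames of 2048 x 2048 pixels, 4 bytes per pixel.
# These numbers are NOT taken from the file discussed in this issue.
shape = (10_000, 2048, 2048)
itemsize = 4

full = nbytes(shape, itemsize)
one_frame = nbytes(shape[1:], itemsize)

print(f"full dataset: {full / 2**30:.1f} GiB")
print(f"one frame:    {one_frame / 2**20:.1f} MiB")
```

Fetching one frame stays in the megabyte range, while fetching the whole stack requires far more memory than a single-core session is likely to have.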

Note that the Heatmap does not suffer from this limitation: it only fetches the slice. This is why the first display of /1.1/measurement/eiger works. It is the switch to the 1D dataset /1.1/measurement/fpico6 that makes h5web switch to the Line visualisation when coming back to /1.1/measurement/eiger.

What is next, then?

Is there a way to use hdf5 slice operations (maybe combined with fast histograms) so you only hold in memory what is going to be displayed on the screen (e.g. maximum data is a 2D image)?

It would indeed make sense to fetch only the slice even for a Line visualization. The auto-scale feature is a serious limitation for large datasets, and we need to find a way to work around it.
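For illustration only, here is how the two ideas look on the Python side with h5py: reading a single line through an HDF5 slice, and computing the limits of the full dataset one slab at a time so that only one slab is in memory at any moment. This is a sketch under the assumption that the dataset has at least one dimension and is not empty; the function names are made up for the example and this is not how h5web or its backend implements it.

```python
import h5py
import numpy as np


def read_line_slice(path, dataset, index):
    """Read dataset[index], e.g. index=(i, j) for one line of a 3D stack.

    h5py translates the indexing into an HDF5 selection, so only the
    requested part of the dataset is read from disk.
    """
    with h5py.File(path, "r") as f:
        return f[dataset][index]


def streamed_min_max(path, dataset):
    """Min and max of the whole dataset, reading one slab along axis 0 at a time."""
    lo, hi = np.inf, -np.inf
    with h5py.File(path, "r") as f:
        ds = f[dataset]
        for i in range(ds.shape[0]):
            slab = ds[i]  # only this slab is held in memory
            lo = min(lo, slab.min())
            hi = max(hi, slab.max())
    return lo, hi
```

The second function still has to read every value once, so it trades memory for time; caching the result or computing it on the server would be a separate design decision.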

We have an issue in h5web where we track our ideas and improvements to fetch large datasets: https://github.com/silx-kit/h5web/issues/616. The discussion about the auto-scale will surely continue there and any implementation fixing the crash will be mentioned there.

In the meantime, use the Heatmap? :sweat_smile:

andygotz commented 3 years ago

@jonwright thanks for the +ve feedback.

@loichuder thanks for the explanations. It seems like we are missing a tool to do flexible viewing of NeXus files, i.e. selecting what to display against what. Am I right to say that users have to build their own tool with a mixture of h5py and matplotlib for now? Does Braggy address this?

axelboc commented 3 years ago

This is outside of the scope of Braggy, for sure. It's always possible to make a new GUI, but note that a solution to this problem is to generate a NeXus-compliant HDF5 file with external links to the relevant datasets, and then open this file in H5Web. Obviously not as practical as a GUI, but we could easily provide Python utilities to make generating this sort of file a breeze (perhaps these utilities already exist, even).
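A utility of this kind could look like the sketch below, which creates a small wrapper file whose NXdata group points to existing datasets through HDF5 external links, using h5py.ExternalLink. The function name, its parameters and the names "plot", "signal" and "axis" are hypothetical; this is not an existing helper from h5web or silx.

```python
import h5py
import numpy as np


def link_as_nxdata(nexus_path, source_path, signal_path, axis_path, group_name="plot"):
    """Create a wrapper file with an NXdata group linking to existing datasets.

    No data is copied: the wrapper only stores links, so source_path must
    remain reachable from the location of the wrapper file.
    """
    with h5py.File(nexus_path, "w") as f:
        g = f.create_group(group_name)
        g.attrs["NX_class"] = "NXdata"
        g.attrs["signal"] = "signal"
        g.attrs["axes"] = np.array(["axis"], dtype=h5py.string_dtype())
        # External links are resolved when the wrapper file is read
        g["signal"] = h5py.ExternalLink(source_path, signal_path)
        g["axis"] = h5py.ExternalLink(source_path, axis_path)
```

For instance, link_as_nxdata("view.h5", "scan.h5", "/1.1/measurement/b", "/1.1/measurement/a") would describe a plot of b against a, where "scan.h5" and the dataset names a and b are placeholders for the user's own file and datasets.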

t20100 commented 3 years ago

There are already some helpers to save NXData: nexusformat or silx.io.nxdata.save_NXdata.

Otherwise, since this runs in a notebook, matplotlib or any other plotting library is probably best suited for tailored plots if the data is not saved as NXData.

BTW, in silx view, there is a feature to create "virtual NXData" by dragging and dropping datasets as signal and axes, but to me it is a bit complex since one needs to know about NeXus to use it.

loichuder commented 2 years ago

Following up on the crash issue, we have something in the works to solve it: https://github.com/silx-kit/h5web/issues/616#issuecomment-982734122

I will close this once this is shipped in a jupyterlab-h5web release.

loichuder commented 2 years ago

https://github.com/silx-kit/h5web/issues/616#issuecomment-982734122 was integrated in v0.1.0, which is now deployed in jupyter-slurm.