usnistgov / h5wasm

A WebAssembly HDF5 reader/writer library

Support for URLs (like in jsfive) and compound data types #2

Closed PrafulB closed 2 years ago

PrafulB commented 2 years ago

Hi, thanks so much for your work on both jsfive and this library!

I'm trying to read the data in a remote HDF5 file (https://ndownloader.figshare.com/files/7024985), but I wasn't able to do it with jsfive since the file contains compound datatypes. From https://github.com/usnistgov/jsfive/issues/22, it seems that compound types are unsupported there and might not be for some time. Does h5wasm support reading such data?

Also, since I'm writing code for the browser, I don't have local files and instead want to implement URL-based access. This was supported seamlessly by jsfive, but I can't find a way to do this (pass a URL or an ArrayBuffer) in h5wasm. Is this possible?

bmaranville commented 2 years ago

You're right, I haven't implemented direct ArrayBuffer loading in h5wasm, but the README gives an example of loading a remote file into h5wasm. Here is the same example, with your URL:

import * as hdf5 from "https://cdn.jsdelivr.net/npm/h5wasm@latest/dist/hdf5_hl.js";

let response = await fetch("https://ndownloader.figshare.com/files/7024985");
let ab = await response.arrayBuffer();

hdf5.FS.writeFile("myfile.h5", new Uint8Array(ab));

// use mode "r" for reading.  All modes can be found in hdf5.ACCESS_MODES
let f = new hdf5.File("myfile.h5", "r");

Note that you could also load with mode 'a' if you wanted to modify the file, and you could retrieve the modified file with

f.flush();
let filedata = hdf5.FS.readFile("myfile.h5");
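
If you then want to hand the modified bytes back to the user from the browser, a minimal sketch using standard Blob/download APIs (not part of h5wasm) might look like:

// Sketch: offer the modified file as a download (plain browser APIs, not h5wasm)
const blob = new Blob([filedata], { type: "application/x-hdf5" });
const link = document.createElement("a");
link.href = URL.createObjectURL(blob);
link.download = "myfile.h5";
link.click();
URL.revokeObjectURL(link.href);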

Remember to close your files when you're done working with them, and delete them from the emulated filesystem; otherwise they will slowly use up all of your available storage space (memory), e.g.

f.close();
hdf5.Module.FS.unlink("myfile.h5");
PrafulB commented 2 years ago

I did see that, and I guess I should've tried it first, but I assumed that because the buffer was being written to "myfile.h5", it was saving the file locally. Is that not the case?

Trying it myself nonetheless.

bmaranville commented 2 years ago

Emscripten makes an emulated filesystem available in the browser for WebAssembly projects, and that's what you have to use with h5wasm.
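
For example (a small sketch, assuming hdf5.FS exposes Emscripten's standard FS calls, as in the example above), you can confirm the file lives only in that in-memory filesystem:

// the bytes written with writeFile go to Emscripten's in-memory filesystem (MEMFS),
// not to the user's disk
hdf5.FS.writeFile("myfile.h5", new Uint8Array(ab));
console.log(hdf5.FS.readdir("/"));   // listing includes "myfile.h5"
hdf5.FS.unlink("myfile.h5");         // removes it and frees the memory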

PrafulB commented 2 years ago

Ah, got it. Just tried with writeFile, works well. Really appreciate the help, @bmaranville, thanks again!

bmaranville commented 2 years ago

I didn't respond to the other part of your question, though - h5wasm will read compound datasets, but there is no processing provided at the moment (it returns raw bytes). You can still do slicing, so dataset.slice([[0,1]]) will return the raw bytes of the first compound element (including all fields), as a Uint8Array.
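
For illustration only, one way to pick values out of those raw bytes by hand is with a DataView; the field types and offsets below are hypothetical and would need to match your file's actual compound layout:

// Hypothetical layout: float64 at offset 0, int32 at offset 8 (little-endian)
const raw = dataset.slice([[0, 1]]);   // Uint8Array with the first element's bytes
const view = new DataView(raw.buffer, raw.byteOffset, raw.byteLength);
const field0 = view.getFloat64(0, true);
const field1 = view.getInt32(8, true);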

I didn't write any processing for the Compound type because it is unclear to me how it would be used - the items are stored as "rows", so it is much less efficient to access if, e.g., you wanted to retrieve the first field of all elements, compared to storing the fields as separate datasets. On the other hand, a relatively slow function could be written to automatically process individual elements into an Array of decoded (mixed) native JS types.

How are you planning to use these compound datasets?

PrafulB commented 2 years ago

For now, I'm only doing some exploratory analysis to see if AI modeling on HDF5 files is possible in the browser using TensorFlow.js. I was planning to extract the compound data from the file one dataset at a time, write it to TF Tensors, and train on it (most likely using only a few fields). I was able to extract and model numeric data from simpler HDF5 files using jsfive, so I wanted to see if multi-modal data could also be used. The fact that I can get to the data, even as bytes, seems promising enough for now, though a preprocessed result would be great.

That said, I do see the problem with designing the implementation. I'm definitely not well-versed in the HDF5 spec, but maybe a good starting point would be to extract datasets (or slices) into a CSV string directly? Based on their use case, the user could then convert it to JSON themselves or send it to something like Danfo.js for more complex querying. This would support my use case of training on individual datasets perfectly, at least at first glance. It also seems quite common for HDF5 files to be read into Pandas DataFrames via h5py, so this might not be too far off convention either.
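
A rough sketch of the CSV idea (assuming the compound elements have already been decoded into rows of native JS values, e.g. by hand as in the DataView sketch above or by library-side processing; the field names below are made up):

// turn decoded rows of mixed values into a CSV string
function rowsToCsv(rows, header) {
  const escape = (v) => (typeof v === "string" && v.includes(",")) ? `"${v}"` : String(v);
  const body = rows.map((row) => row.map(escape).join(",")).join("\n");
  return header ? header.join(",") + "\n" + body : body;
}
// const csv = rowsToCsv(decodedRows, ["timestamp", "numPts", "mean"]);  // hypothetical field names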

This is probably out of the scope of this issue, but just out of curiosity: how does/would h5wasm support larger files, i.e., ones that cannot possibly fit in memory all at once and thus cannot be fetched or written with writeFile() in their entirety? Is the plan to use some form of dynamic streaming/range-request-based fetching to get relevant data from such files on demand, or is this not on the agenda currently?

Hope that helped somewhat. Thanks for the clarification on the slicing btw, will try it asap!

bmaranville commented 2 years ago

I added processing for compound datatypes in a new version, 0.1.8 (on npmjs.com). It is recursive and has a lot of overhead compared to reading large datasets of e.g. Float64Array, but it seems to work for the sample file you provided:

let t = f.get('Domain_03/OSBS/min_1/boom_1/temperature');
t.slice([[0,4]])
/*
  Array(4) [ (8) […], (8) […], (8) […], (8) […] ]
  0: Array(8) [ "2014-04-01 00:00:00.0", 60, 15.061538467566288, … ]
  1: Array(8) [ "2014-04-01 00:01:00.0", 60, 14.998577866382611, … ]
  2: Array(8) [ "2014-04-01 00:02:00.0", 60, 15.262312876008354, … ]
  3: Array(8) [ "2014-04-01 00:03:00.0", 60, 15.453513600198388, … ]
  length: 4
  <prototype>: Array []
*/
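
A small follow-up sketch: since each decoded element is a row (Array) of mixed values, pulling one field out across rows into a typed array (e.g. for feeding TensorFlow.js) could look like this; the field index 2 is an assumption based on the sample output above, so check your file's compound member order:

const rows = t.slice([[0, 1000]]);                        // array of decoded rows
const means = Float64Array.from(rows, (row) => row[2]);   // hypothetical: third field of each row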

As for very large files that can't fit in memory, I have not really planned to support those. One would have to tinker with the filesystem emulator, I think, so that seek operations trigger a cache lookup and, if needed, a retrieval via HTTP range request. This would be quite a big job (and might honestly be easier in jsfive, since there's no filesystem to deal with).
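
One possible (untested) avenue is Emscripten's FS.createLazyFile, which backs a file with on-demand XHR reads (chunked via HTTP Range requests when the server supports them); whether h5wasm's build exposes it, and whether HDF5 access patterns are efficient over it, is an open question. A minimal sketch with a placeholder URL:

// Untested sketch: lazy, range-request-backed file in the Emscripten filesystem.
// Chunked access relies on synchronous XHR, so this would likely need to run in a Web Worker.
hdf5.FS.createLazyFile("/", "remote.h5", "https://example.com/big_file.h5", true, false);
let f = new hdf5.File("remote.h5", "r");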

On the other hand, The HDF Group supports a system called HSDS for handling very large datasets over the network, where chunks are retrieved on demand, with smart caching, etc. I would have a look at that if I were you.

PrafulB commented 2 years ago

Wow, that was quick! Thanks a lot @bmaranville, I can confirm it works great! Not sure if you want to close this issue or if you'd like to use it as a tracker for adding parsing optimizations. Feel free either way; my issue seems to have been resolved.

Appreciate the info on handling larger files as well! I have seen HSDS before but was really hoping to avoid a backend-based solution as far as possible. That said, I can understand the Emscripten filesystem limitation. Hope some workaround can be found for client-side access eventually, since a lot of useful HDF5 files in my experience are too large to fit in memory 😅. Looking forward eagerly to further developments in this library!

bmaranville commented 2 years ago

resolved by 1d96d16b52371f1900bef9f119d13cd62a8db004