usnistgov / h5wasm

A WebAssembly HDF5 reader/writer library

Unexpected HTTP requests with createLazyFileLRU #47

Closed: Carnageous closed this 1 year ago

Carnageous commented 1 year ago

Hi! We are using your (awesome) LazyFileLRU implementation to load specific datasets from large HDF5 files on remote servers. This generally works fine, and features like chunking and the LRU options are a great benefit for us!

One thing we noticed is that when requesting small datasets with a larger chunk size, many more HTTP requests are often made than expected. For example, when requesting a dataset that is only 4 bytes in size with a chunk size of 50 kB, we sometimes see 6 or more HTTP requests being made.

You can reproduce this with the hosted version of lazyFileLRU you provided by requesting a very small dataset like "80.0/definition" with a big chunk size (for example 1 MB); you should see something like 5 requests being made. We assumed that a maximum of 2 requests would be made here: one to figure out where the search for the dataset should start, and one to actually retrieve it. This can also be reproduced with chunk sizes closer to the actual dataset size. I assume that, in this example, the dataset along with all of its attributes and metadata should be below a kilobyte in size. A rough sketch of how we counted the requests is below.
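We counted requests with a wrapper like the following. It is only a sketch: we were not sure whether the lazy file reader issues its range reads through fetch() or XMLHttpRequest, so we wrap both.

```ts
// Count every outgoing request. We wrap both fetch() and
// XMLHttpRequest because we are unsure which transport the
// lazy file reader actually uses.
let requestCount = 0;

const realFetch = globalThis.fetch.bind(globalThis);
globalThis.fetch = (input: RequestInfo | URL, init?: RequestInit) => {
  requestCount += 1;
  return realFetch(input, init);
};

const realOpen = XMLHttpRequest.prototype.open;
XMLHttpRequest.prototype.open = function (
  this: XMLHttpRequest,
  ...args: any[]
) {
  requestCount += 1;
  return realOpen.apply(this, args as Parameters<typeof realOpen>);
};

// After opening the file and reading the tiny dataset with h5wasm,
// requestCount is often 5 or more instead of the 2 we expected.
```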

My main question is: what could be the reason for these additional requests?

We would be super grateful if you could spare some of your time to help us understand how the fetching of remote data actually works, or point us to some documentation on this.

Another, unrelated question: could we modify the implementation to load different datasets with different chunk sizes? We know the rough size of a dataset before loading it, so adjusting the chunk size per request would be very valuable to us.

Thank you for the support you have provided in the past! If you can help and need more information, please let me know; I can provide a reproduction example as well.

bmaranville commented 1 year ago

The lookup of a dataset is more complicated than just reading a single address from a header, as you've noticed. Much of the structure of an HDF5 file is stored in "messages" associated with a group; these have to be retrieved themselves before the header information can be reconstructed, and they are not always stored contiguously. Some structural elements are stored in address B-trees, just like the data chunks are. So the lookups begin to add up: one read for the address of the messages, then reads for all the messages, then navigating the B-tree of chunked datasets... If you want to follow all the reads, you could add some logging to the code in https://github.com/usnistgov/jsfive, which uses plain JavaScript to navigate the structure of an HDF5 file.
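One way to make that fan-out visible is to drive the parsing through a minimal ranged reader that logs every GET it performs. This is just a sketch of the idea, not the reader h5wasm or jsfive actually uses:

```ts
// Minimal byte-range reader that logs each HTTP request it makes.
// Driving an HDF5 parser through a reader like this shows each hop of
// the lookup (superblock, object header, header messages, chunk B-tree
// nodes, and finally the data chunk itself) as its own ranged GET
// whenever the corresponding bytes are not already cached.
class LoggingRangeReader {
  constructor(private url: string) {}

  async read(offset: number, length: number): Promise<ArrayBuffer> {
    const end = offset + length - 1;
    console.log(`GET ${this.url} Range: bytes=${offset}-${end}`);
    const response = await fetch(this.url, {
      headers: { Range: `bytes=${offset}-${end}` },
    });
    return response.arrayBuffer();
  }
}
```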

I'm aware of one other user who, because of this, went to the trouble of pre-indexing their files to pull out the address offsets of the datasets of interest (storing the addresses in a separate database or record). I think they had contiguous datasets, so that address + data length + data type was sufficient to start working with the data.
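A sketch of what that pre-indexed path could look like, assuming the offsets, lengths, and dtypes were extracted offline (for contiguous datasets, h5py's dataset.id.get_offset() can supply the offset) and stored in a lookup table. The path and numbers here are hypothetical:

```ts
// Hypothetical pre-built index: dataset path -> raw data location.
interface DatasetIndex {
  offset: number; // absolute byte offset of the raw data in the file
  length: number; // size of the raw data in bytes
  dtype: "float64" | "int32"; // enough type info to decode it
}

const index: Record<string, DatasetIndex> = {
  // hypothetical values for illustration only
  "80.0/definition": { offset: 123456, length: 4, dtype: "int32" },
};

// One ranged GET replaces the whole metadata walk.
async function readIndexed(url: string, path: string) {
  const entry = index[path];
  if (!entry) throw new Error(`no index entry for ${path}`);
  const { offset, length, dtype } = entry;
  const response = await fetch(url, {
    headers: { Range: `bytes=${offset}-${offset + length - 1}` },
  });
  const buf = await response.arrayBuffer();
  return dtype === "float64" ? new Float64Array(buf) : new Int32Array(buf);
}
```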

For your other question, that might be tricky: the reader keeps track of which blocks you need by assuming a constant block size, so that lookups are just a multiply or divide operation. If blocks can be different sizes, you need a map of block start and stop values, and a search function. That would be a little slower, but compared to the IO operations it wouldn't really have any impact on performance.
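For what it's worth, that map could be as small as a sorted array of block start offsets plus a binary search. A sketch (not the current lazyFileLRU code):

```ts
// starts: sorted block start offsets with starts[0] === 0 and a final
// sentinel equal to the file size, so block i covers bytes
// [starts[i], starts[i + 1]).
function blockForOffset(starts: number[], offset: number): number {
  let lo = 0;
  let hi = starts.length - 2; // index of the last real block
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1; // round up so lo always advances
    if (starts[mid] <= offset) lo = mid;
    else hi = mid - 1;
  }
  return lo;
}

// e.g. with blocks of 1 kB, 3 kB, and 4 kB:
// blockForOffset([0, 1024, 4096, 8192], 2000) === 1
```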