legend-exp / legend-pydataobj

LEGEND Python Data Objects
https://legend-pydataobj.readthedocs.io
GNU General Public License v3.0

Increase read speed by x20-100 for most data #78

Closed lvarriano closed 3 days ago

lvarriano commented 6 months ago

An Array is a dictionary of attributes and an ndarray. It is written to disk as an HDF5 dataset and attributes. A Table is a collection of Array where each Array is written separately to disk. This allows flexibility in writing new objects into a Table without having to open the Table in memory.
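
For illustration, here is a minimal sketch (the file name, channel name, and column contents are hypothetical; it assumes the lgdo Table/Array constructors and the LH5Store API used later in this thread) of how one Table becomes one HDF5 group with a separate dataset per column:

import lgdo
import h5py
import numpy as np

store = lgdo.lh5.LH5Store()

# a small Table with two columns
tbl = lgdo.Table(col_dict={
    "energy": lgdo.Array(np.random.normal(1000, 10, size=100)),
    "channel": lgdo.Array(np.full(100, 1027200, dtype=np.uint32)),
})
store.write(tbl, "hit", "example.lh5", "ch000/")

# every column is its own HDF5 dataset, so reading the Table back
# requires one separate (paged) dataset read per column
with h5py.File("example.lh5", mode="r") as f:
    f["ch000/hit"].visit(print)  # prints: channel, energy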

However, this is a very poor choice for writing small datasets to disk due to the fact that each HDF5 dataset has to be paged and read separately, as they may not be written on contiguous sectors. This means that, for a Table that contains many Array, each Array must incur the overhead associated with a read even if the Array is extremely small.

In LEGEND, we access and store most of our data in Array, including at the raw, hit, evt, and skm tiers. Only the waveforms in the raw tier are arrays of large size. In particular, at the hit tier in the latest production, we store 50 Array in each germanium channel. Each Array is very small (~10 kB for a single phy file) and so this incurs significant read penalty relative to data size due to how this data is organized on disk.

If, instead, the data were stored in a single array on disk, the read speed can be increased significantly. Here are the results of trying this on a LEGEND hit tier file; the code follows below. At the hit tier, a read of all of the germanium channels goes from taking 3.6 seconds on my laptop's SSD to taking 36 ms. I stored these data in two ndarrays of single dtype, one float64 and one int32, by promoting the Array dtypes, since the impact on storage size is unimportant for such small files. I imagine that this can have a very large impact for users on a computing cluster sharing access to data files. The actual speed-up should scale as the number of Array stored in the Table (at the raw tier, I saw a speed-up of a factor of 20 for non-waveform data).

For the vast majority of physics experiments, as in LEGEND, it is the case that many small variables for an event are stored and need to be retrieved often. This means that the current behavior and storage strategy of Table significantly impacts the utility of LGDO for other uses.

@SamuelBorden and I have been working on an intermediary LGDO object to allow for storage on disk in a fixed type array way while maintaining the functionality of a Table in memory.

read normal (compressed) LH5 file   3.375365972518921
read normal uncompressed LH5 file   3.173330307006836
read uncompressed file organized by type  0.0357668399810791
(3012, 38)
read compressed file organized by type  0.20013427734375

from the following code:

import lgdo
import h5py
import time
import numpy as np

store = lgdo.lh5.LH5Store()

input_file = "/home/lv/Documents/uw/l200/l200-p06-r000-phy-20230619T074210Z-tier_pht.lh5"
input_file_uncompressed = "input_uncompressed.lh5"
output_file_bytype = "output.h5"
output_file_bytype_compressed = "output_compressed.h5"

ch_list = lgdo.lh5.ls(input_file)[2:] # skip FCConfig and OrcaHeader

geds_list = []
with h5py.File(input_file, mode='r') as f:
    for ch in ch_list:
        if 'trapEmax_ctc_cal' in f[ch]['hit'].keys():
            geds_list.append(ch)

# copy input file uncompressed
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/hit/', input_file)
    store.write(chobj, 'hit', input_file_uncompressed, f'{ch}/', compression=None)

# copy to a file organized by type
for ch in geds_list:

    chobj, _ = store.read(f'{ch}/hit/', input_file)
    keys = list(chobj.keys())

    scalarfloat64array = []
    scalarint32array = []
    for key in keys:
        data = chobj[key][:]
        if data.dtype.kind == 'f':
            scalarfloat64array.append(np.reshape(data, (len(data),1)))
        else:
            # this casts "channel" from uint32 to int32 which is not so good, but channel should not be stored as uint32 anyway!
            scalarint32array.append(np.reshape(data.astype(np.int32), (len(data),1)))

    scalarfloat64array = np.array(scalarfloat64array, dtype=np.float64).T[0]
    scalarint32array = np.array(scalarint32array, dtype=np.int32).T[0]

    with h5py.File(output_file_bytype, mode='a') as h:
        chgrp = h.require_group(f'{ch}')

        chunks = (np.shape(scalarint32array)[0], np.shape(scalarint32array)[1])
        chgrp.create_dataset('scalarint32', scalarint32array.shape, data=scalarint32array, chunks=chunks)

        chunks = (np.shape(scalarfloat64array)[0], np.shape(scalarfloat64array)[1])
        chgrp.create_dataset('scalarfloat64', scalarfloat64array.shape, data=scalarfloat64array, chunks=chunks)

    with h5py.File(output_file_bytype_compressed, mode='a') as h:
        chgrp = h.require_group(f'{ch}')

        chunks = (np.shape(scalarint32array)[0], np.shape(scalarint32array)[1])
        chgrp.create_dataset('scalarint32', scalarint32array.shape, data=scalarint32array, chunks=chunks, compression='gzip', shuffle=True)

        chunks = (np.shape(scalarfloat64array)[0], np.shape(scalarfloat64array)[1])
        chgrp.create_dataset('scalarfloat64', scalarfloat64array.shape, data=scalarfloat64array, chunks=chunks, compression='gzip', shuffle=True)

# read normal (compressed) LH5 file
start = time.time()
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/hit', input_file)
print('read normal (compressed) LH5 file  ', time.time() - start)

# read normal uncompressed LH5 file
start = time.time()
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/hit', input_file_uncompressed)
print('read normal uncompressed LH5 file  ', time.time() - start)

# read uncompressed file organized by type 
start = time.time()
with h5py.File(output_file_bytype, mode='r') as f:
    for ch in geds_list:
        data = f[ch]['scalarfloat64'][:]
        data = f[ch]['scalarint32'][:]
print('read uncompressed file organized by type ', time.time() - start)
print(np.shape(data))

# read compressed file organized by type 
start = time.time()
with h5py.File(output_file_bytype_compressed, mode='r') as f:
    for ch in geds_list:
        data = f[ch]['scalarfloat64'][:]
        data = f[ch]['scalarint32'][:]

print('read compressed file organized by type ', time.time() - start)
gipert commented 6 months ago

Thanks @lvarriano for raising this. I was indeed suspecting it!

@oschulz: this is surely something to discuss together

oschulz commented 6 months ago

I don't think writing all columns into a flat array is a good solution. For one, columns may contain different data types. Also, column entries do not always have the same size, e.g. compressed waveforms, SiPM hit lists, etc., so each row can have a different total data size. Of course this could be handled by using a vector of vectors (backed by a single flat vector) to store the rows. This would however make efficient column access impossible. Even in the "flat matrix" approach, accessing just a few columns will be slow, since it would result in highly scattered reads.

Also, this approach basically leads to writing a custom binary on-disk format. That's not a task we should take on; there are already lots of established approaches (Parquet, Arrow, the new ROOT RNTuple, etc.).

Columnar storage is the right storage model for modern computing architectures, so the entries of one column should be as contiguous on disk and in memory as possible. A single flat-array approach can't provide this.

HDF5 may not be the pinnacle of performance anymore (one hears good stuff about RNTuple, for example), but it's definitely not slow, supports memory mapping (when not using chunks) and it's very widely supported.

What is not a great choice is writing per-channel datasets, though. A single dataset for each column, spanning all channels (at least of a given subsystem), would reduce the number of datasets in our HDF5 files a lot and increase the size of each dataset. It would be much more friendly to the HDF5 caching model as well. I've suggested this from the beginning, and repeatedly, but per-channel datasets were considered more user-friendly.

lvarriano commented 6 months ago

Yes, in the solution Sam and I are working on, we store the data type of each column so that it can be converted back to the correct data type after reading from disk. From my perspective, it is really not a custom object - hdf5 supports natively (it seems) writing an array that contains different dtypes in each column and h5py can read them back correctly (I don't know about other implementations). I did not know how this would work with Julia, hence the conversion to a single-type array.

Each row in a Table must have the same number of columns. A numpy structured array can handle fields that have different types and sizes (which can be multidimensional). It stores each row as flat, and since each row has the same number of columns, a vector is not needed.
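
For reference, a plain numpy structured array (field names here are hypothetical) already behaves this way: each row is stored flat, and fields can have different dtypes and fixed multidimensional shapes:

import numpy as np

dt = np.dtype([
    ("energy", np.float64),
    ("channel", np.int32),
    ("baseline", np.float32, (2,)),  # fixed-size multidimensional field
])
rows = np.zeros(5, dtype=dt)
rows["energy"] = np.linspace(100.0, 500.0, 5)
rows["channel"] = np.arange(5)

print(rows.dtype.itemsize)  # bytes per flat row
print(rows["energy"])       # column access still works in memory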

We have seen since #29 that reading particular rows from Array in Table is extremely slow if rows are not contiguous (cannot be sliced). From #35, the default behavior of LH5Store is now to read in the entire Array before slicing. We would pay less relative penalty if our datasets were larger and then chunked better.

Certainly, we could instead make an array for a single variable that has all channels in it. I don't see a difference between making that (a single variable, all channel) array and a (single channel, all variable) array. These have all the same issues that you raise, other than that the dtype for the data is a single type and is correct on disk. In this (single variable, all channel) array, the sub-array containing the variables for a single channel would not be contiguous. We could chunk so that each chunk holds either all the events for a single channel or all the channels for a single event, but chunking by channel would carry a significant penalty for reading event-level data.

Given that our experiment adds channels as time moves forward, it does not seem like having files with datasets whose arrays have different numbers of columns would be easy to work with.

oschulz commented 6 months ago

hdf5 supports natively (it seems) writing an array that contains different dtypes in each column and h5py can read them back correctly (I don't know about other implementations)

Yes, HDF5 has a table format, but to my knowledge it's row-based storage, on disk. And I don't think it'll allow for columns that contain vectors, and so on. And I very much doubt that it can handle "ragged-arrays" (let alone nested ones, like we're starting to use now).

We have seen since #29 that reading particular rows from Array in Table is extremely slow

Yes, that's due to the way HDF5 caches datasets. I wonder if this is solvable using memory-mapped reads (does h5py support this?) and not using chunking. In general, we probably will want to write most files (except huge files with waveforms) unchunked.

I don't see a difference between making that (a single variable, all channel) array and a (single channel, all variable) array.

Well, they have a completely different on-disk storage layout, for once. And in the "single channel, all variable" version, if stored flat instead of like we do it now, access to a subset of the columns would be very inefficient.

Given that our experiment adds channels as time moves forward, it does not seem like having files with datasets whose arrays have different numbers of columns would be easy to work with.

Ah - no, the channels do not become columns, of course. Instead, there would be a single additional column for the channel ID, and the current per-channel tables would be concatenated vertically. Interleaving them (first a bunch of channel 1 events, then a bunch of channel 2 events and so on, and then again a bunch of channel 1 events) would actually be best, because it would allow doing both per-channel and across-channel waveform-level operations in an efficient manner.
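
As a rough numpy illustration of that layout (channel IDs and fields are hypothetical), the per-channel blocks are stacked vertically and the channel becomes an ordinary column, so adding or dropping channels never changes the number of columns:

import numpy as np

per_channel = {
    1027200: {"energy": np.random.normal(1000, 5, 4), "timestamp": np.arange(4.0)},
    1027201: {"energy": np.random.normal(1000, 5, 3), "timestamp": np.arange(3.0)},
}

# concatenate vertically and add an explicit channel-ID column
energy    = np.concatenate([c["energy"] for c in per_channel.values()])
timestamp = np.concatenate([c["timestamp"] for c in per_channel.values()])
channel   = np.concatenate([np.full(len(c["energy"]), ch, dtype=np.uint32)
                            for ch, c in per_channel.items()])
# interleaving would repeat this pattern in blocks (ch 1, ch 2, ..., ch 1, ch 2, ...)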

lvarriano commented 6 months ago

We have seen since #29 that reading particular rows from Array in Table is extremely slow

Yes, that's due to the way HDF5 caches datasets. I wonder if this is solvable using memory-mapped reads (does h5py support this?) and not using chunking. In general, we probably will want to write most files (except huge files with waveforms) unchunked.

Chunking is required for data compression (the whole column can be one chunk, but then you have to read in the whole thing before decompressing). It's true that we don't really need compression for anything but the waveforms, so maybe the rest doesn't need to be chunked. I think that maybe this is what h5py implements as what you are calling memory-mapped reads? https://docs.h5py.org/en/stable/high/dataset.html#multi-block-selection (I'm not familiar myself.) So I guess if we had a slightly more complicated way to store data, then this could be useful.

I don't see a difference between making that (a single variable, all channel) array and a (single channel, all variable) array.

Well, they have a completely different on-disk storage layout, for once. And in the "single channel, all variable" version, if stored flat instead of like we do it now, access to a subset of the columns would be very inefficient.

Given that storing them as such results in a speed up of x100 compared to what we do now, I would not call this "very inefficient" compared to our current strategy. It is very fast to access specific columns in memory compared to on disk, so I would expect that these have comparable times if you try to access 2-3 fields currently.

Given that our experiment adds channels as time moves forward, it does not seem like having files with datasets whose arrays have different numbers of columns would be easy to work with.

Ah - no, the channels do not become columns, of course. Instead, there would be a single additional column for the channel ID, and the current per-channel tables would be concatenated vertically. Interleaving them (first a bunch of channel 1 events, then a bunch of channel 2 events and so on, and then again a bunch of channel 1 events) would actually be best, because it would allow doing both per-channel and across-channel waveform-level operations in an efficient manner.

It sounds to me like we would need to implement the correct striding, which would then depend on the number of channels. If some channel is not present for one or a few files, or new channels are added, then the striding would change, which sounds problematic to deal with. We do almost exclusively single-channel analysis until we finally combine channels into events, so I'm not sure there would be any speed-up from combining channels into one column.

Maybe I misunderstand, though. Perhaps we can make some sketches to show any proposed data layouts - I'd like to make sure I understand and it is easy to parse the wrong thing from text.

oschulz commented 6 months ago

Chunking is required for data compression

It is for HDF5 compression, but not for user compression. I hope that longer-term, we'll get away from the waveform compression scheme we use now and move to a custom entropy coding that's optimized for our use case.

I think that maybe this is what h5py implements as what you are calling memory-mapped reads

No, memory mapped reads is this:

https://juliaio.github.io/HDF5.jl/stable/#Memory-mapping

I would hope that h5py supports this too, in some fashion?
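
For what it's worth, one way this is sometimes emulated in Python (a sketch with hypothetical file and dataset names, not an official h5py feature): ask HDF5 for the on-disk offset of a contiguous, uncompressed dataset and hand it to numpy's memmap:

import h5py
import numpy as np

fname, dset_path = "example.lh5", "ch000/hit/energy"  # hypothetical
with h5py.File(fname, mode="r") as f:
    ds = f[dset_path]
    offset = ds.id.get_offset()  # only meaningful for contiguous (unchunked) storage
    shape, dtype = ds.shape, ds.dtype

# pages are pulled in lazily, only when the data is actually touched
arr = np.memmap(fname, mode="r", dtype=dtype, shape=shape, offset=offset)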

Given that storing them as such results in a speed up of x100 compared to what we do now

I would be very surprised if we could achieve a factor-of-100 speed increase, for compressed data, compared to what we do now. We also need to consider the maximum I/O speed. I assume you're currently benchmarking on a local SSD? In actual data production we'll be limited by the network I/O of the compute nodes, which has to be divided by the number of parallel I/O operations we want to run.

It sounds to me we would need to implement the correct striding, which would then depend on the number of channels. If some channel is not present for one/a few files or new channels are added, then the striding would change

We would store channel information for each row, so it would be fully dynamic. I don't see us moving to such a scheme anytime soon though.

oschulz commented 6 months ago

If, in the end, we discover that HDF5 performance is limiting our analysis throughput substantially, and if this can't be fixed by using techniques like memory mapping, then we should move to another well-established and broadly supported file format, based on columnar storage. We should not roll our own low-level table storage format.

iguinn commented 5 months ago

So I don't think it's HDF5 that's slow, I think it's python that's slow. I added some cProfile lines to Louis's code:

...
# read normal (compressed) LH5 file
start = time.time()
with cProfile.Profile() as pr:
    for ch in geds_list:
        chobj, _ = store.read(f'{ch}/hit', input_file)
    stats = pstats.Stats(pr)
    stats.sort_stats('tottime').print_stats(10)
print('read normal (compressed) LH5 file  ', time.time() - start)

# read normal uncompressed LH5 file
start = time.time()
with cProfile.Profile() as pr:
    for ch in geds_list:
        chobj, _ = store.read(f'{ch}/hit', input_file_uncompressed)
    stats = pstats.Stats(pr)
    stats.sort_stats('tottime').print_stats(10)
print('read normal uncompressed LH5 file  ', time.time() - start)

# read uncompressed file organized by type
start = time.time()
with cProfile.Profile() as pr:
    with h5py.File(output_file_bytype, mode='r') as f:
        for ch in geds_list:
            data = f[ch]['scalarfloat64'][:]
            data = f[ch]['scalarint32'][:]
    stats = pstats.Stats(pr)
    stats.sort_stats('tottime').print_stats(10)
print('read uncompressed file organized by type ', time.time() - start)
print(np.shape(data))

# read compressed file organized by type
start = time.time()
with cProfile.Profile() as pr:
    with h5py.File(output_file_bytype_compressed, mode='r') as f:
        for ch in geds_list:
            data = f[ch]['scalarfloat64'][:]
            data = f[ch]['scalarint32'][:]
    stats = pstats.Stats(pr)
    stats.sort_stats('tottime').print_stats(10)

print('read compressed file organized by type ', time.time() - start)

Here's the output for the "normal" compressed file.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    22590    0.281    0.000    0.772    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/group.py:348(__getitem__)
  4554/90    0.199    0.000    2.031    0.023 /home/iguinn/.local/lib/python3.10/site-packages/lgdo/lh5/store.py:166(read)
     9288    0.169    0.000    0.229    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/attrs.py:52(__getitem__)
     4464    0.155    0.000    0.155    0.000 {method 'read' of 'h5py._selector.Reader' objects}
    22320    0.144    0.000    0.227    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py:636(__init__)
     4554    0.118    0.000    0.142    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/group.py:508(__contains__)
    22410    0.071    0.000    0.141    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/files.py:376(__init__)
    22320    0.071    0.000    0.076    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/filters.py:294(get_filters)
   126162    0.046    0.000    0.063    0.000 <frozen importlib._bootstrap>:1053(_handle_fromlist)
     4464    0.043    0.000    0.043    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py:522(_fast_reader)

Here's the output for the compressed file organized by type:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      180    0.125    0.001    0.125    0.001 {method 'read' of 'h5py._selector.Reader' objects}
      360    0.006    0.000    0.012    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/group.py:348(__getitem__)
      180    0.002    0.000    0.002    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py:522(_fast_reader)
      180    0.002    0.000    0.003    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py:636(__init__)
      180    0.001    0.000    0.001    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/filters.py:294(get_filters)
      181    0.001    0.000    0.001    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/files.py:376(__init__)
      180    0.001    0.000    0.001    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py:734(_fast_read_ok)
     1081    0.000    0.000    0.001    0.000 <frozen importlib._bootstrap>:1053(_handle_fromlist)
      180    0.000    0.000    0.000    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py:625(_extent_type)
      361    0.000    0.000    0.001    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/group.py:34(__init__)

Notice that the h5py._selector.Reader.read method has a similar tottime in both, but in the normal file it is eclipsed by calls to h5py.group.__getitem__, lgdo.store.read, and h5py.attrs.__getitem__. Reader.read is a cython-written function in h5py that reads the data into numpy arrays, and ideally this is what should take the longest. So the real issue is that for short datasets the python overhead is the limiting factor; the HDF5 performance is just fine.

gipert commented 5 months ago

So we just need to avoid making small files! I think we should seriously reconsider our hit data storing scheme and have all channels in a single table, at some point.

oschulz commented 5 months ago

At the event level (I think that's what you call hit level?), I do use a single table in Julia, that's worked out quite Ok so far.

iguinn commented 5 months ago

We might be able to make small files at least a bit faster. Looking at the profile, it's dominated by python dict lookups in the HDF5 groups. We might get a speedup by making the read less recursive/nested (looking up [ch12345/hit/par] rather than [ch12345][hit][par], for example). I think the nested-ness is probably necessary the first time data is loaded, but if we do in-place loading into preconstructed LGDO structs/tables we might benefit. I tested this, and it won't speed things up for now (I think there is too much recursive reading of the objects). I'm not sure how much speed-up is feasible, and this may be quite a bit of work for an unclear gain...
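
For context, the difference in question is roughly the following (hypothetical file and path):

import h5py

with h5py.File("example.lh5", mode="r") as f:
    # three Python-level lookups, each constructing an intermediate h5py Group
    par_nested = f["ch12345"]["hit"]["par"]
    # one lookup that resolves the whole path in a single call
    par_flat = f["ch12345/hit/par"]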

lvarriano commented 5 months ago

I investigated some of the suggestions for memory-mapping (using Python's mmap module and based on https://gist.github.com/maartenbreddels/09e1da79577151e5f7fec660c209f06e) and the reorganization of files I put forth. I tried to separate the timings of the h5py calls that determine dataset sizes, attributes, etc. from the actual reads and the formation of the LGDO objects. My code is attached (readtest.py), and here are the results. This is for a phy hit file. The script clears the file system page cache between reads of the different files. I saw some differences between running this in a Jupyter notebook and a python script, so I'm showing the results from the script (lh5.store was a little slower in a Jupyter notebook). This is done on my laptop which has an SSD.

output_customstore.lh5 is a file written with lh5.store but with all compression, chunking, and maxshape options turned off so that the datasets are written contiguously to allow for memory mapping. Here is the customized store.py. This file has 90 germanium channels with 50 datasets per channel (one per variable).

output_bytype.h5 writes two 2-D ndarrays per channel, one containing all int and one containing all float. output_bytype_contiguous.h5 is the same but writes it contiguously without chunking. In reading these files, I make lgdo Table from the data to simulate what would be returned to the user. (I cheat a tiny bit and assume that each column of the 2-D array is a separate variable, though this is not necessarily a requirement.) We can see that I get a factor of x~10 speed-up from the current implementation.

Reading the HDF5 attributes (data type, shape, attributes, etc.) seems slow, as @iguinn and Luke (https://indico.legend-exp.org/event/1242/#9-chunking-block-reads-and-fil) pointed out. @gipert noticed this also in #71 where some of the most time-consuming calls are to the attribute reading. I timed h5dump (redirecting its output to a text file to avoid overhead from printing to terminal) which I think is a direct HDF5 C call and it takes roughly identical time. So that to me seems like it is an underlying HDF5 problem that the attribute loading is so slow. It makes sense that this overhead would be negligible for very large datasets, but this is not our situation.

The memory mapping reading is much faster but comes with some limitations, namely that the data must be contiguous and thus we can't append rows to it. I was encouraged to see an h5py read of output_bytype.h5 was nearly as fast because this file is chunked and can allow for appending rows. I don't understand the extremely high h5py info time on the memory-mapped read of output_customstore.lh5 (second to last) and maybe the function there is not very good. Since the h5dump suggests it should be 0.5 s faster, I might take a lower time instead and assume the function can be improved.

Per a suggestion from @SamuelBorden, it might be possible to shave off some time on the h5py attribute gathering if we stored all the attributes at the top level. We could reorganize to have the groups hit/channel with all attributes and dtypes stored at the hit group (since they are the same for all channels anyway). This would eliminate the unnecessary hit subgroup from each channel, too, though I don't know if it really has an impact on performance. Although, I'm not sure how much this suggestion would actually help, since I guess we still need to get the shape from the individual datasets (unless we are very certain that all channels must have the same number of events). So maybe it is worth looking at or maybe not.

I really have no idea why it takes so long to create LGDO objects out of output_customstore.lh5 when reading with h5py. There is a ridiculous 0.3 s increase in time when adding attributes to these objects, too. I am using nearly the same code as for output_bytype.h5 though I do need to cast a bunch of columns to bool so lh5.store doesn't yell at me, so maybe that's related. I guess we need to do that in lh5.read, too. I forgot that the Julia analysis will not use the hit files and above, so I would recommend we store these columns as actual bool and avoid the type-casting.

@jasondet

LH5Store read
file name: output_customstore.lh5
number of channels:  90
number of datasets per channel: 50
total time to load attributes, read data, make LGDO objects: 2.926627 s
h5dump time: 0.323647 s

h5py read
file name: output_customstore.lh5
number of channels:  90
number of datasets per channel: 50
get h5py info 0.471414 s
read data with h5py 0.579042 s
make LGDO objects 0.479138 s
total: 1.529594 s
h5dump time: 0.337380 s

h5py read (no attrs on LGDO objects)
file name: output_customstore.lh5
number of channels:  90
number of datasets per channel: 50
get h5py info 0.467837 s
read data with h5py 0.565830 s
make LGDO objects (no attrs attached) 0.194555 s
total: 1.228222 s
h5dump time: 0.344106 s

h5py read
file name: output_bytype.h5
number of channels:  90
number of datasets per channel: 2
get h5py info 0.048717 s
read data with h5py 0.124666 s
make LGDO objects 0.089599 s
total: 0.262982 s
h5dump time: 0.053751 s

memory mapped read
file name: output_customstore.lh5
number of channels:  90
number of datasets per channel: 50
get h5py info 0.832594 s
mem map reads 0.030091 s
make LGDO objects 0.068532 s
total: 0.931217 s
h5dump time: 0.323342 s

memory mapped read
file name: output_bytype_contiguous.h5
number of channels:  90
number of datasets per channel: 2
get h5py info 0.067780 s
mem map reads 0.008713 s
make LGDO objects 0.145769 s
total: 0.222261 s
h5dump time: 0.047205 s
iguinn commented 5 months ago

I tried writing a very simple version of the test where it takes a predefined LGDO structure and reads the data in directly, with fewer nested lookups. Here's that code:

def get_obj_paths(lgdo, base):
    """Crawl through the lh5 tree and find the objects we need to read in"""
    try:
        buf = lgdo.nda
        return {base: buf}
    except:
        vals = {}
        for key in lgdo:
            vals.update(get_obj_paths(lgdo[key], f"{base}/{key}"))
        return vals
#print(get_obj_paths(chobj, f'{ch}/hit'))

start = time.time()
ct = 0
# read normal (compressed) LH5 file to pre-defined obj buffer
with cProfile.Profile() as pr:
    for ch in geds_list:
        lgdo2read = get_obj_paths(chobj, f'{ch}/hit')
        h5f = store.gimme_file(input_file)
        try:
            for key, buf in lgdo2read.items():
                buf.resize(len(h5f[key]))
                h5f[key].read_direct(buf)
            ct += 1
        except:
            pass
    stats = pstats.Stats(pr)
    stats.sort_stats('tottime').print_stats(10)
print(f'read compressed file into pre-defined buffer ({ct} tables read) ', time.time() - start)

The get_obj_paths is just mapping out the LGDO object and finding all the ndarrays we want to read into and mapping them to a full path (e.g. ch#/hit/field). That mapping takes very little time. From there it just directly reads things into the existing buffer without any checks for attrs or anything like that. This ended up taking >4 times longer than the compressed file organized by type (0.69 s vs 0.15 s), but was still ~3 times faster than reading it "normally" (2.1 s). Note that due to the coaxes having different fields, I had to add a try-except to break out of that; the vast majority of channels were still fully read in.

Here's the profile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     4446    0.192    0.000    0.278    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py:1001(read_direct)
     8898    0.121    0.000    0.316    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/group.py:348(__getitem__)
     8892    0.056    0.000    0.088    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py:636(__init__)
     8982    0.030    0.000    0.058    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/files.py:376(__init__)
     8892    0.028    0.000    0.029    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/dataset.py:462(shape)
     8892    0.027    0.000    0.029    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/filters.py:294(get_filters)
     8892    0.015    0.000    0.015    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/selections.py:112(__init__)
    35658    0.014    0.000    0.019    0.000 <frozen importlib._bootstrap>:1053(_handle_fromlist)
     8892    0.013    0.000    0.077    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/base.py:282(file)
     4446    0.009    0.000    0.012    0.000 /home/iguinn/.local/lib/python3.10/site-packages/h5py/_hl/selections.py:236(expand_shape)

So now, the greatest single time contributor is read_direct, which is good to see. It seems there are still a lot of nested lookups in group.__getitem__, so if it's at all possible to remove those, that could be a place for further improvement. I think the 0.192 s taken by read_direct is probably a decent reflection of the total time spent on actual I/O rather than python overhead (so longer than Louis's structured array class, but not by that much). That function is still only taking 1/3 of the time, though. I wonder if h5py has some sort of internal function that would let us bypass a lot of its data structures, which seem to be creating a lot of overhead with various init and getitem calls?

If we did want to go this route, I think we'd have to separate out the numpy array I/O from the attrs and the construction of the LGDO array. You could imagine splitting LH5Store.read into two steps, where the first step maps out the data structure and read in the attrs, and the second step fills the buffer. The first step could be bypassed in cases where you are reading a very large number of files (which is the case where the slowness of this all matters).
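
A rough sketch of what that two-step split could look like (reusing the hypothetical get_obj_paths helper above; this is not an existing LH5Store API):

import h5py

def map_structure(store, lh5_file, group):
    # step 1: pay the attrs/metadata cost once by letting the existing reader
    # build the LGDO structure, then record the flat dataset paths and buffers
    lgdo_obj, _ = store.read(group, lh5_file)
    return lgdo_obj, get_obj_paths(lgdo_obj, group)

def fill_buffers(lh5_file, path_to_buf):
    # step 2: for every further file, skip the attrs and only move data
    # (assumes the datasets have the same length in each file)
    with h5py.File(lh5_file, mode="r") as f:
        for path, buf in path_to_buf.items():
            f[path].read_direct(buf)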

oschulz commented 5 months ago

What is the total time spent on these nested lookups in a realistic use case, per file? Is it actually limiting analysis performance, esp. if files are read over the network, instead from SSD?

iguinn commented 5 months ago

I tried Louis's script on NERSC, and the difference is more pronounced. It takes 5-6 seconds to read things group-by-group and 0.1/0.4 s to read things as a structured table. I also tried adding my modification and it took 1.8 s, so the distance between this and the structured table increased by quite a bit. Finally, I also tried moving the files into /dev/shm (which is like /tmp), and surprisingly this didn't make a big difference. I also find it very interesting that h5dump in Louis's test above is so much faster than lgdo. I'm not totally clear on why, but it really seems like the issue is with how h5py is accessing things.

NERSC has some pages with advice:

iguinn commented 5 months ago

I tried following some of NERSC's advice, and it made a pretty big improvement. Here's what I found. From Scratch:

read normal (compressed) LH5 file   5.082636833190918
read normal uncompressed LH5 file   5.151469707489014
read uncompressed file organized by type  0.05175185203552246
read compressed file organized by type  0.23870325088500977
read uncompressed file into pre-defined buffer using low level API (84 tables read)  0.684654712677002
read compressed file into pre-defined buffer using low level API (84 tables read)  0.6115982532501221
read compact file into pre-defined buffer using low level API (84 tables read)  0.3533134460449219

From CFS:

read normal (compressed) LH5 file   5.697180271148682
read normal uncompressed LH5 file   6.186885118484497
read uncompressed file organized by type  0.058754920959472656
read compressed file organized by type  0.24425339698791504
read uncompressed file into pre-defined buffer using low level API (84 tables read)  0.8775436878204346
read compressed file into pre-defined buffer using low level API (84 tables read)  0.715285062789917
read compact file into pre-defined buffer using low level API (84 tables read)  0.4205915927886963

So using the low level API made a big improvement (by a factor of ~7-8). When I print out the profiling information, it looks like the CPU time is very low with the low level API and almost all of the time is spent on file I/O. That said, there's still another factor of ~3 (more if we ignore decompression) to be gained by having larger datasets (either by merging columns as Louis did, or by having all channels in a single group as Oliver and Luigi have suggested).

The "compact file" uses an HDF5 setting that is designed for small datasets, where you write the data along with the dataset header, and it makes a factor of 2 (which is about what NERSC suggests). However, this does not allow chunking, which is required for the built-in gzip compression, so we'd have to think about whether or not that trade of is worth it (which is a moot question if we move to larger datasets).

The low level functions are basically directly calling the C-library for HDF5 (with a few added convenience functions and dunder methods). The low level functions also make all of the optimizations and tweaks possible with HDF5 available (which may or may not be the case with the more pythonic part of h5py). The documentation is here: https://api.h5py.org/. Here's what reading looked like:

    h5f = h5py.h5f.open(input_file.encode('utf-8'), h5py.h5f.ACC_RDONLY)

    for ch in geds_list:
        tab = h5py.h5g.open(h5f, ch.encode('utf-8') + b"/hit")
        try:
            for arrname, arr in chobj.items():
                ds = h5py.h5d.open(tab, arrname.encode('utf-8'))
                sp = ds.get_space()
                #arr.resize(sp.get_simple_extent_dims()[0])
                buf = np.zeros(sp.get_simple_extent_dims()[0], arr.nda.dtype)
                ds.read(sp, sp, buf)
                ds.close()
            ct += 1
        except:
            pass
        tab.close()
    h5f.close()

And here's what writing looked like (for copying the data into the compact file):

h5f = h5py.h5f.create(b"input_compact.lh5")
for ch in geds_list:
    tab, _ = store.read(f'{ch}/hit/', input_file)
    if not 'trapEmax_ctc_cal' in tab.keys(): continue

    ch_gr = h5py.h5g.create(h5f, ch.encode('utf-8'))
    hit_gr = h5py.h5g.create(ch_gr, b"hit")

    for attname, attval in tab.attrs.items():
        attval = np.array(attval.encode('utf-8'), dtype=h5py.h5t.string_dtype())
        h5attr = h5py.h5a.create(hit_gr, attname.encode('utf-8'), h5py.h5t.py_create(attval.dtype, True), h5py.h5s.create(h5py.h5s.SCALAR))
        h5attr.write(attval)

    for key, arr in tab.items():
        h5plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
        #h5plist.set_deflate() # gzip compression; requires chunking
        h5plist.set_fill_time(h5py.h5d.FILL_TIME_NEVER) #nersc suggests this, but I think it might only matter for large datasets
        h5plist.set_layout(h5py.h5d.COMPACT) # helpful for small datasets. This can also be use to set chunked
        h5type = h5py.h5t.py_create(arr.nda.dtype)
        h5space = h5py.h5s.create_simple(arr.nda.shape)
        h5ds = h5py.h5d.create(hit_gr, key.encode('utf-8'), h5type, h5space, h5plist)
        h5ds.write(h5space, h5py.h5s.ALL, arr.nda)

        for attname, attval in arr.attrs.items():
            if isinstance(attval, str):
                attval = np.array(attval.encode('utf-8'), dtype=h5py.h5t.string_dtype())
                h5attr = h5py.h5a.create(h5ds, attname.encode('utf-8'), h5py.h5t.py_create(attval.dtype, True), h5py.h5s.create(h5py.h5s.SCALAR))
            else:
                h5attr = h5py.h5a.create(h5ds, attname.encode('utf-8'), h5py.h5t.py_create(attval.dtype), h5py.h5s.create(h5py.h5s.SCALAR))
            h5attr.write(attval)
h5f.close()

TL;DR: By switching over to the low level API within h5py, we can greatly speed up our code by a factor of almost 10. A further speedup of 2x is available from using "compact" writing of small groups. A speedup of 3-4 is available from writing larger groups, either by saving tables as a single dataset or by saving all channels in a single dataset.

oschulz commented 5 months ago

this does not allow chunking, which is required for the built-in gzip compression,

I would suggest that we unchunk (via h5repack) the default production HDF5 datafiles after they've been written, in general (some files we may need to write chunked first due to memory constraints). Such files can then be read via memory mapping. Obviously we can't do that right now with files that contain gzipped waveforms. I'm working on a new waveform compression that doesn't need gzip and should provide a (hopefully) good compression ratio. Not sure if that'll be ready in May, I'll keep tinkering.

iguinn commented 5 months ago

I think the waveforms are the only things that are worth chunking/compressing. I just tested h5repack using the same file Louis was using for testing (raw was 1.6 G, dsp was 113M, hit was 35M, all others <1M). After repacking to compact, it ended up at 112M, even though there was no gzip compression; after re-gzipping with h5repack, it actually ended up bigger (I didn't bother trying to optimize the chunking/compression parameters). The pht tier was 35M compact and 33M after re-compressing. So compression is gaining us essentially nothing with these files (which doesn't surprise me too much). My guess is the boolean arrays are the only things that are seeing a significant benefit from compression, since these are stored using 8 bits; we could easily write a compression routine to store 8 bools per char instead of 1. If we switched to the low-level API for h5py and compacted most of our data, this would be ~15x faster than our current read speeds; Louis's tests show that there's more improvement possible, but it's not nearly as big a difference as the one relative to our current scheme.
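
That bool-packing idea is essentially numpy's packbits/unpackbits (a sketch with a made-up flag column):

import numpy as np

flags = np.random.rand(3012) < 0.5                                # hypothetical boolean column
packed = np.packbits(flags)                                       # 8 bools per byte on disk
restored = np.unpackbits(packed, count=len(flags)).astype(bool)   # lossless round trip
assert np.array_equal(flags, restored)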

For the waveforms, chunking/compressing will make a real difference; I'll be interested in seeing how your algorithm performs. Some neutron folks here at ORNL also developed an algorithm that might make an improvement for our files as well.

Edit: Actually, if we are not compressing, the better comparison isn't Louis's 20x speedup with the compressed file, but the 100x speedup with the uncompressed one. It's worth noting that this test doesn't include splitting up the structure into the dict-like table, which may be noticeable, but I'd guess it would still be quite a bit faster than the compressed read.

oschulz commented 5 months ago

For the waveforms, chunking/compressing will make a real difference; I'll be interested in seeing how your algorithm performs.

Won't be anything fancy, just a simple fixed LPC followed by an ANS entropy coder. But since the diffs (resp. LPC residuals) of our waveforms follow very predictable distributions, I think we can get near-optimal performance. I think it can also be quite fast. Just the ANS implementation is a bit tricky, I want it very simple, to make it easy to port across languages, without getting too mem-hungry.
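
To illustrate just the prediction step (the ANS entropy-coding stage is not sketched here, and the waveform values are made up), an order-1 fixed predictor reduces a slowly varying waveform to small, losslessly invertible residuals:

import numpy as np

wf = (1000 + 50 * np.sin(np.linspace(0, 3, 1024))).astype(np.int32)  # hypothetical waveform
residuals = np.diff(wf, prepend=0)    # first sample kept as-is, then sample-to-sample diffs
reconstructed = np.cumsum(residuals)  # lossless inverse of the differencing
assert np.array_equal(reconstructed, wf)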

gipert commented 4 months ago

@iguinn as far as I understand switching to the HDF5 low-level API has no downside and will speed up things significantly. Should we proceed with the implementation?

lvarriano commented 4 months ago

@iguinn Can you post your full script or check if I've implemented what you wrote correctly? My understanding from your recent comment where you used NERSC's advice is that this only tested the read, not getting attributes or creating the LGDO objects (with the idea being that you read into the predefined buffer of an existing LGDO object, I guess).

I added the code you posted to my script in an attempt to check this, but maybe I am misunderstanding how to do it correctly. I see only a ~30% increase in read speed using the low-level HDF5 read versus h5py read. My full script is here (readtest.txt) but I removed some outputs related to memory mapped reads. For a standard (compressed) LH5 pht file, the read speed with h5py is 0.9 seconds (you show in an earlier comment that compressed and uncompressed speeds are about the same). With the low-level HDF5 read, it takes 0.6 seconds. This is comparing reads only, not collecting attributes or making LGDO objects. In addition, the h5py reads I am doing are creating new numpy arrays not reading directly into a buffer, so there could be a little improvement there anyway.

(Also, maybe you know already why the low level HDF5 call can't read 6 channels.)

LH5Store read
file name: /global/cfs/cdirs/legend/users/varriano/output_customstore.lh5
number of channels:  90
number of datasets per channel: 50
total time to load attributes, read data, make LGDO objects: 3.681001 s
h5dump time: 0.378417 s

h5py read
file name: /global/cfs/cdirs/legend/users/varriano/l200-p06-r000-phy-20230619T074210Z-tier_pht.lh5
number of channels:  90
number of datasets per channel: 50
get h5py info 0.603970 s
read data with h5py 0.913510 s
make LGDO objects 0.509828 s
total: 2.027308 s
h5dump time: 0.507049 s

h5py read
file name: /global/cfs/cdirs/legend/users/varriano/output_customstore.lh5
number of channels:  90
number of datasets per channel: 50
get h5py info 0.541323 s
read data with h5py 0.646147 s
make LGDO objects 0.534137 s
total: 1.721608 s
h5dump time: 0.380140 s

h5py read (no attrs on LGDO objects)
file name: /global/cfs/cdirs/legend/users/varriano/output_customstore.lh5
number of channels:  90
number of datasets per channel: 50
get h5py info 0.548719 s
read data with h5py 0.652933 s
make LGDO objects (no attrs attached) 0.208819 s
total: 1.410471 s
h5dump time: 0.441594 s

h5py read
file name: /global/cfs/cdirs/legend/users/varriano/output_bytype.h5
number of channels:  90
number of datasets per channel: 2
get h5py info 0.033625 s
read data with h5py 0.038451 s
make LGDO objects 0.079486 s
total: 0.151563 s
h5dump time: 0.031230 s

error:  ch1107203
error:  ch1107204
error:  ch1107205
error:  ch1108800
error:  ch1108801
error:  ch1120005

read using Ian's method
file name: /global/cfs/cdirs/legend/users/varriano/output_compact.lh5
number of channels:  90
number of datasets per channel: 2
read data with underlying h5d 0.297002 s

error:  ch1107203
error:  ch1107204
error:  ch1107205
error:  ch1108800
error:  ch1108801
error:  ch1120005

read using Ian's method
file name: /global/cfs/cdirs/legend/users/varriano/l200-p06-r000-phy-20230619T074210Z-tier_pht.lh5
number of channels:  90
number of datasets per channel: 2
read data with underlying h5d 0.644487 s

error:  ch1107203
error:  ch1107204
error:  ch1107205
error:  ch1108800
error:  ch1108801
error:  ch1120005

read using Ian's method
file name: /global/cfs/cdirs/legend/users/varriano/output_customstore.lh5
number of channels:  90
number of datasets per channel: 2
read data with underlying h5d 0.379961 s
iguinn commented 4 months ago

I think one important caveat is that I haven't done any testing of the low-level API on attrs, so I'm not sure if we should expect similar gains there. In my tests, I was using the already defined lgdo structure to fetch this kind of information. Since this involves small variable length strings, I could believe this will not be that fast.

If reading the attrs cuts into the gains from using the low level API, we could also consider separating out the reading of attrs to form the LGDO structures from the actual reading of data into the structures. This would be a benefit for reading large number of files (which is when the file read speed is a real issue). However, this would also require refactoring read.

For the actual reading of data into the arrays, I think we should proceed. The API is more "C-like" than pythonic and involves a little bit more boilerplate stuff, like explicitly opening datasets within files and setting up dataspace and property structures. The overall flow of things should be very similar, though.

iguinn commented 4 months ago

TestIO.zip Here's the scripts for testing (I split the writing and reading of files up from your original one). I did my testing on perlmutter.

I added the code you posted to my script in an attempt to check this, but maybe I am misunderstanding how to do it correctly. I see only a ~30% increase in read speed using the low-level HDF5 read versus h5py read.

It actually looks like you are getting better performance than me in just the data reading portion of h5py (I commented out the tests, but those took a factor of 2-3 longer than the low level interface for me). In fact you're getting quite a bit better performance out of h5py generally than I did (5 s vs 2 s)

In addition, the h5py reads I am doing are creating new numpy arrays not reading directly into a buffer, so there could be a little improvement there anyway.

My low level code does the same. I think this is somewhat unavoidable because we have to resize the buffers even if we reuse them. One possible solution would be to do what C++ vectors do, and have a capacity (the size of the buffer) that can be different from the length of the array itself, which makes resizing much quicker as long as the capacity is already larger than the resize. This would require refactoring a lot of our data structures though.
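
A minimal sketch of that capacity idea (not an existing LGDO class):

import numpy as np

class GrowableBuffer:
    # keeps a capacity larger than the logical length, so per-file resizes
    # rarely have to reallocate or copy
    def __init__(self, capacity, dtype):
        self._buf = np.empty(capacity, dtype=dtype)
        self.length = 0

    def resize(self, n):
        if n > len(self._buf):  # grow geometrically only when needed
            new = np.empty(max(n, 2 * len(self._buf)), dtype=self._buf.dtype)
            new[: self.length] = self._buf[: self.length]
            self._buf = new
        self.length = n

    @property
    def nda(self):
        return self._buf[: self.length]  # view onto the filled part, no copy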

Also, maybe you know already why the low level HDF5 call can't read 6 channels.

My low level example fails on those channels because they are missing a column and I was too lazy to handle that case properly.

lvarriano commented 4 months ago

Thanks, Ian!

I added the code you posted to my script in an attempt to check this, but maybe I am misunderstanding how to do it correctly. I see only a ~30% increase in read speed using the low-level HDF5 read versus h5py read.

It actually looks like you are getting better performance than me in just the data reading portion of h5py (I commented out the tests, but those took a factor of 2-3 longer than the low level interface for me). In fact you're getting quite a bit better performance out of h5py generally than I did (5 s vs 2 s)

I see - this is because your script uses the LH5Store.read method instead of the h5py read method, so it is also getting attributes and making the LGDO structures. This is the reason I split out these different steps in my script, to see where each one is taking time. So the improvement we would expect from changing the h5py calls to the underlying HDF5 calls is only ~30% on the read itself. Given that the read is at most only ~1/3 of the LH5Store read call, the actual speed increase we would expect is perhaps ~10%.

I would suggest we do not make any change for the moment to read speed and have a discussion at the collaboration meeting. I would not tackle this problem piecemeal, and we should come up with some comprehensive approach. I'll spend some time in advance of the collaboration meeting putting together a summary and proposal.

iguinn commented 4 months ago

Reading more about HDF5, I've found another thing I'd like to try, which is paged aggregation, described here: https://docs.hdfgroup.org/hdf5/rfc/paged_aggregation.pdf. Basically what this does is combine multiple datasets into a single memory page, which seems very similar to your output-by-type strategy, so I think it may see similar performance. I'll try to put a test together before the meeting next week.

oschulz commented 4 months ago

That sounds interesting - is that just a proposal, or is that part of current HDF5 versions?

iguinn commented 4 months ago

It's part of current HDF5. It's also in this document https://github.com/HDFGroup/arch-doc/blob/main/An_Overview_of_the_HDF5_Library_Architecture.v2.pdf which may be more up to date. It seems like h5py has a way to use it in both the high and low level interfaces, too
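
For reference, in the high level interface this appears to be exposed through keyword arguments of h5py.File (a sketch, assuming a recent h5py built against HDF5 >= 1.10; file and dataset names are hypothetical):

import h5py
import numpy as np

# write with the PAGE file-space strategy so small datasets share pages
with h5py.File("paged.lh5", "w", fs_strategy="page", fs_page_size=1024 * 1024) as f:
    f.create_dataset("ch000/hit/energy", data=np.random.normal(1000, 5, 3012))

# readers can then enable a page buffer sized to a multiple of the page size
with h5py.File("paged.lh5", "r", page_buf_size=16 * 1024 * 1024) as f:
    energy = f["ch000/hit/energy"][:]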

oschulz commented 4 months ago

On the Julia side, it seems one can set a "page size" - I hope that's it.

iguinn commented 2 months ago

I've finally had some time to try testing some strategies for speeding things up over the last few days. Here's a few things I think we can do:

1) Minimize calls to hdf5 functions, since even if you are directly interacting with the library these are slow due to all sorts of overhead associated with navigating through the metadata for each object. This can be done by having our recursive functions pass datasets and groups rather than files and object names. In a test hit file, this makes a factor of ~2. See pull request: https://github.com/legend-exp/legend-pydataobj/pull/97

2) Replace the h5py high level library with the low level library internally when loading datasets, groups, and attributes. This seems to make another factor of ~2 in my testing, with caveats:

So between all of these, there may be up to an 8-fold speedup.

lvarriano commented 2 months ago

Thanks, Ian! 1. is very clever and we should certainly do that. Could you post some of your test code for the other items? That could help me understand how it interacts with #94 and whether that PR is still useful. @isaackunen may be interested as well.

isaackunen commented 2 months ago

Yeah, I'm definitely interested in this as well.

One thing that's been hard for me to wrap my head around is what exactly are the scenarios that we care about -- and as Louis's testing for his PR showed, the performance profile is very dependent on exactly what's being done. I think it would be helpful for us to pull together a shared, representative script.

I'll be out of town (and mostly out of touch) for a few weeks, but I could pick this up when I get back.

iguinn commented 2 months ago

TestReadLH5.json

Here's the test file. Note you will want to change the file extension to ipynb (github doesn't accept ipynb attachments). After rerunning it I got less benefit than before... Now it's looking like a factor of 3 between my second two suggestions, and it seems like the paged buffering is making less of a difference than it did earlier.

Basically, I wrote a few test functions for reading in a file and then did %timeit on them. I also have a cell where I profile one of them (right now it's the LGDO one; this was how I identified the issue of how we were recursing through things: it was calling attrs.__getitem__ and group.__getitem__ way too many times).

I also found that for the most part using the low level interface wasn't difficult. The two things that are significantly more annoying than the high level interface are indexing, where you have to go through the dataspace class (h5py.h5s); it's not that terrible, but the pythonic interface is more intuitive. The other thing is dealing with attributes' datatypes and loading them in. Fortunately, the low level h5py interface is augmented beyond the bare C with a few convenience functions for working with numpy (like getting numpy.dtypes instead of the C-structs).

iguinn commented 2 months ago

Here's another pull request with performance improvements: https://github.com/legend-exp/legend-pydataobj/pull/100 And here's an update to the above tester: TestReadLH5.json

The biggest change was changing read to use the low level API. After all of these changes, lgdo ends up taking ~900 ms for the test file (compared to ~3 s before all changes, so overall a bit less than I expected). The overall breakdown of the timing is now less than half in reading attributes, but those are still the biggest factor (it's ~350 ms for attrs, 250 ms for datasets/ndarrays, and 200 ms for navigating groups). Comparing LGDO post-changes to my bare-minimum reading, it took almost twice as long (~500 ms for the bare minimum); I'm not exactly sure where the rest of the time goes, but I think it's doing the same set of h5py operations, which would imply that it's from creating LGDO structures, allocating memory, and parsing attributes. That seems long compared to what Louis found before, though.


While trying out different tweaks for hdf5, I also noticed an interesting feature that would work very well with @lvarriano's structured table above, which is virtual datasets (https://docs.hdfgroup.org/hdf5/v1_14/_v_d_s.html, https://docs.h5py.org/en/stable/vds.html). Basically, this lets you add a dataset to a file that doesn't contain its own data but instead references one or more other datasets. The idea of using this with the structured table would be to have one dataset contain all of the table data, and then for each field have a virtual dataset that references a block of the big table, but otherwise acts like a normal LGDO Array. This would have two really nice features: 1) It's possible to access a single column without loading the full table (well, depending on how it's laid out on disk it's still loading a lot of extra data) 2) It can be interfaced with as a normal Table, even without using LGDO or the julia library, which is good for data sharing.

The way this would be structured would be something like:

data: (`attr datatype=table{list, of, elements}`)
-- _raw_data: attr describing layout in a way that can be parsed by LGDO
   - dataset of size nxm
-- col1: attr datatype=array<type>, ...
   - virtual dataset pointing to block (0, 0) -> (l_0, n_rows)
-- col2: attr datatype=array<type>, ...
  - virtual dataset pointing to block (l_0, n_rows) -> (l_1, n_rows)
...

If possible it would be great to have the dataset just be bytes, and then the l_0, l_1... in the above would be based on the size of the datatype. That said, the HDF5 documentation seems to recommend against type casting; in principle, if it's stored as bytes you could do a trivial type cast, but I'm not sure if HDF5 supports this. When accessing this, if you are loading out the full array, LGDO would directly read the dataset and then use numpy to sort it into different LGDO objects the way Louis described above, and when reading a single element, you could just use the virtual datasets. I haven't tried any of this though, so I'm not sure if it would work as seamlessly as I'm hoping. We're still quite far from the performance Louis was able to demonstrate above, so this could easily make another factor of >5 improvement.
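
For completeness, here is a rough, untested sketch of that idea using h5py's virtual dataset API (VirtualSource/VirtualLayout); the file, group, and column names are made up, and the attrs that LGDO would need are omitted:

import h5py
import numpy as np

n_rows, n_cols = 3012, 38
data = np.random.normal(size=(n_rows, n_cols))   # stand-in for the packed table

with h5py.File("vds_example.h5", "w") as f:
    raw = f.create_dataset("ch000/hit/_raw_data", data=data)
    src = h5py.VirtualSource(raw)
    # expose the first two columns as per-field virtual datasets
    for i, name in enumerate(["trapEmax", "baseline"]):
        layout = h5py.VirtualLayout(shape=(n_rows,), dtype=data.dtype)
        layout[:] = src[:, i]                    # column i of the big dataset
        f.create_virtual_dataset(f"ch000/hit/{name}", layout)

with h5py.File("vds_example.h5", "r") as f:
    energy = f["ch000/hit/trapEmax"][:]          # reads like an ordinary 1-D dataset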

lvarriano commented 2 months ago

That's great, Ian! It's great that the lower level API made such a big impact. I haven't had time to look very much at your script yet, but I'll try to take a look at your newer version soon. I made some timing checks with the #94 metadata pull request already, and I'll see about this new pull request when I get a chance. If the speed-up is generally so good, it probably doesn't make sense to use #94 and add such complexity - the virtual dataset idea sounds very interesting, especially if you could store attributes of the other datasets with the virtual dataset and avoid having to load them. Maybe it handles that already.

Regarding my previous testing - just in case you are looking at the memory mapping times, Isaac (@isaackunen) told me that I was not doing something correctly and the times are un-physically small for the memory-mapped reads that I report (and he seems right when comparing to my SSD specs). I'm still not exactly sure where I went wrong (lazy reading of data?), but I just wanted to note it in case those are the times you are aiming for. I think the other times I reported for the different reading strategies should still be correct.

isaackunen commented 2 months ago

I'm pretty sure that in Linux, memory mapping is lazy, so unless you actually read a page, the system won't actually pull it in. In my tests, I made sure to force a read and saw some speed improvement with memory mapping, but <2x. It seems possible to do better than that, but what bandwidth are you seeing? This feels like it might be higher than the disk bandwidth. -Isaac


iguinn commented 1 month ago

After doing some more tests, I've found that the gain is not that great when reading a lot of files from NERSC. A big reason for this is that when using %timeit to test these, the files are getting cached by the DVS system, so reopening them on subsequent runs is much faster. Another big reason for this is that when reading a ton of files, we usually don't read the whole files, but a few datasets from them. Here's the timing, where I first try 6 datasets from one channel in all of DS8's calibration runs:

First try: Took 182.83761382102966 s to read 856 files
%timeit: 36.3 s ± 3.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

So basically, the above changes are really only going to make a difference if we are loading a large fraction of the data from the file, so that this loading time dominates the caching time. One thing that helped a little was opening the files before reading from them (reduced these times to 150 s and 18 s).

The good news is there is a good way to drastically speed up this caching process, using the read-only mount (see https://www.nersc.gov/assets/Uploads/DVS-for-NUG.pdf and https://docs.nersc.gov/performance/io/dvs/). This is done by reading from /dvs_ro/cfs/... instead of /global/cfs/.... While convincing people to change how they load data may be a challenge, we can also set up the data loader to load from dvs_ro fairly easily, so that once it is ready to be a commonly used tool people will get good performance. However, this mount requires disabling file locking, which hdf5 enables by default. There are two ways to do this. One is with the environment variable:

%env HDF5_USE_FILE_LOCKING=FALSE

And the other is when constructing the file: h5py.File(..., locking=False). Doing this reduced the time to 12 s for the first read and 10 s for subsequent reads!

Here's the notebook I was using to test this: TestBulkFileRead.json

TL;DR: We should have LGDO open files without file locking in read mode, and tell people to read data from /dvs_ro/cfs/... instead of /global/cfs/...