If we do not want to address this in the code (which would possibly increase the amount of memory required to load files), I suggest we add a warning to the documentation. The user can always load the entire object and then perform the slicing themselves.
One thing that probably has an effect here is that when data is read from disk, it tends to be read in contiguous 4 kB (4096-byte) pages. This means that if the spacing between entries is less than one page, the full file is still being read. The waveforms are big enough that they should be spread across pages, while the other fields probably use only one or two pages for the full file. So a couple of other things that might be worth checking:
Did we ever think carefully about how to chunk datasets on disk? This is also relevant for HDF5 compression.
https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage
1. What if you just look at waveforms?
It is the same slowdown. Loading just the waveforms (using the same phy file) prints the following:
read entire: 0.02775 +- 0.01266 s
read by idx, all 1739 rows: 0.02262 +- 0.00170 s
read by idx, 1738 rows to read with 1 breaks: 0.02238 +- 0.00227 s
read by idx, 1737 rows to read with 2 breaks: 0.02199 +- 0.00125 s
read by idx, 1734 rows to read with 5 breaks: 1.82485 +- 0.02439 s
read by idx, 1729 rows to read with 10 breaks: 1.96193 +- 0.13411 s
read by idx, 1719 rows to read with 20 breaks: 1.82573 +- 0.01488 s
read by idx, 1689 rows to read with 50 breaks: 1.79066 +- 0.02706 s
read by idx, 1639 rows to read with 100 breaks: 1.76084 +- 0.07120 s
read by idx, 1539 rows to read with 200 breaks: 1.62478 +- 0.01129 s
read by idx, 1239 rows to read with 500 breaks: 1.33234 +- 0.00841 s
read by idx, 739 rows to read with 1000 breaks: 0.81090 +- 0.00506 s
For reference, I changed lines in the script above to e.g.
chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=theseind, field_mask=['waveform'])
2. What if you look at a very large file and create much larger gaps?
I used a cal file ('l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5') that has 8235 events. The slowdown is actually worse, about 30-40x slower, and it gets even worse if the gaps are larger (up to 80x!). Keep in mind that I am reading only a single channel. The code below prints
read entire: 0.19241 +- 0.00773 s
read by idx, all 8235 rows: 0.29639 +- 0.00475 s
read by idx, 8234 rows to read with 1 breaks of size 1: 0.29881 +- 0.00724 s
read by idx, 8233 rows to read with 2 breaks of size 1: 0.30981 +- 0.02371 s
read by idx, 8230 rows to read with 5 breaks of size 1: 8.84480 +- 0.16213 s
read by idx, 8225 rows to read with 10 breaks of size 1: 8.78127 +- 0.10855 s
read by idx, 8215 rows to read with 20 breaks of size 1: 9.05398 +- 0.36159 s
read by idx, 8225 rows to read with 1 breaks of size 10: 8.91650 +- 0.21048 s
read by idx, 8215 rows to read with 2 breaks of size 10: 9.19139 +- 0.42725 s
read by idx, 8185 rows to read with 5 breaks of size 10: 9.54173 +- 0.49595 s
read by idx, 8135 rows to read with 10 breaks of size 10: 9.45252 +- 0.59045 s
read by idx, 8035 rows to read with 20 breaks of size 10: 8.92942 +- 0.03804 s
read by idx, 8135 rows to read with 1 breaks of size 100: 9.00981 +- 0.20564 s
read by idx, 8035 rows to read with 2 breaks of size 100: 9.06565 +- 0.15981 s
read by idx, 7735 rows to read with 5 breaks of size 100: 10.00159 +- 0.59773 s
read by idx, 7235 rows to read with 10 breaks of size 100: 11.66310 +- 0.24250 s
read by idx, 6235 rows to read with 20 breaks of size 100: 15.15527 +- 0.22353 s
read by idx, 7235 rows to read with 1 breaks of size 1000: 12.01743 +- 0.36342 s
read by idx, 6235 rows to read with 2 breaks of size 1000: 15.81197 +- 0.61867 s
read by idx, 3235 rows to read with 5 breaks of size 1000: 15.53589 +- 0.40457 s
read by idx, 0 rows to read with 10 breaks of size 1000: 0.04306 +- 0.00211 s
read by idx, 0 rows to read with 20 breaks of size 1000: 0.04304 +- 0.00054 s
Again, changing line 732 in lh5_store.py restores the performance:
read entire: 0.17909 +- 0.00186 s
read by idx, all 8235 rows: 0.25028 +- 0.09423 s
read by idx, 8234 rows to read with 1 breaks of size 1: 0.21351 +- 0.00270 s
read by idx, 8233 rows to read with 2 breaks of size 1: 0.21347 +- 0.00163 s
read by idx, 8230 rows to read with 5 breaks of size 1: 0.21279 +- 0.00247 s
read by idx, 8225 rows to read with 10 breaks of size 1: 0.21245 +- 0.00310 s
read by idx, 8215 rows to read with 20 breaks of size 1: 0.21244 +- 0.00192 s
read by idx, 8225 rows to read with 1 breaks of size 10: 0.21133 +- 0.00238 s
read by idx, 8215 rows to read with 2 breaks of size 10: 0.21408 +- 0.00512 s
read by idx, 8185 rows to read with 5 breaks of size 10: 0.21166 +- 0.00131 s
read by idx, 8135 rows to read with 10 breaks of size 10: 0.21438 +- 0.00354 s
read by idx, 8035 rows to read with 20 breaks of size 10: 0.21336 +- 0.00366 s
read by idx, 8135 rows to read with 1 breaks of size 100: 0.21335 +- 0.00283 s
read by idx, 8035 rows to read with 2 breaks of size 100: 0.21340 +- 0.00229 s
read by idx, 7735 rows to read with 5 breaks of size 100: 0.21315 +- 0.00322 s
read by idx, 7235 rows to read with 10 breaks of size 100: 0.20953 +- 0.00246 s
read by idx, 6235 rows to read with 20 breaks of size 100: 0.20589 +- 0.00156 s
read by idx, 7235 rows to read with 1 breaks of size 1000: 0.21116 +- 0.00428 s
read by idx, 6235 rows to read with 2 breaks of size 1000: 0.20491 +- 0.00171 s
read by idx, 3235 rows to read with 5 breaks of size 1000: 0.19387 +- 0.00156 s
read by idx, 0 rows to read with 10 breaks of size 1000: 0.18268 +- 0.00153 s
read by idx, 0 rows to read with 20 breaks of size 1000: 0.18392 +- 0.00217 s
import lgdo.lh5_store as lh5
import numpy as np
from matplotlib import pyplot as plt
import time

raw_file = 'l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5'
test_channel = 'ch1104002'
numtests = 10

store = lh5.LH5Store()

# baseline: read the entire object
times_readall = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file)
    end = time.time()
    times_readall.append(end-start)
print(f"read entire: {np.mean(times_readall):.5f} +- {np.std(times_readall):.5f} s")

totind = len(chobj)
allind = np.arange(totind)

# read by idx with an index list covering every row
times_readallidx = []
for i in range(numtests):
    start = time.time()
    chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=allind)
    end = time.time()
    times_readallidx.append(end-start)
print(f"read by idx, all {len(allind)} rows: {np.mean(times_readallidx):.5f} +- {np.std(times_readallidx):.5f} s")

# read by idx with evenly spaced rows removed ("breaks")
numbreaks = np.array([1, 2, 5, 10, 20])
breaksizes = np.array([1, 10, 100, 1000])
for k in range(len(breaksizes)):
    for j in range(len(numbreaks)):
        breaks = np.round(np.linspace(0, len(allind) - 1, breaksizes[k]*numbreaks[j])).astype(int)
        theseind = np.delete(allind, breaks)
        times_readsome = []
        for i in range(numtests):
            start = time.time()
            chobj, _ = store.read_object(test_channel+'/raw', raw_file, idx=theseind)
            end = time.time()
            times_readsome.append(end-start)
        print(f"read by idx, {len(theseind)} rows to read with {numbreaks[j]} breaks of size {breaksizes[k]}: {np.mean(times_readsome):.5f} +- {np.std(times_readsome):.5f} s")
I take back my suggestion above about just adding a warning to the documentation: I think we should change lh5_store.py to read the whole object in. In most (all?) cases, the user will read in the data channel by channel in some loop (I'm not familiar enough to know if it can be done differently), so the amount of memory used at any single point to load all the data is actually quite small.
From the h5py docs:
An HDF5 dataset created with the default settings will be contiguous.
And we are not setting any chunking option! (maxshape is set to the actual full object length above, correct me if I am wrong.) So I don't expect any performance/memory gain in indexed reading compared to reading the full dataset in and then indexing. I don't understand how it can be slower, though. Later in the same docs section:
Chunking has performance implications. It’s recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.
So we should definitely do something about it (even if turning on HDF5 compression should automatically enable chunking? But with what chunk shape?). Continuing:
Since picking a chunk shape can be confusing, you can have h5py guess a chunk shape for you:
dset = f.create_dataset("autochunk", (1000, 1000), chunks=True)
Should we use this feature?
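For a concrete picture of when h5py chunks a dataset, here is a quick sketch I put together (the file and dataset names are made up); the chunks attribute shows what layout was chosen:
import h5py
import numpy as np

with h5py.File('chunking_demo.lh5', 'w') as f:
    d1 = f.create_dataset('plain', data=np.arange(1000))
    d2 = f.create_dataset('gzipped', data=np.arange(1000), compression='gzip')
    d3 = f.create_dataset('autochunk', (1000, 1000), chunks=True)
    print(d1.chunks)  # None: the default layout is contiguous
    print(d2.chunks)  # a tuple: compression requires chunking, so h5py picks a chunk shape
    print(d3.chunks)  # a tuple: chunk shape guessed by h5py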
@oschulz might also have an opinion about all this.
We're doing something on the Julia side to give the user control over chunking, but it's not really finished yet (https://github.com/legend-exp/LegendHDF5IO.jl/pull/35).
In general, non-chunked files are faster to read, since memory-mapping can be used, which will be good for ML applications. For "standard" data storage, chunking makes it possible to write files bit by bit, so the writer doesn't need to have a large amount of memory available. Also, as you point out, compression (I think) always implies chunking in HDF5.
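As an illustration of that "write bit by bit" pattern, here is a minimal sketch (file name, dataset path, chunk shape, and buffer size are all made up, and this is not what legend-pydataobj currently does):
import h5py
import numpy as np

with h5py.File('append_demo.lh5', 'w') as f:
    # a resizable dataset must be chunked; here the chunk shape is chosen explicitly
    d = f.create_dataset('waveform/values', shape=(0, 1000), maxshape=(None, 1000),
                         chunks=(128, 1000), dtype='uint16')
    for _ in range(10):
        buf = np.zeros((500, 1000), dtype='uint16')  # e.g. one buffer of waveforms at a time
        n = d.shape[0]
        d.resize(n + len(buf), axis=0)
        d[n:] = buf  # append without holding the full array in memory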
So I was wrong. We do always set maxshape in order to be able to append to datasets (?), so (as per the h5py docs) all our datasets are chunked. I don't know how they are chunked (I could not find documentation). Maybe the "autochunking" feature is used? I still have no clue what that means.
Note that this implies: it's impossible to write contiguous datasets to disk with legend-pydataobj at the moment.
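One way to answer the "how are they chunked" question is simply to ask h5py. A quick sketch (the file and dataset paths below are just examples from this thread; the actual LH5 layout may differ):
import h5py

with h5py.File('l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5', 'r') as f:
    dset = f['ch1104002/raw/waveform/values']  # example path, adjust to the actual layout
    print(dset.chunks)       # None if contiguous, otherwise the chunk shape used on disk
    print(dset.compression)  # compression filter, if any
    print(dset.maxshape)     # (None, ...) if the dataset was created resizable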
Here's a Stack Overflow question that was answered by one of the h5py developers: https://stackoverflow.com/questions/21766145/h5py-correct-way-to-slice-array-datasets. Apparently fancy indexing is not directly supported in HDF5 (unlike slicing, which is), so they did their own implementation of it, which he admits is very slow for >1000 elements ("abysmal" is his word choice)... So I think this supports the idea that we may want to accept higher memory usage and do the fancy indexing through numpy instead.
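In plain h5py terms, the two options being compared look roughly like this (paths and sizes are examples; the dset[idx] form corresponds to the current line 732, while dset[...][idx] is the read-then-index alternative):
import h5py
import numpy as np

idx = np.sort(np.random.choice(8235, size=5000, replace=False))  # a sorted fancy index

with h5py.File('l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5', 'r') as f:
    dset = f['ch1104002/raw/waveform/values']  # example path
    slow = dset[idx]        # HDF5 point selection done by h5py, very slow for large index lists
    fast = dset[...][idx]   # read everything into memory, then fancy-index with numpy
    assert np.array_equal(slow, fast)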
If we really wanted to optimize this, we could try to implement it with chunks in mind... Basically, look at our index list, figure out which indices are in a chunk, grab those from the chunk, and then move to the next one. This would limit the extra memory burden to the smaller chunk size. It would also let us skip chunks that don't contain any indices in our fancy index, which could speed up sparse data reading (the waveform browser could potentially benefit from this). Of course, this would be a lot of extra work, so for now let's just read in the whole group and then assess whether this would be worth it.
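For the record, here is a rough sketch of that chunk-aware idea in plain h5py/numpy (my own illustration, not anything that exists in lh5_store.py; it assumes a sorted index and chunking along the first axis):
import h5py
import numpy as np

def read_rows_chunkwise(dset, idx):
    # gather the rows listed in idx one chunk-aligned slice at a time,
    # skipping chunks that contain none of the requested rows
    idx = np.asarray(idx)
    chunk_rows = dset.chunks[0] if dset.chunks else dset.shape[0]
    out = np.empty((len(idx),) + dset.shape[1:], dtype=dset.dtype)
    nread = 0
    for start in range(0, dset.shape[0], chunk_rows):
        stop = min(start + chunk_rows, dset.shape[0])
        sel = idx[(idx >= start) & (idx < stop)]  # requested rows in this chunk
        if len(sel) == 0:
            continue  # nothing requested here, never touch this chunk
        block = dset[start:stop]                  # one contiguous, chunk-aligned read
        out[nread:nread + len(sel)] = block[sel - start]  # fancy-index in memory
        nread += len(sel)
    return out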
Looks like this behavior is even properly documented and I just ignored it or didn't worry about it when we implemented this long ago.
Louis's fix seems to handle all use cases with just a moderate burden from memory reallocation, and it's way better than what we have. I propose we just accept his fix for now and come back to this when/if it's the bottleneck again.
What a pity! I agree we clearly need Louis' workaround at this point. I'm curious to see the effect on DataLoader.load().
For interest/posterity, here's a comparison of the new implementation and the previous implementation (accessed by using the new use_h5idx flag). The script is below and uses the memory-profiler package. This is reading a single channel from 3 cal raw files on my laptop, which has an NVMe SSD and an ext4 file system (in case those are relevant for future comparisons). The partial reads are a random selection of 5000 rows. The memory penalty is ~130 MB, the size of the channel object in this cal file. I don't understand all of the spikes in the new implementation; there seem to be one or two extra compared to what I would expect. The speed-up for the partial reads is ~50x in this example.
There's currently a factor of ~2x speed penalty if the user passes in an idx that corresponds to all events, compared to just reading everything without any idx. In the future, we could add a comparison or some other check for whether the user did this. For now, it is up to the user. Maybe this is relevant for the DataLoader?
import lgdo.lh5_store as lh5
import numpy as np
from memory_profiler import profile  # provides the @profile decorator

raw_file = 'l200-p03-r003-cal-20230331T223328Z-tier_raw.lh5'
test_channel = 'ch1104002'

store = lh5.LH5Store()
chobj, _ = store.read_object(test_channel+'/raw', raw_file)
allind = np.arange(len(chobj))
theseind = np.sort(np.random.choice(allind, size=5000, replace=False))
del chobj

@profile
def new_readall(files):
    chobj, _ = store.read_object(test_channel+'/raw', files)

@profile
def orig_readall(files):
    chobj, _ = store.read_object(test_channel+'/raw', files, use_h5idx=True)

@profile
def new_readallidx(files, idxs):
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs)

@profile
def orig_readallidx(files, idxs):
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs, use_h5idx=True)

@profile
def new_readsomeidx(files, idxs):
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs)

@profile
def orig_readsomeidx(files, idxs):
    chobj, _ = store.read_object(test_channel+'/raw', files, idx=idxs, use_h5idx=True)

if __name__ == '__main__':
    new_readall([raw_file, raw_file, raw_file])
    orig_readall([raw_file, raw_file, raw_file])
    new_readallidx([raw_file, raw_file, raw_file], [allind, allind, allind])
    orig_readallidx([raw_file, raw_file, raw_file], [allind, allind, allind])
    new_readsomeidx([raw_file, raw_file, raw_file], [theseind, theseind, theseind])
    orig_readsomeidx([raw_file, raw_file, raw_file], [theseind, theseind, theseind])
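For reference (this is my assumption about how it was run; the comment above only says the memory-profiler package was used): with the explicit import above, running the script directly with python prints the line-by-line memory report for each decorated function, and the package's mprof run / mprof plot commands produce a time-resolved memory profile.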
In pull request #35, Luigi gave me an idea: convert idx to a slice when that is possible (i.e., when the elements of idx all increase by 1). This restores some speed if the user happens to pass in an idx corresponding to all rows. Now there is a ~30% speed penalty instead of ~2x.
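A minimal sketch of that idea (my own illustration of the check, not the code actually added to lh5_store.py):
import numpy as np

def idx_to_slice(idx):
    # if idx is a run of consecutive integers, return the equivalent slice so the
    # read can be a single fast contiguous selection; otherwise keep the fancy index
    idx = np.asarray(idx)
    if idx.size > 0 and np.all(np.diff(idx) == 1):
        return slice(int(idx[0]), int(idx[-1]) + 1)
    return idx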
Using the idx parameter in read_object() makes the read about 20x slower. From the lh5_store.py code, this is related to line 732, nda = h5f[name][source_sel]. I thought that this might be related to the number of discontinuities in the data, so I tested for that, and it seems somehow related. I don't understand the sudden jump in read time, though. If, instead, the entire object is read into memory before the list indexing, i.e. nda = h5f[name][...][source_sel], then the speed is about the same as reading the entire object. Note that this line does not apply if read_object is passed an obj_buf to read the data into, and some other changes will need to be made in that case.
We suspect this may be related to the DataLoader's performance (legend-exp/pygama#521), but it does not account for all of the slowdown. This seems to be a fundamental issue with HDF5 files (from the linked issue: "It looks like this performance issue could be related to these: https://forum.hdfgroup.org/t/performance-reading-data-with-non-contiguous-selection/8979 and https://github.com/h5py/h5py/issues/1597").
This is demonstrated on a single channel in a single file by the timing scripts and their output shown earlier in this thread. If line 732 is replaced as above (with nda = h5f[name][...][source_sel], so that the whole object is read before the list indexing), the read times drop back to roughly the time needed to read the entire dataset.