legend-exp / legend-pydataobj

LEGEND Python Data Objects
https://legend-pydataobj.readthedocs.io
GNU General Public License v3.0

Encoded waveforms load slower than compressed waveforms #77

Open lvarriano opened 6 months ago

lvarriano commented 6 months ago

Encoded waveforms load 2-3x slower than gzipped waveforms. If I convert the waveforms to gzip compression instead of encoding (see #76), they load 2-3x faster.

import lgdo
import h5py
import time

store = lgdo.lh5.LH5Store()

input_file = "/home/lv/Documents/uw/l200/l200-p06-r000-phy-20230619T034203Z-tier_raw.lh5"
output_file = "output.lh5"

ch_list = lgdo.lh5.ls(input_file)[2:] # skip FCConfig and OrcaHeader
geds_list = []
with h5py.File(input_file, mode='r') as f:
    for ch in ch_list:
        if 'waveform_windowed' in f[ch]['raw'].keys():
            geds_list.append(ch)

# copy data to the output file (waveforms are decoded on read and re-written with the store's default gzip compression)
for ch in ch_list:
    chobj, _ = store.read(f'{ch}/raw/', input_file)
    store.write(chobj, 'raw', output_file, f'{ch}/')

# read radware-sigcompress encoded waveform_windowed
start = time.time()
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/raw/waveform_windowed', input_file)
print('read radware-sigcompress encoded waveform_windowed ', time.time() - start)

# read gzipped waveform_windowed
start = time.time()
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/raw/waveform_windowed', output_file)
print('read gzipped waveform_windowed ', time.time() - start)

# read ZigZag encoded waveform_presummed
start = time.time()
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/raw/waveform_presummed', input_file)
print('read ZigZag encoded waveform_presummed ', time.time() - start)

# read gzipped waveform_presummed
start = time.time()
for ch in geds_list:
    chobj, _ = store.read(f'{ch}/raw/waveform_presummed', output_file)
print('read gzipped waveform_presummed ', time.time() - start)

gives

read radware-sigcompress encoded waveform_windowed  9.676453590393066
read gzipped waveform_windowed  6.671638011932373
read ZigZag encoded waveform_presummed  14.28751802444458
read gzipped waveform_presummed  5.223653078079224
lvarriano commented 6 months ago

Forgot to mention that these files are the same size, so there is no storage benefit to using encoding over gzipping.

-rw-r--r-- 1 lv lv 1.6G Mar  8 15:45 ../l200-p06-r000-phy-20230619T034203Z-tier_raw.lh5
-rw-rw-r-- 1 lv lv 1.6G Mar 14 09:00 output.lh5
lvarriano commented 6 months ago

@gipert suggested trying this on a hard drive, so I tested it on NERSC for a phy raw file. The results were a little different, with gzip being only marginally faster than radware-sigcompress, though gzip is 2x faster than ZigZag encoding. However, the size of the gzip file is now 3.1 GB compared to the original 1.6 GB (not sure why). So I guess the lesson for me is to always check stuff on a hard drive.

@gipert also suggested checking other compression filters. I tested the lzf compression algorithm on NERSC (and on my laptop to double-check). This filter performs even better than gzip in terms of both compression and speed, which seems somewhat surprising, but I don't know what this filter is designed for. On NERSC (and my laptop), a file written with lzf takes up 2.1 GB compared to the original 1.6 GB. But the read speed is amazingly better!

read radware-sigcompress encoded waveform_windowed  19.01413583755493
read lzf waveform_windowed  5.321168899536133
read ZigZag encoded waveform_presummed  25.50418496131897
read lzf waveform_presummed  3.917027235031128
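For reference, here is a minimal sketch of how the two filters can be applied with plain h5py (the file name, dataset names, and data below are placeholders, not the actual raw-file layout):

import h5py
import numpy as np

# placeholder data standing in for a block of waveform samples
waveforms = np.random.randint(0, 2**14, size=(1000, 1024), dtype=np.uint16)

with h5py.File("filters_demo.lh5", "w") as f:
    # gzip at h5py's default level (4), with the shuffle filter
    f.create_dataset("wf_gzip", data=waveforms, compression="gzip", shuffle=True)
    # lzf: typically faster to read/write, usually a lower compression ratio
    f.create_dataset("wf_lzf", data=waveforms, compression="lzf", shuffle=True)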
gipert commented 5 months ago

However, the size of the gzip file is now 3.1 GB compared to the original 1.6 GB (not sure why)

Well, this cannot be...

gzip is 2x faster than ZigZag encoding.

Yes, I have also noticed that ZZ is significantly slower than David's code. I would have originally liked to use only Sigcompress, but it was decided not to rescale the pre-summed waveform (i.e. its samples are of uint32 type), and Sigcompress only works with 16-bit integers.
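For illustration only, this is the generic ZigZag mapping (not necessarily exactly how the lgdo codec applies it): it interleaves negative and positive values so that small-magnitude signed numbers, like sample-to-sample differences, become small unsigned numbers, and it works for any integer width, whereas Sigcompress assumes 16-bit samples.

import numpy as np

def zigzag_encode(x):
    # map signed integers to unsigned: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    x = x.astype(np.int64)          # widen so x << 1 cannot overflow
    return ((x << 1) ^ (x >> 63)).astype(np.uint32)

def zigzag_decode(u):
    # inverse mapping back to signed 32-bit integers
    u = u.astype(np.int64)          # widen to avoid promotion surprises
    return ((u >> 1) ^ -(u & 1)).astype(np.int32)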

I tested the lzf compression algorithm on NERSC (and on my laptop to double check.) This filter has even better performance than gzip both in terms of compression and speed, which seems somewhat surprising but I don't know what this filter is designed for.

In principle, lzf is optimized for speed and compresses less than GZip (at, I guess, a standard compression level of 5). I'm surprised that it compresses more in our case, but it could be something specific to our waveforms (?)

Another thing: did you make sure to run the test scripts multiple times? After the first time the read speed will be much higher, since the file is cached by the filesystem. If you have admin rights, you can manually clear the filesystem cache to ensure reproducibility (I forgot the command). Also make sure that the Numba cache is enabled, so it does not need to recompile after the first time!
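For the record, on Linux the page cache can usually be dropped (root required) by writing to /proc/sys/vm/drop_caches; a minimal sketch, to be adapted to the local setup:

import os

# flush dirty pages, then drop the page cache (requires root); this is the
# usual Linux recipe, equivalent to: sync && echo 3 > /proc/sys/vm/drop_caches
os.system("sync")
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")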

gipert commented 5 months ago

If we confirm that LZF performance is so good, it could be a safe choice for the next re-processing of the raw tier. It's a well-known algorithm, so I'm sure it's well supported on the Julia side too.

lvarriano commented 5 months ago

However, the size of the gzip file is now 3.1 GB compared to the original 1.6 GB (not sure why)

Well, this cannot be...

I retried this, and it looks like I screwed up the gzip test I made on NERSC. What probably happened is that I accidentally copied the file twice, appending the rows (hence the 2x file size and the 2x longer read time compared to what I had previously seen). I redid the test and the results are in line with what I originally found on my laptop. I attached my code in case anyone else wants to reproduce it (recommended). So both gzip and lzf seem faster than the encoding schemes we currently use.

Here is my code: waveform_tests.txt

The default gzip compression level in h5py is 4 (https://docs.h5py.org/en/stable/high/dataset.html#dataset-compression). I don't know how much faster a lower compression level might be.
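A minimal sketch of how the level can be varied with plain h5py via compression_opts (file name, dataset names, and data are placeholders):

import h5py
import numpy as np

data = np.random.randint(0, 2**14, size=(1000, 1024), dtype=np.uint16)

with h5py.File("gzip_levels.lh5", "w") as f:
    # compression_opts selects the gzip level (0-9); h5py's default is 4
    f.create_dataset("wf_gzip1", data=data, compression="gzip", compression_opts=1)
    f.create_dataset("wf_gzip9", data=data, compression="gzip", compression_opts=9)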

Another thing: do you make sure you run the test scripts multiple times? After the first time the read speed will be much higher, since the file is cached by the filesystem. If you have admin rights, you can manually clean the filesystem cache to ensure reproducibility (I forgot the command).

I'm not sure whether any caching is actually occurring, or whether something else is going on that I don't understand. Maybe the cache is getting cleared faster than I can read the files? Below is the output from repeating the test three times in a simple loop. The read speeds in each loop are nearly identical, although the lzf test with waveform_windowed did speed up on the second and third loops.

I checked this by introducing a sleep of 2 minutes between the for loops, assuming this would be enough time for the cache to be cleared (?). I got roughly the same results as shown below, about 6 seconds (5.4 s, 6.0 s, 7.5 s) to read waveform_windowed with lzf.

I can perform some more rigorous tests if there are suggestions.

Also make sure that Numba cache is enabled, so it does not need to recompile after the first time!

Is there something special I need to do to enable this? I am not very familiar with Numba. I thought this would be handled by whatever functions we have that are using Numba.

Results from gzip compression on NERSC (all Ge channels for one phy raw file, ~3000 events).

removed /global/u1/v/varriano/legend/users/varriano/output_gzip.lh5
output file size: 1674.144667 MB
read radware-sigcompress encoded waveform_windowed  19.97586750984192
read gzipped waveform_windowed  10.338812589645386
read ZigZag encoded waveform_presummed  26.18614959716797
read gzipped waveform_presummed  7.355187892913818

read radware-sigcompress encoded waveform_windowed  19.864821910858154
read gzipped waveform_windowed  10.373951196670532
read ZigZag encoded waveform_presummed  26.007609128952026
read gzipped waveform_presummed  7.246658086776733

read radware-sigcompress encoded waveform_windowed  20.125010013580322
read gzipped waveform_windowed  10.180893898010254
read ZigZag encoded waveform_presummed  26.434375047683716
read gzipped waveform_presummed  6.846473217010498

Results from lzf compression on NERSC (all Ge channels for one phy raw file, ~3000 events).

removed /global/u1/v/varriano/legend/users/varriano/output_lzf.lh5
output file size: 2184.058997 MB
read radware-sigcompress encoded waveform_windowed  20.135162115097046
read lzf waveform_windowed  9.604149341583252
read ZigZag encoded waveform_presummed  26.367269039154053
read lzf waveform_presummed  4.854575872421265

read radware-sigcompress encoded waveform_windowed  20.017733097076416
read lzf waveform_windowed  6.0519983768463135
read ZigZag encoded waveform_presummed  26.321038007736206
read lzf waveform_presummed  4.818136930465698

read radware-sigcompress encoded waveform_windowed  20.06731939315796
read lzf waveform_windowed  5.865992546081543
read ZigZag encoded waveform_presummed  26.535961389541626
read lzf waveform_presummed  4.388535737991333
gipert commented 5 months ago

I'm not sure that there is actually any caching occurring or something else is going on with it that I don't understand. Maybe the cache is getting cleared faster than I can read the files?

The caching is always occurring, and the data is kept there until the cache is full (or I guess a significant amount of time has passed). Since the cache is shared among all processes and users, the time it takes for the file to be removed from the cache can vary a lot, but it's not particularly short, I'd say.

Is there something special I need to do to enable this? I am not very familiar with Numba.

Numba caching is enabled by default in legend-pydataobj, unless overridden by the user.
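As a generic illustration (not lgdo's internal code), this is what enabling the on-disk cache looks like in Numba: with cache=True the compiled machine code is stored on disk, so a fresh Python session does not pay the compilation cost on the first call.

import numpy as np
from numba import njit

@njit(cache=True)
def running_sum(x):
    # trivial placeholder kernel; the point is the cache=True flag above
    total = 0.0
    for v in x:
        total += v
    return total

running_sum(np.arange(10, dtype=np.float64))  # first call compiles (or loads from cache)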

To conclude, let's re-discuss all this when we are again close to re-generating the raw tier. We had to cut the discussion short last time because of time constraints, and we clearly did not study this well enough.

lvarriano commented 4 months ago

See https://legend-exp.atlassian.net/wiki/spaces/LEGEND/pages/991953524/Investigation+of+LH5+File+Structure+and+I+O. script: test_compression_encoding.py.txt
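A minimal sketch of the read-timing loop used in the script (the full version is attached above; the file names and channel list are placeholders):

import time
import lgdo

store = lgdo.lh5.LH5Store()

def time_reads(label, lh5_file, field, channels):
    # time how long it takes to read one waveform field for every channel
    start = time.time()
    for ch in channels:
        obj, _ = store.read(f"{ch}/raw/{field}", lh5_file)
    print(f"read {label} {field} ", time.time() - start)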

input file size (encoded waveforms): 1695.61888 MB

no compression/encoding output file size: 9609.928534 MB
write time:  61.316158294677734
GZIP output file size: 1674.927994 MB
write time:  127.61616349220276
SZIP output file size: 1742.706066 MB
write time:  115.50626015663147
LZF output file size: 2184.357187 MB
write time:  75.75087308883667
LZ4 output file size: 2287.242122 MB
write time:  63.4920859336853
LZ4_nbytes4096 output file size: 2355.047048 MB
write time:  63.97322130203247

read no compression waveform_windowed  2.5585856437683105
read radware-sigcompress encoded waveform_windowed  9.69769835472107
read GZIP waveform_windowed  6.3117835521698
read SZIP waveform_windowed  12.409362554550171
read LZF waveform_windowed  3.7812297344207764
read LZ4 waveform_windowed  3.2639217376708984
read LZ4_nbytes4096 waveform_windowed  3.3276045322418213

read no compression waveform_presummed  1.6638846397399902
read ZigZag encoded waveform_presummed  13.844745635986328
read GZIP waveform_presummed  4.550369501113892
read SZIP waveform_presummed  10.218478441238403
read LZF waveform_presummed  2.895991086959839
read LZ4 waveform_presummed  2.7338368892669678
read LZ4_nbytes4096 waveform_presummed  2.821821689605713

read no compression file all else  3.2617311477661133
read input file all else  3.370767116546631
read GZIP file all else  3.445415735244751
read SZIP file all else  3.6045446395874023
read LZF file all else  3.429582357406616
read LZ4 all else  3.3269314765930176
read LZ4_nbytes4096 all else  3.4018807411193848
iguinn commented 1 month ago

After some profiling, I think the culprit here is awkward array. Here's a test of reading in a raw file:

         85086 function calls (84980 primitive calls) in 6.409 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    6.409    6.409 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/lh5/core.py:18(read)
      9/1    0.001    0.000    6.403    6.403 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/lh5/_serializers/read/composite.py:41(_h5_read_lgdo)
      3/1    0.002    0.001    6.402    6.402 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/lh5/_serializers/read/composite.py:264(_h5_read_table)
        2    0.013    0.006    6.387    3.193 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/lh5/_serializers/read/encoded.py:24(_h5_read_array_of_encoded_equalsized_arrays)
        2    0.000    0.000    6.374    3.187 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/lh5/_serializers/read/encoded.py:44(_h5_read_encoded_array)
        2    0.000    0.000    6.234    3.117 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/compression/generic.py:42(decode)
        2    0.117    0.059    5.245    2.622 /global/u1/i/iguinn/legend_sw/legend-pydataobj/src/lgdo/types/vectorofvectors.py:531(to_aoesa)
    18/10    0.000    0.000    5.002    0.500 /opt/anaconda3/lib/python3.11/site-packages/awkward/_dispatch.py:35(dispatch)
    30/20    0.000    0.000    5.001    0.250 {built-in method builtins.next}
        4    0.000    0.000    4.018    1.004 /opt/anaconda3/lib/python3.11/site-packages/awkward/operations/ak_fill_none.py:17(fill_none)
        2    0.000    0.000    4.018    2.009 /opt/anaconda3/lib/python3.11/site-packages/awkward/operations/ak_fill_none.py:71(_impl)
...

It looks like this line in VectorOfVectors.to_aoesa is causing the slowdown (to_aoesa is taking ~5 s out of ~6 s, and awkward.fill_none is responsible for ~4 s of these ~5 s):

            nda = ak.fill_none(
                ak.pad_none(ak_arr, max_len, clip=True), fill_val
            ).to_numpy(allow_missing=False)

This line is responsible for padding each vector to the same length and copying the result into a numpy array. The problem is that this seems to involve iterating through the arrays in a slow, Python-level way. I think it could be made a lot quicker by constructing the numpy array first and then copying into it; doing this with Numba would be very fast. Awkward array may also have a much more optimized way of doing this.
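A hedged sketch of that idea (not the lgdo implementation), using the flattened_data/cumulative_length layout that VectorOfVectors already stores: preallocate the padded 2D array and fill it with a Numba loop instead of going through ak.pad_none/ak.fill_none.

import numpy as np
from numba import njit

@njit(cache=True)
def _fill_aoesa(flattened_data, cumulative_length, fill_val, out):
    # copy each vector into its pre-allocated row, padding (or clipping) to max_len
    max_len = out.shape[1]
    start = 0
    for i in range(len(cumulative_length)):
        stop = cumulative_length[i]
        n = min(stop - start, max_len)
        out[i, :n] = flattened_data[start : start + n]
        out[i, n:] = fill_val
        start = stop

def vov_to_aoesa(flattened_data, cumulative_length, max_len, fill_val=0):
    # fill_val must be representable in flattened_data's dtype
    out = np.empty((len(cumulative_length), max_len), dtype=flattened_data.dtype)
    _fill_aoesa(flattened_data, cumulative_length, fill_val, out)
    return out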