legend-exp / pygama

Python package for data processing and analysis
https://pygama.readthedocs.io
GNU General Public License v3.0

Preparations for waveform and HDF5 built-in compression #442

Closed · gipert closed 1 year ago

gipert commented 1 year ago

This PR adds the code to configure HDF5 built-in and custom waveform compression. The default behavior is to NOT compress, as it is right now in pygama. Compression will be turned on in a future release.

I have documented the API in the docstrings. A manual page will be written in the future.

List of changes:

jasondet commented 1 year ago

Great start.

What about adding some code to the waveform table that converts the values field from aoesa to vov and back?
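
For illustration, here is a minimal numpy sketch of what such an aoesa ↔ vov round trip boils down to, i.e. going from a rectangular block of waveforms to a flattened_data/cumulative_length pair and back (the helper names are made up; this is not the pygama WaveformTable API):

```python
import numpy as np

def aoesa_to_vov(values: np.ndarray):
    """Flatten a rectangular (n_wf, wf_len) block into vov-style buffers."""
    n_wf, wf_len = values.shape
    flattened_data = values.reshape(n_wf * wf_len)
    # end offset of each vector inside flattened_data
    cumulative_length = np.arange(1, n_wf + 1, dtype="uint32") * wf_len
    return flattened_data, cumulative_length

def vov_to_aoesa(flattened_data: np.ndarray, cumulative_length: np.ndarray):
    """Inverse conversion; only valid if all vectors have the same length."""
    wf_len = int(cumulative_length[0])
    return flattened_data.reshape(len(cumulative_length), wf_len)

values = np.random.randint(0, 2**14, (4, 8), dtype="uint16")
fd, cl = aoesa_to_vov(values)
assert np.array_equal(vov_to_aoesa(fd, cl), values)
```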

jasondet commented 1 year ago

Also -- for I/O -- what about allowing for a compression field in the lgdo attrs dict? Then read/write_object() could look for it and either call the codecs or send the appropriate flags to HDF5 for other compression routines on non-waveform fields (when desired).
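
As a hedged sketch of what this could look like on the HDF5 side: plain h5py already lets a writer forward a compression flag to the built-in filters, so a hypothetical "compression" entry in the attrs dict could be consumed as below (write_array and the attrs key are illustrative, not the actual pygama read/write_object API):

```python
import h5py
import numpy as np

def write_array(group: h5py.Group, name: str, data: np.ndarray, attrs: dict) -> None:
    comp = attrs.get("compression")  # hypothetical attrs key
    if comp in (None, "gzip", "lzf"):
        # HDF5 built-in filters: simply forward the flag to h5py
        group.create_dataset(name, data=data, compression=comp)
    else:
        # anything else would be dispatched to a custom waveform codec
        # (e.g. the radware/uleb128 routines from this PR) before writing
        raise NotImplementedError(f"custom codec {comp!r} not wired up in this sketch")
    # store the remaining attributes on the dataset itself
    for key, val in attrs.items():
        if key != "compression":
            group[name].attrs[key] = val

with h5py.File("demo.lh5", "w") as f:
    write_array(f, "baseline", np.arange(1000, dtype="uint16"),
                {"datatype": "array<1>{real}", "compression": "gzip"})
```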

gipert commented 1 year ago

What about adding some code to the waveform table that converts the values field from aoesa to vov and back?

I just pushed an implementation of compressing/decompressing WaveformTables. How does it look? I've put a for loop there; I'm not sure whether there is a faster solution.

Also -- for I/O -- what about allowing for a compression field in the lgdo attrs dict? Then read/write_object() could look for it and either call the codecs or send the appropriate flags to HDF5 for other compression routines on non-waveform fields (when desired).

@oschulz told me he already thought about how to encode compressed stuff in LH5, but I don't think it's documented anywhere yet...

@jasondet can you please give this PR a look and let me know whether I'm missing anything important?

oschulz commented 1 year ago

@oschulz told me he already thought about how to encode compressed stuff in LH5, but I don't think it's documented anywhere yet...

I'll dust it off and document it ... it's here:

https://github.com/legend-exp/LegendHDF5IO.jl/blob/3c07adc8d23875d9f107b912d5e3902ab56124b0/src/generic_io.jl#L507

The datatype string is "array_of_encoded_arrays<1,$N>$(_inner_datatype_to_string(T))".

jasondet commented 1 year ago

@oschulz I don't see that datatype documented here. What does the $N mean? How is array_of_encoded_arrays fundamentally different from vector_of_vectors?

oschulz commented 1 year ago

@oschulz I don't see that datatype documented here

Because I still have to add it there - sorry!

How is array_of_encoded_arrays fundamentally different from vector_of_vectors?

It's technically different in that it has datasets encoded_data and decoded_size and an attribute codec. Semantically, it also represents a vector of vectors, just stored in compressed form.

When decompression happens is up to the application. In Julia we'll try to decompress late/lazily, so we can keep more data in memory. Also, one day we can maybe even move the compressed arrays to the GPU and decompress there; that would enable us to push about three times as many waveforms from CPU to GPU.

jasondet commented 1 year ago

Thanks @oschulz, yes, please add it when you get a chance. If there is anything else undocumented that needs to be added, let's do so. Then I propose we finally resolve this versioning issue and tag legend-data-format-specs.

Back to array_of_encoded_arrays -- is decoded_size sufficient? It would be good to know when allocating memory for the decoded data whether one can make an aoesa instead of vov. Maybe we should have decoded_shape or decoded_datatype or some combination instead? For that matter, is the name array_of_encoded_arrays even accurate? The encoded data is not necessarily an array but could be an aoesa or a vov. And the object itself -- is it an array?

Finally, how does the codec attribute get used? Is it just a string?

gipert commented 1 year ago

@oschulz I don't see that datatype documented here

Let's have the spec completed and tagged next week along with this PR. I can also help with the writing.

It would be good to know when allocating memory for the decoded data whether one can make an aoesa instead of vov.

We should absolutely allow for this!

how does the codec attribute get used? Is it just a string?

I would say so? We have to be explicit and flexible enough to allow any alternative compression algorithm in the future.

oschulz commented 1 year ago

If there is anything else undocumented that needs to be added, let's do so.

Yes, I also need to document how we store histograms.

oschulz commented 1 year ago

It would be good to know when allocating memory for the decoded data whether one can make an aoesa instead of vov

That is application-dependent, of course. On the file level, there are just a very few big, flat datasets. In the Julia implementation, both unencoded and encoded arrays of arrays are backed by large contiguous blocks of memory, and compression/decompression doesn't allocate per entry, but globally. decoded_size is sufficient for this, as long as we have only vectors/arrays of vectors. I don't see us using compressed arrays of multi-dim-arrays in the foreseeable future.

can make an aoesa instead of vov.

That we should handle in the datatype attribute, I think. Let me think about how to do that best.

Finally, how does the codec attribute get used? Is it just a string?

Yes, it's just a string, for the whole vector of compressed vectors. The only one currently implemented in Julia is "radware_sigcompress_shift".
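
For illustration only, a tiny Python sketch (not the actual pygama or LegendHDF5IO.jl design) of how such a plain codec string could be dispatched through a registry, which keeps the format open to alternative algorithms later on:

```python
import numpy as np

_CODECS = {}  # hypothetical registry: codec name -> (encode, decode) callables

def register_codec(name, encode, decode):
    _CODECS[name] = (encode, decode)

def decode_with(codec_name: str, encoded: np.ndarray) -> np.ndarray:
    """Look up the codec named by the LH5 'codec' attribute and decode with it."""
    if codec_name not in _CODECS:
        raise KeyError(f"unknown codec attribute: {codec_name!r}")
    return _CODECS[codec_name][1](encoded)

# toy "identity" codec, just to exercise the dispatch
register_codec("identity", lambda a: a, lambda a: a)
print(decode_with("identity", np.arange(4, dtype="uint16")))
```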

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 64.53% and project coverage change: +0.75% :tada:

Comparison is base (9d43801) 47.13% compared to head (023a103) 47.88%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #442      +/-   ##
==========================================
+ Coverage   47.13%   47.88%   +0.75%
==========================================
  Files          97      104       +7
  Lines       11528    12289     +761
==========================================
+ Hits         5434     5885     +451
- Misses       6094     6404     +310
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/pygama/dsp/processing_chain.py | 81.59% <ø> (ø) | |
| src/pygama/dsp/processors/upsampler.py | 10.63% <ø> (ø) | |
| src/pygama/raw/__init__.py | 100.00% <ø> (ø) | |
| ...ygama/raw/buffer_processor/lh5_buffer_processor.py | 88.31% <ø> (ø) | |
| src/pygama/raw/data_streamer.py | 71.33% <ø> (ø) | |
| src/pygama/raw/raw_buffer.py | 92.02% <ø> (ø) | |
| src/pygama/raw/data_decoder.py | 68.53% <18.18%> (-6.15%) | :arrow_down: |
| src/pygama/lgdo/compression/radware.py | 21.53% <21.53%> (ø) | |
| src/pygama/lgdo/compression/varlen.py | 49.60% <49.60%> (ø) | |
| src/pygama/cli.py | 68.10% <50.00%> (+0.27%) | :arrow_up: |
| ... and 24 more | | |


gipert commented 1 year ago

I added a draft of VectorOfEncodedVectors, based on the LH5 array<1>{encoded_array<1>{real}} specification. This is how it should look inside a WaveformTable in a file:

── waveform · table{t0,dt,values}
   ├── dt · array<1>{real}
   ├── t0 · array<1>{real}
   └── values · array<1>{encoded_array<1>{real}}
       ├── decoded_size · array<1>{ntuple{real}}
       └── encoded_data · array<1>{array<1>{real}}
           ├── cumulative_length · array<1>{real}
           └── flattened_data · array<1>{real}

I'll adapt the compression/decompression code to deal with this LGDO tomorrow. Maybe @jasondet could have a look at the structure in the meantime?
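
To make the tree above concrete, here is a hedged h5py sketch that writes the values subgroup of that layout by hand (the file name, codec attribute value and toy buffers are purely illustrative; this is not pygama's LH5Store code):

```python
import h5py
import numpy as np

# toy stand-ins: 2 encoded waveforms packed into one flat byte buffer
flattened_data = np.array([1, 2, 3, 4, 5, 6], dtype="ubyte")
cumulative_length = np.array([2, 6], dtype="uint32")   # end offsets of the encoded wfs
decoded_size = np.array([1000, 1000], dtype="uint32")  # samples per decoded waveform

with h5py.File("wf_demo.lh5", "w") as f:
    values = f.create_group("waveform/values")
    values.attrs["datatype"] = "array<1>{encoded_array<1>{real}}"
    values.attrs["codec"] = "uleb128_zigzag_diff"  # illustrative codec identifier
    values.create_dataset("decoded_size", data=decoded_size)
    enc = values.create_group("encoded_data")
    enc.attrs["datatype"] = "array<1>{array<1>{real}}"
    enc.create_dataset("cumulative_length", data=cumulative_length)
    enc.create_dataset("flattened_data", data=flattened_data)
```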

gipert commented 1 year ago

Some benchmarking of our base-128 encoding/decoding of 1000 arrays of 1000 uint16. I'm directly calling the low-level Numba-accelerated routine:

>>> import numpy as np
>>> from pygama.lgdo.compression import varlen

>>> # 1000 waveforms of 1000 uint16 samples each
>>> sig_in = np.random.randint(0, 10000, (1000, 1000), dtype="uint16")
>>> # output buffer (generously sized at 5 bytes per sample) and per-waveform byte counts
>>> sig_out = np.empty((1000, 1000 * 5), dtype="ubyte")
>>> nbytes = np.empty(1000, dtype="uint32")
>>> %timeit varlen.uleb128_zigzag_diff_array_encode(sig_in, sig_out, nbytes)
3.64 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> # decode back into a preallocated buffer, recording each waveform's length
>>> sig_in_dec = np.empty((1000, 1000), dtype="uint16")
>>> siglen = np.empty(1000, dtype="uint32")
>>> %timeit varlen.uleb128_zigzag_diff_array_decode(sig_out, nbytes, sig_in_dec, siglen)
10.6 ms ± 704 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I have the feeling it's slower than what we would expect from such a simple encoder.

@oschulz to understand if this is the case, it would be great if you could post here the results of benchmarking the Julia implementation on the same data.

oschulz commented 1 year ago

you could post here the results of benchmarking the Julia implementation on the same data

Here's the same (I think) in Julia:

using EncodedArrays, LegendDataTypes, ArraysOfArrays
using BenchmarkTools

original = [rand(UInt16.(1:10000), 1000) for i in 1:1000]
encoded = [zeros(UInt8, 5*1000) for i in 1:1000]

codec = RadwareSigcompress(UInt16)
EncodedArrays.encode_data!.(encoded, codec, original)
@benchmark EncodedArrays.encode_data!.($encoded, $codec, $original)

results in

BenchmarkTools.Trial: 626 samples with 1 evaluation.
 Range (min … max):  7.091 ms …  17.316 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.908 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.994 ms ± 742.334 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

So it's very comparable. VarlenDiffArrayCodec currently takes about 20 ms for the same in Julia, but the implementation does unnecessary array allocations/copies; I have to clean that up.

Decoding RadwareSigcompress (much more important than encoding for us, performance-wise) is about 5 ms here, so a throughput of about 2*10^8 samples per second.

gipert commented 1 year ago

Another interesting test at LNGS, reading uncompressed data:

>>> # first read: cold filesystem cache
>>> %timeit -r1 -n1 store.read_object("ch1027201/raw/waveform/values", "/data2/public/prodenv/prod-orig/archive/raw-v01.00/generated/tier/raw/phy/p03/r001/l200-p03-r001-phy-20230322T160139Z-tier_raw.lh5")
397 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

>>> # same read again: the file is now in the filesystem cache
>>> %timeit -r1 -n1 store.read_object("ch1027201/raw/waveform/values", "/data2/public/prodenv/prod-orig/archive/raw-v01.00/generated/tier/raw/phy/p03/r001/l200-p03-r001-phy-20230322T160139Z-tier_raw.lh5")
34.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Wow! Filesystem caching makes a huge difference! For comparison, it takes ~40 ms and ~13 ms, respectively, on my laptop's flash drive.

Let's now see how long it takes to decompress the same data (compressed beforehand):

>>> from pygama.lgdo.compression import varlen
>>> wf_values, _ = store.read_object("ch1027201/raw/waveform/values", "/data2/public/prodenv/prod-orig/archive/raw-v01.00/generated/tier/raw/phy/p03/r001/l200-p03-r001-phy-20230322T160139Z-tier_raw.lh5")
>>> enc_data = varlen.encode(wf_values)

>>> %timeit varlen.decode(enc_data)
162 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The conclusion seems to be that, at least with my implementation of the ZZ (zigzag) codec, reading compressed waveforms would be substantially slower than reading uncompressed ones only when the data is already cached at the filesystem level.

Note 1: reading compressed data should take less than those ~0.4 s, since there is less data to read.
Note 2: read speeds might be "inflated" until the LH5 file re-packing issue is investigated and fixed (@jasondet, we should maybe open an issue about that).

gipert commented 1 year ago

@oschulz I'm not showing benchmarks of David's code here because I still have to re-implement it with Numba's @guvectorize (to enable broadcasting over arrays of arrays). The Base-128 codec is already implemented like that.
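
For context, a minimal sketch of the @guvectorize pattern referred to here: a kernel written for a single waveform that Numba then broadcasts over the leading axis of a 2D array. The transform is a toy zigzag-diff, not David's codec nor the actual pygama implementation:

```python
import numpy as np
from numba import guvectorize

@guvectorize(["void(uint16[:], int32[:])"], "(n)->(n)")
def zigzag_diff(wf, out):
    """Zigzag-encode the sample-to-sample differences of one waveform."""
    prev = np.int64(0)
    for i in range(wf.shape[0]):
        cur = np.int64(wf[i])
        d = cur - prev                 # difference w.r.t. the previous sample
        out[i] = (d << 1) ^ (d >> 63)  # zigzag: map signed diff to non-negative
        prev = cur

wfs = np.random.randint(0, 10000, (1000, 1000), dtype="uint16")
enc = zigzag_diff(wfs)  # broadcasts over the first axis -> shape (1000, 1000)
```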

But I also noticed that David's codec seems to be (unexpectedly?) faster.

oschulz commented 1 year ago

But I also noticed that David's codec seems to be (unexpectedly?) faster.

That's strange - the Julia and Numba ports should have basically the same performance as the C version.

In the end it's not a problem for now, though: that was single-core decompression performance. On a serious machine, total decompression is still way faster than the rate at which we can feed data in over a 2x10 Gbit network or so.

gipert commented 1 year ago

Set default compression to None, such that this PR will not alter current pygama behavior. Preparing for merge.