gipert closed this pull request 1 year ago.
Great start.
What about adding some code into `WaveformTable` that converts the values field from aoesa to vov and back?

Also -- for i/o -- what about allowing for a compression field in the lgdo attrs dict? Then read/write_object() can look for that and call the codecs, or send appropriate flags to HDF5 for other compression routines for non-wf fields (when desired).
> What about adding some code into waveform table that converts the values field from aoesa to vov and back?

I just pushed an implementation of compressing/decompressing `WaveformTable`s. How does it look? I've put a for loop there; I'm not sure whether there is a better solution in terms of speed.
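As a reference for what such a conversion amounts to at the numpy level, a hedged sketch (the `pygama.lgdo` constructor signatures are assumed and may differ):

```python
import numpy as np
from pygama import lgdo

# toy ArrayOfEqualSizedArrays: 10 waveforms of 50 samples each
aoesa = lgdo.ArrayOfEqualSizedArrays(
    nda=np.random.randint(0, 100, (10, 50), dtype="uint16")
)

# aoesa -> vov: flatten the rows and record cumulative lengths
nwf, wflen = aoesa.nda.shape
vov = lgdo.VectorOfVectors(
    flattened_data=lgdo.Array(aoesa.nda.flatten()),
    cumulative_length=lgdo.Array(np.arange(1, nwf + 1, dtype="uint32") * wflen),
)

# vov -> aoesa: only valid if all vectors have the same length
lens = np.diff(vov.cumulative_length.nda, prepend=0)
assert np.all(lens == lens[0])
aoesa_back = lgdo.ArrayOfEqualSizedArrays(
    nda=vov.flattened_data.nda.reshape(len(lens), int(lens[0]))
)
```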
> Also -- for i/o -- what about allowing for a compression field in the lgdo attrs dict? Then read/write_object() can look for that and call the codecs, or send appropriate flags to HDF5 for other compression routines for non-wf fields (when desired).

@oschulz told me he already thought about how to encode compressed stuff in LH5, but I don't think it is documented anywhere yet...

@jasondet can you please give this PR a look and let me know whether I'm missing anything important?
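For the record, a hedged sketch of what that attrs-based dispatch could look like at write time (`write_array` and its signature are hypothetical; only the h5py calls are real):

```python
import h5py
import numpy as np

def write_array(name, nda, attrs, lh5_file):
    """Hypothetical dispatcher: route on a 'compression' entry in attrs."""
    codec = attrs.get("compression")
    if codec is None:
        lh5_file.create_dataset(name, data=nda)
    elif codec in ("gzip", "lzf"):
        # built-in HDF5 filters: pass the flag straight through to h5py
        lh5_file.create_dataset(name, data=nda, compression=codec)
    else:
        # custom waveform codec: would call the encoder, then store its output
        raise NotImplementedError(f"custom codec {codec!r}")

with h5py.File("example.lh5", "w") as f:
    write_array("waveform/values", np.zeros((10, 100), "int16"),
                {"compression": "gzip"}, f)
```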
> @oschulz told me he already thought about how to encode compressed stuff in LH5, but I don't think it is documented anywhere yet...

I'll dust it off and document it ... it's here:

The datatype string is `"array_of_encoded_arrays<1,$N>$(_inner_datatype_to_string(T))"`.
@oschulz I don't see that datatype documented here. What does the `$N` mean? How is `array_of_encoded_arrays` fundamentally different from `vector_of_vectors`?
> @oschulz I don't see that datatype documented here

Because I still have to add it there - sorry!

> How is array_of_encoded_arrays fundamentally different from vector_of_vectors?

It's technically different in that it has datasets `encoded_data` and `decoded_size` and an attribute `codec`. Semantically, it also represents a vector of vectors, just stored in compressed form.

When decompression happens is up to the application. In Julia we'll try to decompress late/lazily, so we can keep more data in memory. Also, one day we can maybe even move the compressed arrays to the GPU and decompress there; that would enable us to push about three times the number of waveforms from CPU to GPU.
Thanks @oschulz, yes please add it when you get a chance. If there is anything else undocumented that needs to be added, let's do so. Then I propose we finally resolve this versioning issue and tag legend-data-format-specs.

Back to `array_of_encoded_arrays` -- is `decoded_size` sufficient? It would be good to know when allocating memory for the decoded data whether one can make an aoesa instead of vov. Maybe we should have `decoded_shape` or `decoded_datatype` or some combination instead? For that matter, is the name `array_of_encoded_arrays` even accurate? The encoded data is not necessarily an array but could be an aoesa or a vov. And the object itself -- is it an array?

Finally, how does the `codec` attribute get used? Is it just a string?
> @oschulz I don't see that datatype documented here

Let's have the spec completed and tagged next week along with this PR. I can also help writing.

> It would be good to know when allocating memory for the decoded data whether one can make an aoesa instead of vov.

We should absolutely allow for this!

> How does the codec attribute get used? Is it just a string?

I would say so? We have to be explicit and flexible enough to allow any alternative compression algorithm in the future.
> If there is anything else undocumented that needs to be added, let's do so.

Yes, I also need to document how we store histograms.

> It would be good to know when allocating memory for the decoded data whether one can make an aoesa instead of vov

That is application dependent, of course. On the file level, there are just a very few big, flat datasets. In the Julia implementation, both unencoded and encoded arrays of arrays are backed by large contiguous blocks of memory, and compression/decompression doesn't allocate per entry, but globally. `decoded_size` is sufficient for this, as long as we have only vectors/arrays of vectors. I don't see us using compressed arrays of multi-dim arrays in the foreseeable future.
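A minimal sketch of that global-allocation idea (my own illustration, plain numpy, no real API):

```python
import numpy as np

# with decoded_size known up front, allocate one contiguous decode
# buffer for all waveforms instead of one array per entry
decoded_size = np.full(1000, 1000, dtype="uint32")    # e.g. 1000 wfs x 1000 samples
flat = np.empty(int(decoded_size.sum()), dtype="uint16")
offsets = np.concatenate(([0], np.cumsum(decoded_size)))
# decoded waveform i is then the zero-copy view flat[offsets[i]:offsets[i+1]]
```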
> can make an aoesa instead of vov.

That we should handle in the datatype attribute, I think. Let me think about how to do that best.

> Finally, how does the codec attribute get used? Is it just a string?

Yes, it's just a string, for the whole vector of compressed vectors. The only one currently implemented in Julia is `"radware_sigcompress_shift"`.
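Concretely, that presumably means something like the following attribute dict on the encoded dataset (my instantiation of the template above; the exact datatype spelling was still being discussed at this point):

```python
# hypothetical attributes of an encoded "values" dataset
attrs = {
    "datatype": "array_of_encoded_arrays<1,1>{real}",  # assumed instantiation
    "codec": "radware_sigcompress_shift",
}
```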
Patch coverage: 64.53% and project coverage change: +0.75% :tada:

Comparison is base (9d43801) 47.13% compared to head (023a103) 47.88%.

:umbrella: View full report in Codecov by Sentry.
I added a draft of `VectorOfEncodedVectors`, based on the LH5 `array<1>{encoded_array<1>{real}}` specification. This is how it should look inside a `WaveformTable` in a file:

── waveform · table{t0,dt,values}
   ├── dt · array<1>{real}
   ├── t0 · array<1>{real}
   └── values · array<1>{encoded_array<1>{real}}
       ├── decoded_size · array<1>{ntuple{real}}
       └── encoded_data · array<1>{array<1>{real}}
           ├── cumulative_length · array<1>{real}
           └── flattened_data · array<1>{real}
I'll adapt the compression/decompression code to deal with this LGDO tomorrow. Maybe @jasondet could have a look at the structure in the meanwhile?
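For orientation, a hedged sketch of what the in-memory counterpart of that structure could look like (the `VectorOfEncodedVectors` constructor signature is assumed from the field names above and may differ from the draft):

```python
import numpy as np
from pygama import lgdo

# hedged sketch: two encoded waveforms, 40 and 60 bytes long, stored
# back to back; each decodes to 1000 samples
voev = lgdo.VectorOfEncodedVectors(
    encoded_data=lgdo.VectorOfVectors(
        flattened_data=lgdo.Array(np.zeros(100, dtype="ubyte")),
        cumulative_length=lgdo.Array(np.array([40, 100], dtype="uint32")),
    ),
    decoded_size=lgdo.Array(np.full(2, 1000, dtype="uint32")),
    attrs={"codec": "radware_sigcompress"},
)
```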
Some benchmarking of our base-128 encoding/decoding of 1000 arrays of 1000 `uint16`. I'm directly calling the low-level Numba-accelerated routine:
>>> import numpy as np
>>> from pygama.lgdo.compression import varlen
>>> sig_in = np.random.randint(0, 10000, (1000, 1000), dtype="uint16")
>>> sig_out = np.empty((1000, 1000 * 5), dtype="ubyte")
>>> nbytes = np.empty(1000, dtype="uint32")
>>> %timeit varlen.uleb128_zigzag_diff_array_encode(sig_in, sig_out, nbytes)
3.64 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> sig_in_dec = np.empty((1000, 1000), dtype="uint16")
>>> siglen = np.empty(1000, dtype="uint32")
>>> %timeit varlen.uleb128_zigzag_diff_array_decode(sig_out, nbytes, sig_in_dec, siglen)
10.6 ms ± 704 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
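(For scale: each run processes 10^6 samples, so the 3.64 ms encode corresponds to roughly 2.7 × 10^8 samples/s and the 10.6 ms decode to roughly 9.4 × 10^7 samples/s.)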
I have the feeling it's slower than what we would expect from such a simple encoder.
@oschulz to understand if this is the case, it would be great if you could post here the results of benchmarking the Julia implementation on the same data.
> you could post here the results of benchmarking the Julia implementation on the same data

Here's the same (I think) in Julia:
using EncodedArrays, LegendDataTypes, ArraysOfArrays
using BenchmarkTools
original = [rand(UInt16.(1:10000), 1000) for i in 1:1000]
encoded = [zeros(UInt8, 5*1000) for i in 1:1000]
codec = RadwareSigcompress(UInt16)
EncodedArrays.encode_data!.(encoded, codec, original)
@benchmark EncodedArrays.encode_data!.($encoded, $codec, $original)
results in
BenchmarkTools.Trial: 626 samples with 1 evaluation.
Range (min … max): 7.091 ms … 17.316 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.908 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.994 ms ± 742.334 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
So it's very comparable. `VarlenDiffArrayCodec` currently takes about 20 ms for the same in Julia, but the implementation does unnecessary array allocations/copies; I have to clean that up.

Decoding `RadwareSigcompress` (much more important than encoding for us, performance-wise) is about 5 ms here, so a throughput of about 2×10^8 samples per second.
Another interesting test at LNGS, reading uncompressed data:
>>> %timeit -r1 -n1 store.read_object("ch1027201/raw/waveform/values", "/data2/public/prodenv/prod-orig/archive/raw-v01.00/generated/tier/raw/phy/p03/r001/l200-p03-r001-phy-20230322T160139Z-tier_raw.lh5")
397 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
>>> %timeit -r1 -n1 store.read_object("ch1027201/raw/waveform/values", "/data2/public/prodenv/prod-orig/archive/raw-v01.00/generated/tier/raw/phy/p03/r001/l200-p03-r001-phy-20230322T160139Z-tier_raw.lh5")
34.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Wow! Filesystem caching makes a huge difference! For comparison, it takes ~40 ms and ~13 ms respectively on my laptop's flash drive.
Let's now see how long it takes to decompress the same (previously compressed) data:
>>> from pygama.lgdo.compression import varlen
>>> wf_values, _ = store.read_object("ch1027201/raw/waveform/values", "/data2/public/prodenv/prod-orig/archive/raw-v01.00/generated/tier/raw/phy/p03/r001/l200-p03-r001-phy-20230322T160139Z-tier_raw.lh5")
>>> enc_data = varlen.encode(wf_values)
>>> %timeit varlen.decode(enc_data)
162 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The conclusion seems to be that, at least with my implementation of the ZZ codec, reading waveforms would be substantially slower than without compression only if the data is cached at the filesystem level.

Note 1: reading compressed data should take less than those ~0.4 s, since there is less data to read. Note 2: read speeds might be "inflated" until the LH5 file re-packing issue is investigated and fixed (@jasondet, we should maybe open an issue about that).
@oschulz I'm not showing David's code benchmarks here because I have to re-implement it to use Numba's `@guvectorize` (to enable broadcasting on arrays of arrays). The base-128 codec is already implemented like that.
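As an illustration of the `@guvectorize` pattern (a toy kernel, not the actual pygama routine): the core function operates on a single waveform, and Numba broadcasts it over any leading dimensions.

```python
import numpy as np
from numba import guvectorize

# toy per-waveform core; "(n)->(n)" makes numba broadcast it over the
# leading dimensions of the input, e.g. a (nwf, nsamples) array
@guvectorize(["void(int16[:], int16[:])"], "(n)->(n)", nopython=True)
def diff_wf(wf_in, wf_out):
    wf_out[0] = wf_in[0]
    for i in range(1, wf_in.shape[0]):
        wf_out[i] = wf_in[i] - wf_in[i - 1]

wfs = np.random.randint(0, 10000, (1000, 1000)).astype("int16")
enc = diff_wf(wfs)  # applied to all 1000 waveforms at once
```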
But I also noticed that David's codec seems to be (unexpectedly?) faster.
> But I also noticed that David's codec seems to be (unexpectedly?) faster.

That's strange - the Julia and Numba ports should have basically the same performance as the C version.

In the end it's not a problem for now, though: that was the single-core decompression performance, and total decompression on a serious machine is still way faster than we can feed data in over a 2x10 Gbit network or so.
Set default compression to `None`, such that this PR will not alter current pygama behavior. Preparing for merge.
This PR adds the code to configure HDF5 built-in and custom waveform compression. The default behavior is to NOT compress, as it is right now in pygama. Compression will be turned on in a future release.
I have documented the API in the docstrings. A manual page will be written in the future.
List of changes:

- `radware_sigcompress` code from attic, with unit test
- `WaveformTable` compression (`sint16` waveforms)
- `VectorOfEncodedVectors` and `ArrayOfEncodedEqualSizedArrays` LGDOs
- `VectorOfVectors`: remove dangerous `VOV.set_vector()`
- `uint16` -> `[uint8, uint8]` (big endian)
- `LH5Store.write_object()`/`LH5Store.read_object()`
- `@guvectorize()` `_radware_sigcompress_*()` ~ future PR
- `LGDO.encode()`/`LGDO.decode()`? ~ future PR
- `compression` attribute in `write_object()`/`read_object()`
- fix `read_object()` reading-and-decompressing-from-multiple-files buffer bug
- `"RadwareSigcompress(codec_shift=-32768)"` as code (for the buffer processing JSON config)
- `fc_decoded_values` with reasonable compression settings
- rename `RadwareSigcompress` to `RadwareSigcompress16`, implement `RadwareSigcompress32` (32-bit version)
- `Diff` encoder to store waveform differences (a sketch follows below). Will be interesting to test HDF5 compression on top.
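A minimal illustration of that last idea (my own sketch, not this PR's code): store the first sample plus sample-to-sample differences; decoding is a cumulative sum, and the smaller-magnitude differences should compress better under generic HDF5 filters.

```python
import numpy as np

def diff_encode(wf):
    # keep the first sample, then store differences; a wider signed
    # type avoids overflow (uint16 differences can exceed int16 range)
    enc = np.empty(wf.shape, dtype="int32")
    enc[..., 0] = wf[..., 0]
    enc[..., 1:] = np.diff(wf.astype("int32"), axis=-1)
    return enc

def diff_decode(enc):
    return np.cumsum(enc, axis=-1)  # inverse of the diff

wfs = np.random.randint(0, 10000, (1000, 1000), dtype="uint16")
assert np.array_equal(diff_decode(diff_encode(wfs)), wfs)
```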