legend-exp / legend-pydataobj

LEGEND Python Data Objects
https://legend-pydataobj.readthedocs.io
GNU General Public License v3.0

`write` does not respect `codec` #76

Open lvarriano opened 7 months ago

lvarriano commented 7 months ago

If I read an encoded waveform and write it to a new file, it is not written as encoded. (It is written as a gzipped array, but only because `LH5Store.write()` now defaults to compression.) However, it still carries the attributes associated with encoding (`'codec'`, etc.). This is not the behavior a user would expect: reading a data set and then writing it back to file should not change how the data is stored.

This happens because `read` decodes the waveform and stores it in an `ArrayOfEqualSizedArrays`. When `write` encounters this object, it does not encode it, because it is not an `ArrayOfEncodedEqualSizedArrays`.
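The mechanism described above can be sketched with a toy model. This is not lgdo code: `PlainArray`, `EncodedArray`, `toy_read`, and `toy_write` are hypothetical stand-ins that only assume what is stated here, namely that reading decodes the data while keeping the attributes, and that writing picks the storage layout from the object's type alone.

```python
from dataclasses import dataclass, field


@dataclass
class PlainArray:
    """Stand-in for a decoded array object (hypothetical, not an lgdo class)."""
    values: list
    attrs: dict = field(default_factory=dict)


@dataclass
class EncodedArray:
    """Stand-in for an encoded array object (hypothetical, not an lgdo class)."""
    payload: bytes
    attrs: dict = field(default_factory=dict)


def toy_read(stored):
    """Decode on read, but carry the stored attributes over unchanged."""
    if isinstance(stored, EncodedArray):
        return PlainArray(values=list(stored.payload), attrs=dict(stored.attrs))
    return stored


def toy_write(obj):
    """Choose the storage layout from the object's type alone, ignoring attrs."""
    layout = "encoded" if isinstance(obj, EncodedArray) else "plain"
    return {"layout": layout, "attrs": dict(obj.attrs)}


# Round trip: encoded on disk -> decoded in memory -> written back.
on_disk = EncodedArray(payload=bytes([1, 2, 3]), attrs={"codec": "radware_sigcompress"})
in_memory = toy_read(on_disk)
rewritten = toy_write(in_memory)

print(rewritten)  # {'layout': 'plain', 'attrs': {'codec': 'radware_sigcompress'}}
```

In this toy model the round trip ends with a plain layout that still advertises a codec, which is the same mismatch reported in this issue.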

import lgdo
import h5py
import numpy as np

store = lgdo.lh5.LH5Store()

input_file = "/home/lv/Documents/uw/l200/l200-p06-r000-phy-20230619T034203Z-tier_raw.lh5"
output_file = "output.lh5"

ch_list = lgdo.lh5.ls(input_file)[2:] # skip FCConfig and OrcaHeader

# copy data
for ch in ch_list:
    chobj, _ = store.read(f'{ch}/raw/', input_file)
    store.write(chobj, 'raw', output_file, f'{ch}/')

ch = 'ch1027200'

print('load input file with LH5Store')
chobj, _ = store.read(ch+'/raw/', input_file)
print(chobj['waveform_windowed']['values'].attrs)
print(chobj['waveform_windowed']['values'].nda.shape)
print(np.prod(chobj['waveform_windowed']['values'].nda.shape))

print('\nload input file with h5py')
with h5py.File(input_file, mode='r') as f:
    print(f[ch]['raw']['waveform_windowed']['values'].attrs.keys())
    print(f[ch]['raw']['waveform_windowed']['values'].keys())
    print(f[ch]['raw']['waveform_windowed']['values']['encoded_data'].keys())
    print(f[ch]['raw']['waveform_windowed']['values']['encoded_data']['flattened_data'].shape)
    print(f[ch]['raw']['waveform_windowed']['values']['encoded_data']['flattened_data'].compression)

print('\nload output file with LH5Store')
chobj, _ = store.read(ch+'/raw/', output_file)
print(chobj['waveform_windowed']['values'].attrs)
print(chobj['waveform_windowed']['values'].nda.shape)
print(np.prod(chobj['waveform_windowed']['values'].nda.shape))

print('\nload output file with h5py')
with h5py.File(output_file, mode='r') as f:
    print(f[ch]['raw']['waveform_windowed'].keys())
    print(f[ch]['raw']['waveform_windowed']['values'].attrs.keys())
    print(f[ch]['raw']['waveform_windowed']['values'].shape)
    print(f[ch]['raw']['waveform_windowed']['values'].compression)

This gives:

load input file with LH5Store
{'codec': 'radware_sigcompress', 'codec_shift': -32768.0, 'datatype': 'array_of_equalsized_arrays<1,1>{real}'}
(3034, 1400)
4247600

load input file with h5py
<KeysViewHDF5 ['codec', 'codec_shift', 'datatype']>
<KeysViewHDF5 ['decoded_size', 'encoded_data']>
<KeysViewHDF5 ['cumulative_length', 'flattened_data']>
(2427360,)
None

load output file with LH5Store
{'codec': 'radware_sigcompress', 'codec_shift': -32768.0, 'datatype': 'array_of_equalsized_arrays<1,1>{real}'}
(3034, 1400)
4247600

load output file with h5py
<KeysViewHDF5 ['dt', 't0', 'values']>
<KeysViewHDF5 ['codec', 'codec_shift', 'datatype']>
(3034, 1400)
gzip
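The mismatch visible in the output above (a `codec` attribute on a node that has no `encoded_data` child) can be checked mechanically. The following is a minimal sketch: `describe_values_node` is a hypothetical helper, and it assumes only the layout shown in this output, where an encoded `values` node is a group containing `encoded_data` and a non-encoded one is a plain dataset. It relies on duck typing, so the demonstration uses small stand-in classes rather than real files.

```python
def describe_values_node(node):
    """Summarize how a `values` node is stored, given an h5py-like object."""
    has_codec_attr = "codec" in node.attrs.keys()
    # Assumption: group-like nodes expose keys(); dataset-like nodes do not.
    is_group = hasattr(node, "keys")
    stored_encoded = is_group and "encoded_data" in node.keys()
    return {
        "has_codec_attr": has_codec_attr,
        "stored_encoded": stored_encoded,
        "consistent": has_codec_attr == stored_encoded,
    }


class FakeDataset:
    """Stand-in for a dataset: has attrs but no children (hypothetical)."""

    def __init__(self, attrs):
        self.attrs = attrs


class FakeGroup(dict):
    """Stand-in for a group: a mapping of children plus attrs (hypothetical)."""

    def __init__(self, children, attrs):
        super().__init__(children)
        self.attrs = attrs


codec_attrs = {
    "codec": "radware_sigcompress",
    "codec_shift": -32768.0,
    "datatype": "array_of_equalsized_arrays<1,1>{real}",
}

# Mirrors the input file: a group with encoded_data and decoded_size.
input_like = FakeGroup({"decoded_size": None, "encoded_data": None}, codec_attrs)
# Mirrors the output file: a plain dataset that still has the codec attributes.
output_like = FakeDataset(codec_attrs)

print(describe_values_node(input_like))
print(describe_values_node(output_like))
```

With real files, the same function could be applied to `f[ch]['raw']['waveform_windowed']['values']` inside the `h5py.File` blocks of the script above, assuming h5py datasets do not expose a `keys` method.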

gipert commented 7 months ago

Yes, we should discuss what the behavior should be. I was tracking this in #37.