next-exp / nexus

Geant4 simulation framework of the NEXT Collaboration

Change string to int in output file #213

Closed paolafer closed 9 months ago

paolafer commented 10 months ago

This PR adds the possibility of saving strings as int in the h5 file, to reduce the output size. The strings in question are, in the particles table, the initial/final volume of a particle, its creator and final processes, and its name, and, in the hits table, the label of the hits. The other strings saved in the output files appear in far fewer rows and do not affect performance.

A table is added with the one-to-one string --> int correspondence for all the strings that are converted to int.

While this will be the default behaviour from now on, it is still possible to save strings, for debugging or small studies, via the /nexus/persistency/save_strings configuration parameter.
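
For downstream readers, a minimal sketch of how the mapping table could be used to recover the original strings. The table key 'MC/string_map' and its column names 'code' and 'name' below are placeholders for illustration, not necessarily the names introduced by this PR:

import pandas as pd

filename = 'nexus_output.next.h5'                   # placeholder file name
hits     = pd.read_hdf(filename, 'MC/hits')
str_map  = pd.read_hdf(filename, 'MC/string_map')   # hypothetical key of the string <-> int table

# Build a code -> string dictionary and decode the int-encoded label column.
decode = dict(zip(str_map['code'], str_map['name']))
hits['label'] = hits['label'].map(decode)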

kvjmistry commented 10 months ago

First test: Running NEXT100_muons.init.mac with 300 events.

Test 1: /nexus/persistency/save_strings true. Files are saved with strings:

event_id | x | y | z | time | energy | label | particle_id | hit_id
0 | 0 | 389.886261 | -307.924469 | 1016.491333 | 2.636023 | 0.008621 | ACTIVE | 1 | 0

File size: 2.1 GB

Test 2: /nexus/persistency/save_strings false. Files are saved without strings:

event_id | x | y | z | time | energy | label | particle_id | hit_id
0 | 0 | 389.886261 | -307.924469 | 1016.491333 | 2.636023 | 0.008621 | 46 | 1 | 0

File size: 2.1 GB

Strangely, the file size is still the same?

paolafer commented 10 months ago

Mmm, I used int32 to store the values; maybe I can try int16? int8 seemed a bit too small.
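
As a rough sketch of how much the integer width alone can matter, assuming the hits table is loaded with pandas and the label column is still stored as strings (file name is a placeholder):

import pandas as pd
import numpy as np

hits = pd.read_hdf('./Next100Muons_example.next.h5', 'MC/hits')  # placeholder file name

# Bytes taken by one integer label column, per dtype, versus the current string column.
n_rows = len(hits)
for dtype in ('int32', 'int16', 'int8'):
    print(dtype, n_rows * np.dtype(dtype).itemsize / 1e6, 'MB')
print('string labels (in memory):', hits['label'].memory_usage(deep=True) / 1e6, 'MB')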

kvjmistry commented 10 months ago

Can you test this bit of code too? Just by reading and rewriting the file, the size drops from 2.1 GB to 400 MB:

import pandas as pd

input_file = './Next100Muons_example.next.h5'
part   = pd.read_hdf(input_file, 'MC/particles')
config = pd.read_hdf(input_file, 'MC/configuration')
hits   = pd.read_hdf(input_file, 'MC/hits')
# sns_position = pd.read_hdf(input_file, 'MC/sns_position')
sns_response = pd.read_hdf(input_file, 'MC/sns_response')

# Open the HDF5 file in write mode
with pd.HDFStore("./Next100Muons_example_rewrite.next.h5", mode='w') as store:
    # Write each DataFrame to the file with a unique key
    store.put('config', config)
    store.put('parts', part)
    store.put('hits', hits)
    # store.put('sns_position', sns_position, format='table')
    store.put('sns_response', sns_response)

Maybe there is some data being stored that we don't use? I don't know what would cause this.
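
One way to check where the bytes actually go, and whether any table is compressed, is to walk the file with h5py (assuming h5py is available; the file name is a placeholder):

import h5py

def report(name, obj):
    # Print shape, dtype, compression filter and on-disk storage for every dataset.
    if isinstance(obj, h5py.Dataset):
        print(f'{name:35s} shape={obj.shape} dtype={obj.dtype} '
              f'compression={obj.compression} '
              f'storage={obj.id.get_storage_size() / 1e6:.1f} MB')

with h5py.File('./Next100Muons_example.next.h5', 'r') as f:
    f.visititems(report)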

paolafer commented 10 months ago

Yes, the same happens to me. Maybe @jmbenlloch can help us out here!

paolafer commented 10 months ago

We have looked at the files and it turns out that the rewritten file has a compression level of 1, which explains why it is much smaller than the original one, which is not compressed. We decided not to compress h5 files at the nexus level, because the writing time increased enormously. Instead, we decided to compress the files after their creation, which is much faster.
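
For reference, a minimal sketch of compressing a file after its creation by rewriting it with pandas and explicit zlib compression (complevel and complib are standard HDFStore options; file names are placeholders):

import pandas as pd

infile  = './Next100Muons_example.next.h5'
outfile = './Next100Muons_example_zlib.next.h5'

tables = {
    'config':       pd.read_hdf(infile, 'MC/configuration'),
    'parts':        pd.read_hdf(infile, 'MC/particles'),
    'hits':         pd.read_hdf(infile, 'MC/hits'),
    'sns_response': pd.read_hdf(infile, 'MC/sns_response'),
}

# complevel 1-9 trades writing time for size; 'zlib' is the portable default compressor.
with pd.HDFStore(outfile, mode='w', complevel=5, complib='zlib') as store:
    for key, df in tables.items():
        store.put(key, df)

Alternatively, the ptrepack utility shipped with PyTables can repack an existing file with a chosen compression filter without going through pandas.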

paolafer commented 10 months ago

I've tried the same as you and I also find no difference in the size of the files. Changing from int32 to int16 or int8 makes very little difference, too: the file goes from ~4.3 GB to 3.9 GB.

What changes significantly is eliminating the string columns completely: I find a reduction from 4.3 GB to 0.7 GB. I recall that you mentioned removing the columns in your first study; had you also tried using int instead?
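
For what it's worth, the analysis-side equivalent of the int encoding can be sketched with pandas.factorize; this is only an illustration of the idea, not the nexus implementation:

import pandas as pd

hits = pd.read_hdf('./Next100Muons_example.next.h5', 'MC/hits')  # placeholder file name

# Replace the string labels by small integer codes plus a code -> string lookup table.
codes, uniques = pd.factorize(hits['label'])
hits['label'] = codes.astype('int32')
string_map = pd.DataFrame({'code': range(len(uniques)), 'name': uniques})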

kvjmistry commented 10 months ago

I can try and re-run some files with the small step size. I was dropping the column before rather than replacing it with an int.

paolafer commented 9 months ago

@kvjmistry, how did your tests go? Can we approve this PR?

kvjmistry commented 9 months ago

Hi @paolafer

So, generating 1000 events of 2.5 MeV e- in a XenonSphere at 1 bar: the files with strings and without strings are both 2.5 GB, so I see no difference, as observed earlier.

If I load the files in Python and write them out again with compression (and vary whether I drop the hits label column), I get the following:

With strings:
- label column dropped: 742 MB
- label column retained: 927 MB

Without strings:
- label column dropped: 741 MB
- label column retained: 803 MB

Conclusion of this test: most of the reduction comes from the compressed rewrite; on top of that, storing the label as an int or dropping it saves at most a further ~185 MB here.

Test 2: using an extremely small step size of 0.01 mm to bloat the hits table (1000 events of 2.5 MeV generated in the center):
- I still do not see a difference in file size between the two versions; both are 1.8 GB.
- Writing the file with compression gives 666 MB.
- Also dropping the hits label column gives a file size of 553 MB.

Conclusion: the hits label column contributes roughly 1/6 of the file size in this case (after writing with compression), but I do think the biggest gain came from reading and rewriting with compression on. Below is how I am writing to file:

# config, part, hits and sns_response are read from the original file as in the earlier snippet.
with pd.HDFStore("./Next100_smallstep_withstrings_rewrite.next.h5", mode='w') as store:
    # Write each DataFrame to the file with a unique key
    store.put('config', config)
    store.put('parts', part)
    store.put('hits', hits)
    store.put('sns_response', sns_response)

paolafer commented 9 months ago

Ok, we can definitely compress files after producing them, which is what is done by default I think, at least with the cosmogenic productions. At any rate, @pnovella found a significant improvement in managing cosmogenic files without strings, therefore I would approve these changes. What remains to be decided is which behaviour we want to be the default one: with or without strings? I personally vote for with-string files, because of readability, but I'm open to other opinions.

kvjmistry commented 9 months ago

Sounds good to me!

I am also in favor of including strings by default and keeping the int conversion available as an option for us to use.