legend-exp / pygama

Python package for data processing and analysis
https://pygama.readthedocs.io
GNU General Public License v3.0
18 stars 56 forks source link

Random values in `DataLoader.load()` output if a stream/tier does not provide values for a certain column #518

Open gipert opened 1 year ago

gipert commented 1 year ago

Example: is_valid_0vbb is available in tier hit for the geds subsystem only. If loading data from spms and geds at the same time, is_valid_0vbb will randomly provide True or False (I guess because of uninitialized memory).

This is obviously very dangerous and must be fixed ASAP.

We cannot simply fix it by using default values unfortunately (NaN, for example, works only with floats), so we need to return a different data structure.

jasondet commented 1 year ago

are you sure it's not just a problem related to the config files not being generated correctly? we have other parameters in the hit tier that are there for ged and not spm and did not have this problem in testing, I think, right @gracesong312 ?

gipert commented 1 year ago

Yes I am sure. It's just because in rectangular data structures you obviously need some placeholder, when the hit corresponding to a certain index does not define a column value:

hit_table name    is_valid_0vbb
1052802   V07302A True
1052802   S060    ?????

In case of floats one could use NaN, but no "missing" placeholder exists for booleans.

gracesong312 commented 1 year ago

I think I overlooked this in the original testing, here's a basic test I just ran on the legend-testdata files:

Is it possible to just return None?

gipert commented 1 year ago

In my test I randomly get true or false in is_valid_0vbb for SiPM data. I can work on a MWE tomorrow.

@gracesong312 do you pre-allocate empty columns for the output table? This could explain why the values are unpredictable (because no actual value is ever written to the pre-allocated memory).

Pandas uses NumPy internally, so a boolean column cannot contain non-booleans. I would not force that column to be float in order to be able to use NaNs, I think we should use a different data structure. Discussion for next week!

gracesong312 commented 1 year ago

Yes, I allocate memory with np.empty which explains the random values. https://github.com/legend-exp/pygama/blob/663d352ef9d2d5b73b0c5927dad88692a5f80442/src/pygama/flow/utils.py#L137-L144

jasondet commented 1 year ago

shouldn't the ged hits and spm hits be in two separate lists? otherwise we are going to have nans everywhere, right?

hit_table name energy npe 1052802 V07302A 2039 nan 1052802 S060 nan 3


Jason Detwiler Department of Physics, Box 351560 University of Washington Seattle, WA 98195 (206) 543-4054

On Thu, Oct 12, 2023 at 2:32 PM Grace Song @.***> wrote:

Yes, I allocate memory with np.empty which explains the random values. https://github.com/legend-exp/pygama/blob/663d352ef9d2d5b73b0c5927dad88692a5f80442/src/pygama/flow/utils.py#L137-L144 https://urldefense.com/v3/__https://github.com/legend-exp/pygama/blob/663d352ef9d2d5b73b0c5927dad88692a5f80442/src/pygama/flow/utils.py*L137-L144__;Iw!!K-Hz7m0Vt54!gCX4qyzdSLr9lc6oeLCOU0jsUpBByOXxbbzKqJJzWnJvhvppOTBtQ0-Igj0EzICYxMc438egOj73chertl9qJYE$

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/legend-exp/pygama/issues/518*issuecomment-1760392434__;Iw!!K-Hz7m0Vt54!gCX4qyzdSLr9lc6oeLCOU0jsUpBByOXxbbzKqJJzWnJvhvppOTBtQ0-Igj0EzICYxMc438egOj73chertEPmOLM$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AARCTXUQNNEWVH7WJLBRB7TX7BOWRANCNFSM6AAAAAA554USNE__;!!K-Hz7m0Vt54!gCX4qyzdSLr9lc6oeLCOU0jsUpBByOXxbbzKqJJzWnJvhvppOTBtQ0-Igj0EzICYxMc438egOj73cherfIjYTXM$ . You are receiving this because you commented.Message ID: @.***>