scipp / chexus

Validate NeXus files
https://scipp.github.io/chexus/
BSD 3-Clause "New" or "Revised" License
0 stars 2 forks source link

feat: add validator for verifying uniqueness of detector_numbers #36

Closed jokasimr closed 3 weeks ago

jokasimr commented 3 weeks ago

Fixes #35

Note that this changes the way Chexus works in a significant way. ~This change makes Chexus read every dataset in the file and store them in memory while validating.

I think the better solution would be if the validators could decide to read data or not on a case by case basis, to limit the total amount read and to limit the max amount stored in memory.

But that would require some larger changes to the way chexus works.

SimonHeybrock commented 3 weeks ago

Fixes #35

Note that this changes the way Chexus works in a significant way. This change makes Chexus read every dataset in the file and store them in memory while validating.

I think the better solution would be if the validators could decide to read data or not on a case by case basis, to limit the total amount read and to limit the max amount stored in memory.

But that would require some larger changes to the way chexus works.

* Should we do those changes now or wait until this actually becomes a problem in practice?

* Do we want those changes in this PR or in a separate PR?

I think this is going to be a big performance problem in practice. Why not just load all dataset values that are in an allow-list, which for now is maintained by hand?

jokasimr commented 3 weeks ago

Why not just load all dataset values that are in an allow-list, which for now is maintained by hand?

That's an interesting solution that I didn't consider. I just made some changes that lets us read data on-demand in the validators. It's a bit more complicated than the allow-list, but it has the advantage that in the multiple-detector case we don't have to keep all event_ids from all detectors in memory simultaneously. We also don't have to keep a list of rules for what fields should be read.

Do you prefer the allow-list or the load-on-demand version?

SimonHeybrock commented 3 weeks ago

Why not just load all dataset values that are in an allow-list, which for now is maintained by hand?

That's an interesting solution that I didn't consider. I just made some changes that lets us read data on-demand in the validators. It's a bit more complicated than the allow-list, but it has the advantage that in the multiple-detector case we don't have to keep all event_ids from all detectors in memory simultaneously. We also don't have to keep a list of rules for what fields should be read.

Do you prefer the allow-list or the load-on-demand version?

Not really, I am fine with either.