umami-hep / atlas-ftag-tools

Python tools for ATLAS FTAG
2 stars 17 forks source link

H5Reader.load() needs an optional indices_to_load for loading specific jets/tracks #40

Closed IvanOleksiyuk closed 9 months ago

IvanOleksiyuk commented 1 year ago

This has to be a tule for reading from indices_to_load[0] to indices_to_load[1]. Needed to be integrated in umami/data_tools/loaders.py

samvanstroud commented 1 year ago

Thanks @IvanOleksiyuk. Can you explain the motivation for this addition? The philosophy for upp is that the selected indices should be computed and data selected upon on the fly from a contiguously loaded batch so as to avoid unnecessary data reads. Indexed reads from h5 are very slow.

Is there an example when we cannot cut on the loaded jets at the same time as we compute the indices?

IvanOleksiyuk commented 1 year ago

am not sure what do you mean by indices being "computed". Can you elaborate?

Indexed reads are indeed slow but a sliced read [start:finish] is pretty fast and it seems to be a nice feature to have especially given that Umami seems to be using that. For now the H5reader can only start reading from the start. I think it shouldn't make to much of a problem to implement other starting point.

IvanOleksiyuk commented 1 year ago

It might be also useful if one wants to split the data into seral subsets and read only one of those.

samvanstroud commented 1 year ago

I assume the propsed indices_to_load argument will accept some previously computed indices that correspond to jets passing some given selection.

The question is why not immediately apply the selection (cutting away the failing jets), rather than storing only the intermediate indices for later use.

IvanOleksiyuk commented 1 year ago

https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/umami/data_tools/loaders.py#L159

Here they just need a slice

IvanOleksiyuk commented 1 year ago

For now it seems to only be used here: https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/scripts/decorate_tagger_values.py#L130

Which should be likely possible to rewrite with the current functionality. But still I think that the sliced load might be useful in the future.

samvanstroud commented 1 year ago

I would argue we should avoiding adding it without a clear use case as it can encourage poorly structued workflows involving unnecessary and expensive reads from disk, what do you think?

samvanstroud commented 9 months ago

I think we agreed not to do this, @IvanOleksiyuk feel free to complain if you disagree