Thanks @IvanOleksiyuk. Can you explain the motivation for this addition? The philosophy for upp is that the selected indices should be computed and the data selected on the fly from a contiguously loaded batch, so as to avoid unnecessary data reads. Indexed reads from h5 are very slow.
Is there an example when we cannot cut on the loaded jets at the same time as we compute the indices?
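For concreteness, a minimal sketch of the pattern described above, assuming h5py and a structured "jets" dataset with a "pt" field (the cut itself is purely illustrative, not the actual upp code):

```python
import h5py

with h5py.File("jets.h5", "r") as f:
    jets_ds = f["jets"]
    batch_size = 100_000
    for start in range(0, len(jets_ds), batch_size):
        # contiguous read: a single slice, no fancy indexing against the file
        batch = jets_ds[start : start + batch_size]
        # compute the selection on the in-memory batch...
        mask = batch["pt"] > 20_000
        # ...and apply it immediately, so no intermediate indices need to be stored
        selected = batch[mask]
```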
I am not sure what you mean by the indices being "computed". Can you elaborate?
Indexed reads are indeed slow, but a sliced read [start:finish] is pretty fast, and it seems like a nice feature to have, especially given that Umami seems to be using it. For now the H5Reader can only read from the start of the file. I don't think it should be too much of a problem to implement an arbitrary starting point.
It might also be useful if one wants to split the data into several subsets and read only one of those.
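A rough illustration of the distinction being drawn here, assuming h5py (file and dataset names are placeholders): the sliced read maps to a single contiguous I/O request, while the fancy-indexed read generally does not and is typically much slower.

```python
import h5py
import numpy as np

with h5py.File("jets.h5", "r") as f:
    jets_ds = f["jets"]

    # sliced read: one contiguous block, generally fast
    chunk = jets_ds[1_000_000:2_000_000]

    # fancy-indexed read: h5py requires sorted, unique indices,
    # and the scattered access pattern is usually far slower
    idx = np.arange(1_000_000, 2_000_000, 2)
    sparse = jets_ds[idx]
```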
I assume the proposed indices_to_load argument will accept some previously computed indices that correspond to jets passing some given selection.
The question is why not immediately apply the selection (cutting away the failing jets), rather than storing only the intermediate indices for later use.
For now it seems to only be used here: https://gitlab.cern.ch/atlas-flavor-tagging-tools/algorithms/umami/-/blob/master/scripts/decorate_tagger_values.py#L130
It should likely be possible to rewrite that with the current functionality. But I still think that a sliced load might be useful in the future.
I would argue we should avoid adding it without a clear use case, as it can encourage poorly structured workflows involving unnecessary and expensive reads from disk. What do you think?
I think we agreed not to do this. @IvanOleksiyuk feel free to complain if you disagree.
This has to be a tuple for reading from indices_to_load[0] to indices_to_load[1]. It would need to be integrated into umami/data_tools/loaders.py.
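A hedged sketch of how such a tuple could be interpreted inside a loader; load_jets_slice and its signature are hypothetical illustrations, not the actual umami API:

```python
import h5py

def load_jets_slice(filepath, dataset="jets", indices_to_load=None):
    """Read the full dataset, or only the slice
    [indices_to_load[0]:indices_to_load[1]] if a tuple is given."""
    with h5py.File(filepath, "r") as f:
        ds = f[dataset]
        if indices_to_load is None:
            return ds[:]
        start, stop = indices_to_load
        return ds[start:stop]
```

The slice read stays contiguous, so it avoids the slow fancy-indexed h5 access discussed above while still allowing the data to be split into subsets.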