graphnet-team / graphnet

A Deep learning library for neutrino telescopes
https://graphnet-team.github.io/graphnet/
Apache License 2.0

get_all_indices does not do what it says it does #734

Open eliotgenton opened 2 months ago

eliotgenton commented 2 months ago

https://github.com/graphnet-team/graphnet/blob/652f1948ed5c8c5a380105c5b5461a09c1e56d6a/src/graphnet/data/dataset/parquet/parquet_dataset.py#L192

I don't believe this function is intended to behave this way: it just returns the number of parquet files in a folder.

RasmusOrsoe commented 2 months ago

@eliotgenton thanks. The selection argument in ParquetDataset refers to the chunked, pre-shuffled batches of data used for training. This differs from the SQLite dataset, where the argument specifies individual events: that data format provides fast random access to individual rows, which makes it possible to shuffle on the fly. As a result, the _get_all_indices methods differ. As you point out, the function in ParquetDataset returns the total number of batches (files) available in the directory specified by the user.

I think this is indeed the intended behavior of the method, but we could add explanatory statements to make this distinction clearer.
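
To illustrate the distinction described above, here is a minimal, self-contained sketch of the two indexing schemes. It is not the graphnet implementation: the helper names `parquet_style_indices` and `sqlite_style_indices`, the table name `truth`, the column name `event_no`, and the file names are all assumptions made up for this example. It only shows that a file-based dataset can index batches (one index per file), while a database-backed dataset can index individual events (one index per row).

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path
from typing import List, Tuple


def parquet_style_indices(directory: Path) -> List[int]:
    """Hypothetical helper: one index per batch file in `directory`."""
    n_files = len(list(directory.glob("*.parquet")))
    return list(range(n_files))


def sqlite_style_indices(
    db_path: Path, table: str = "truth", index_column: str = "event_no"
) -> List[int]:
    """Hypothetical helper: one index per event row in `table`."""
    query = f"SELECT {index_column} FROM {table} ORDER BY {index_column}"
    with closing(sqlite3.connect(str(db_path))) as conn:
        rows = conn.execute(query).fetchall()
    return [row[0] for row in rows]


def demo() -> Tuple[List[int], List[int]]:
    """Build a toy directory and a toy database, then index both."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)

        # Three empty placeholder files stand in for pre-shuffled batches.
        # Only the file count matters for the batch-level index.
        for i in range(3):
            (tmp_path / f"batch_{i}.parquet").touch()

        # Ten events with arbitrary event numbers (100..109).
        db_path = tmp_path / "events.db"
        with closing(sqlite3.connect(str(db_path))) as conn:
            conn.execute("CREATE TABLE truth (event_no INTEGER PRIMARY KEY)")
            conn.executemany(
                "INSERT INTO truth (event_no) VALUES (?)",
                [(n,) for n in range(100, 110)],
            )
            conn.commit()

        return parquet_style_indices(tmp_path), sqlite_style_indices(db_path)


if __name__ == "__main__":
    batch_indices, event_indices = demo()
    print("batch-level indices:", batch_indices)
    print("event-level indices:", event_indices)
```

In this sketch, a selection such as `[0, 2]` would pick two whole batch files in the file-based case, whereas in the database case a selection lists individual event numbers. That is the difference in meaning of the selection argument discussed in this thread.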