Open eliotgenton opened 2 months ago
@eliotgenton thanks. The selection argument in the ParquetDataset refers to the chunked, pre-shuffled batches of data used for training. This is different from the SQLite dataset, where the argument specifies individual events: that data format provides fast random access to individual rows, which makes it possible to shuffle on the fly. As a result, the _get_all_indices methods are different. As you point out, the function in ParquetDataset returns the total number of batches (files) available in the directory specified by the user.
I think this is indeed the intended usage of the method, but we could add statements to make this distinction clearer.
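To illustrate the distinction, here is a minimal sketch. It is not the actual graphnet code: the helper names `batch_indices` and `event_indices`, the `*.parquet` glob pattern, and the table/column arguments are all assumptions made up for the example.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import List


def batch_indices(directory: Path) -> List[int]:
    """Hypothetical helper: one index per parquet file (batch) in `directory`.

    Assumes every batch is stored as its own `*.parquet` file.
    """
    n_batches = len(list(directory.glob("*.parquet")))
    return list(range(n_batches))


def event_indices(db_path: Path, table: str, index_column: str) -> List[int]:
    """Hypothetical helper: one index per event (row) in a SQLite table.

    `table` and `index_column` are interpolated directly into the query,
    so they must come from trusted input in this sketch.
    """
    query = f"SELECT {index_column} FROM {table} ORDER BY {index_column}"
    with closing(sqlite3.connect(str(db_path))) as conn:
        rows = conn.execute(query).fetchall()
    return [row[0] for row in rows]
```

With helpers like these, a selection of `[0, 1]` would mean "the first two batch files" for the parquet case, but "the events with index 0 and 1" for the SQLite case.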
https://github.com/graphnet-team/graphnet/blob/652f1948ed5c8c5a380105c5b5461a09c1e56d6a/src/graphnet/data/dataset/parquet/parquet_dataset.py#L192
I believe this function is not intended to do this, as it just returns the number of parquet files in a folder.