graphnet-team / graphnet

A Deep learning library for neutrino telescopes
https://graphnet-team.github.io/graphnet/
Apache License 2.0

Add `PublicBenchmarkDataset` & `SecretDataset` #747

Closed RasmusOrsoe closed 2 months ago

RasmusOrsoe commented 2 months ago

This PR adds extensions of ERDAHostedDataset that allow us to build and share public benchmarking datasets, and secret ones! It also adds functionality to ParquetDataset that removes chunk ids that don't exist from the selection.
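
To illustrate the chunk-id filtering, here is a minimal, self-contained sketch. It is not the code in this PR: the helper name `filter_existing_chunks`, the `pattern` argument and the `<chunk_id>.parquet` file naming are all assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import List


def filter_existing_chunks(
    chunk_ids: List[int],
    chunk_dir: Path,
    pattern: str = "{}.parquet",  # assumed naming scheme, for illustration only
) -> List[int]:
    """Return only the chunk ids that have a matching file in `chunk_dir`."""
    return [
        chunk_id
        for chunk_id in chunk_ids
        if (chunk_dir / pattern.format(chunk_id)).is_file()
    ]


# Small demonstration on a temporary directory holding chunks 0 and 2 only.
with tempfile.TemporaryDirectory() as tmp:
    chunk_dir = Path(tmp)
    for chunk_id in (0, 2):
        (chunk_dir / f"{chunk_id}.parquet").touch()

    # The selection asks for chunks 0-3; ids 1 and 3 have no file and are dropped.
    kept = filter_existing_chunks([0, 1, 2, 3], chunk_dir)
```

Dropping missing ids up front means a selection written for the full dataset can be reused on a partial download without failing when a chunk file is absent.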

Below is an example of the syntax of SecretDataset - a way for us to share datasets using ERDA sharelinks:

from graphnet.data import SecretDataset

dm = SecretDataset(
    secret="secret-erda-sharelink",
    graph_definition=...,
    download_dir="/home/cool-datasets/",
    backend="parquet",
    mode="train",
)

training_dataloader = dm.train_dataloader
validation_dataloader = dm.val_dataloader
test_dataloader = dm.test_dataloader

The idea is that we can distribute datasets "secretly" to colleagues. Once the data is ready to be made public, it can be released by subclassing PublicBenchmarkDataset, which gives a similar syntax:

from graphnet.datasets import ABenchmarkDataset

dm = ABenchmarkDataset(
    graph_definition=...,
    download_dir="/home/cool-datasets/",
    backend="parquet",
    mode="train",
)

training_dataloader = dm.train_dataloader
validation_dataloader = dm.val_dataloader
test_dataloader = dm.test_dataloader
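
The secret-then-public pattern can be sketched with stand-in classes. This is only an illustration of the idea, not graphnet's implementation: `HostedDatasetSketch`, `ABenchmarkDatasetSketch`, the `_sharelink` attribute and the `"public-erda-sharelink"` placeholder are hypothetical, and nothing is downloaded.

```python
from typing import Optional


class HostedDatasetSketch:
    """Hypothetical stand-in for a hosted-dataset base class (not graphnet's)."""

    # Public subclasses fix the sharelink here; secret use passes it at runtime.
    _sharelink: Optional[str] = None

    def __init__(
        self,
        download_dir: str,
        backend: str = "parquet",
        mode: str = "train",
        secret: Optional[str] = None,
    ) -> None:
        # Prefer an explicitly passed secret, else fall back to the class value.
        link = secret if secret is not None else self._sharelink
        if link is None:
            raise ValueError(
                "No sharelink available: pass `secret` or subclass with `_sharelink`."
            )
        self.sharelink = link
        self.download_dir = download_dir
        self.backend = backend
        self.mode = mode


class ABenchmarkDatasetSketch(HostedDatasetSketch):
    """Hypothetical public dataset: the sharelink is baked into the subclass."""

    _sharelink = "public-erda-sharelink"  # placeholder, not a real link


# Secret distribution: the user supplies the sharelink.
secret_dm = HostedDatasetSketch(
    download_dir="/home/cool-datasets/", secret="secret-erda-sharelink"
)

# Public release: same call, minus the `secret` argument.
public_dm = ABenchmarkDatasetSketch(download_dir="/home/cool-datasets/")
```

With this shape, making a dataset public only requires publishing a small subclass; user code changes by dropping one argument.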
RasmusOrsoe commented 2 months ago

@Aske-Rosted thanks for taking a look. It looks like I mistakenly merged another branch into this one, which is causing the checks to fail. I think your comments on the toggles between "test", "train" and "no-noise" are fair; the toggles are admittedly quite specific to what I intend to use them for. I'll close this PR and open a new one in the future.