[WIP] Discussion

satyaog commented 2 years ago

Very cool!

I think it would be cleaner and easier to switch from format to format if the selection of, for example, ImagenetDataModule vs ImagenetFfcvDataModule were automatic and/or made through an optional argument (ImagenetDataModule(var="ffcv")) https://github.com/lebrice/mila_datamodules/blob/master/mila_datamodules/vision/imagenet/imagenet.py#L76 https://github.com/lebrice/mila_datamodules/blob/master/mila_datamodules/vision/imagenet/imagenet_ffcv.py#L174
/network/datasets/ should be enclosed in a function and used where needed to replace the root of the datasets dir on the different clusters, as the suffix of the dataset relative path will be the same across all clusters. (e.g., on beluga, /network/datasets could be replaced by ${HOME}/projects/[rrg-bengioy-ad|or something else]/data/curated) https://github.com/lebrice/mila_datamodules/blob/master/mila_datamodules/vision/coco.py#L9-L19
If we manage to do the above in a smart way, maybe we could also use the path of the dataset to identify which one we are using. For example, there are 2 inat and mimiciii datasets (each one has a restricted version in datasets/restricted and a public version in datasets). If the path could be used to identify the dataset, we could have something like mila_datamodules.dataset.get("inat") and mila_datamodules.dataset.get("mimiciii") and still allow users to access the raw data in datasets/restricted with something like mila_datamodules.dataset.get("restricted/inat"). The datasets preprocessed by torchvision could also be loaded by default or requested explicitly with mila_datamodules.dataset.get("inat", var="torchvision"), which in this case would load the dataset located in datasets/inat.var/inat_torchvision. So, the dataset path would be used to identify which dataset root the user wants to use, and the var argument would be used to get DATASET.var/DATASET_${var}.
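The path convention described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `DATASET_ROOTS`, `dataset_path`, and the `cluster` argument are hypothetical names, the beluga root is a placeholder, and applying `var` to a `restricted/` dataset is a guess at how the convention would extend.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical per-cluster roots. Only the "mila" root comes from the
# discussion; the "beluga" entry is a placeholder, not a real location.
DATASET_ROOTS = {
    "mila": PurePosixPath("/network/datasets"),
    "beluga": PurePosixPath("/placeholder/beluga/datasets"),
}


def dataset_path(name: str, var: Optional[str] = None,
                 cluster: str = "mila") -> PurePosixPath:
    """Map a dataset name (its relative path) and optional variant to a path.

    Without `var`, the result is ROOT/name.
    With `var`, the result is ROOT/<parent>/<base>.var/<base>_<var>.
    """
    root = DATASET_ROOTS[cluster]
    relative = PurePosixPath(name)
    if var is None:
        return root / relative
    base = relative.name  # "inat" for both "inat" and "restricted/inat"
    return root / relative.parent / f"{base}.var" / f"{base}_{var}"
```

With this sketch, `dataset_path("restricted/inat")` and `dataset_path("inat")` point to different directories under the same root, and switching clusters only changes the prefix.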

Small notes: on big datasets like c4, for which the user might not want to train on the full dataset, a way to glob some files to extract could be interesting. There are some datasets, like scannet, for which the different file types could be cumbersome to implement for all possible use cases. I wonder if it would be possible to abstract the location and some of the preparation steps while giving the freedom to cherry-pick the parts of the dataset that are needed for training (along with the order in which the annexes are output, something like (image, label, segmentation1, segmentation2, depth, ...)).
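As a rough illustration of the "glob some files" idea, a helper along these lines could let the user pick a subset of a large dataset before anything is extracted. The name `select_files` and its signature are hypothetical, and the sketch assumes the dataset is a plain directory tree.

```python
from pathlib import Path
from typing import Iterable, List


def select_files(root: Path, patterns: Iterable[str]) -> List[Path]:
    """Return the files under `root` that match any of the glob patterns.

    Results are de-duplicated and sorted so the selection is reproducible.
    """
    selected = set()
    for pattern in patterns:
        # Path.glob yields directories too, so keep regular files only.
        selected.update(p for p in root.glob(pattern) if p.is_file())
    return sorted(selected)
```

A call such as `select_files(root, ["en/part-00[01].json"])` would then return only the first two shards of a hypothetical sharded layout instead of the whole dataset.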

satyaog commented 2 years ago

@lebrice, @breuleux, @abergeron, I'm also wondering whether or not this should be included in the milatools set. I think I would like an open discussion on that.

satyaog commented 2 years ago

We should probably also include all those who gave ideas here, right? https://github.com/mila-iqia/milatools/issues/13

lebrice commented 2 years ago

> I think it would be cleaner and easier to switch from format to format if the selection of, for example, ImagenetDataModule vs ImagenetFfcvDataModule were automatic and/or made through an optional argument (ImagenetDataModule(var="ffcv"))

I disagree. They behave differently, and the FFCV version takes quite a few additional configuration options. It's also a good idea IMO to keep this separation of concerns, where the ImageNet datamodule creates the regular imagenet stuff, and the subclass just adds FFCV on top of that. There may come a time, if FFCV works really well, where we could create a generic FFCV "wrapper" that can be applied to other datamodules.
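To make the separation-of-concerns argument concrete, here is a minimal sketch using plain Python stand-ins. None of these classes or option names come from mila_datamodules or FFCV: `PlainDataModule`, `FastFormatDataModule`, and `FormatOptions` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FormatOptions:
    """Hypothetical settings that only the alternative format needs."""
    cache_file: str
    max_resolution: int = 256


class PlainDataModule:
    """Stand-in for the regular datamodule: no format-specific options."""

    def __init__(self, data_dir: str, batch_size: int = 32):
        self.data_dir = data_dir
        self.batch_size = batch_size

    def loader_config(self) -> dict:
        return {"data_dir": self.data_dir,
                "batch_size": self.batch_size,
                "format": "plain"}


class FastFormatDataModule(PlainDataModule):
    """Stand-in for the subclass: adds the extra options on top of the base."""

    def __init__(self, data_dir: str, options: FormatOptions,
                 batch_size: int = 32):
        super().__init__(data_dir, batch_size)
        self.options = options

    def loader_config(self) -> dict:
        config = super().loader_config()
        config.update(format="fast",
                      cache_file=self.options.cache_file,
                      max_resolution=self.options.max_resolution)
        return config
```

In this arrangement the base constructor never has to accept options that only make sense for one format, which is what a single `var="ffcv"` switch on the base class would require.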

> /network/datasets/ should be enclosed in a function and used where needed to replace the root of the datasets dir on the different clusters, as the suffix of the dataset relative path will be the same across all clusters

Sure, I agree. This is the idea of the get_dataset_root function (found here: https://github.com/lebrice/mila_datamodules/blob/69dab9fd22fa4a1a256ea987c2c3957c77dea7f6/mila_datamodules/registry.py#L78)

There may be differences in naming. This is a minor detail; it doesn't really matter imo. I just think that it's a good idea to make as few assumptions as possible about the naming and location of the datasets on each cluster. That way, we might have the opportunity to simplify / unify stuff later. Otherwise, we'd have to add special cases for newer clusters / datasets, and it would become a bit of a mess.
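One way to read the "fewest assumptions" point is an explicit lookup table with one entry per (cluster, dataset) pair, so that neither the directory name nor the root has to follow a shared convention. The sketch below is not the actual get_dataset_root implementation: `lookup_dataset_root`, the cluster names, and every path are illustrative placeholders.

```python
from pathlib import PurePosixPath

# Hypothetical registry: each (cluster, dataset) pair is listed explicitly.
# All paths are placeholders, not real cluster locations.
_REGISTRY = {
    ("cluster_a", "imagenet"): PurePosixPath("/placeholder/a/datasets/imagenet"),
    ("cluster_b", "imagenet"): PurePosixPath("/placeholder/b/shared/ImageNet-2012"),
}


def lookup_dataset_root(dataset: str, cluster: str) -> PurePosixPath:
    """Return the registered location of `dataset` on `cluster`."""
    try:
        return _REGISTRY[(cluster, dataset)]
    except KeyError:
        raise KeyError(
            f"No location registered for {dataset!r} on {cluster!r}"
        ) from None
```

Supporting a new cluster then means adding rows to the table rather than special-casing a naming rule.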

mila-iqia / mila_datamodules

[WIP] Discussion #1