Closed satyaog closed 2 years ago
@lebrice , @breuleux , @abergeron , I'm also wondering if this should or not be included in the milatools set? I think I would like an open discussion on that?
We should probably include all those who gave ideas here also right? https://github.com/mila-iqia/milatools/issues/13
I think it would cleaner and easier to switch from format to format if the selection of, for example, ImagenetDataModule vs ImagenetFfcvDataModule should be automatic and/or selected through an optional argument (ImagenetDataModule(var="ffcv"))
I disagree. They behave differently, and the FFCV version takes quite a few additional configuration options. It's also a good idea IMO to keep this separation of concerns, where the ImageNet datamodule creates the regular imagenet stuff, and the subclass just adds FFCV on top of that. There may come a time, if FFCV works really well, where we could create a generic FFCV "wrapper" that can be applied to other datamodules.
/network/datasets/ should be enclosed in a func and used when needed to replace the root of the datasets dir on the different clusters as the suffix of the dataset relative path will be the same through all clusters
Sure, I agree. This is the idea of the get_dataset_root
function (found here: https://github.com/lebrice/mila_datamodules/blob/69dab9fd22fa4a1a256ea987c2c3957c77dea7f6/mila_datamodules/registry.py#L78)
There may be differences in naming. This is a minor detail, it doesn't really matter imo. I just think that it's a good idea to make the least amount of assumptions possible about the naming and location of the datasets for each cluster. That way, we might have the opportunity to simplify / unify stuff later. Otherwise, we'd have to add special cases for newer clusters / dataset, and it would become a bit of a mess.
Very cool!
ImagenetDataModule
vsImagenetFfcvDataModule
should be automatic and/or selected through an optional argument (ImagenetDataModule(var="ffcv")
) https://github.com/lebrice/mila_datamodules/blob/master/mila_datamodules/vision/imagenet/imagenet.py#L76 https://github.com/lebrice/mila_datamodules/blob/master/mila_datamodules/vision/imagenet/imagenet_ffcv.py#L174/network/datasets/
should be enclosed in a func and used when needed to replace the root of the datasets dir on the different clusters as the suffix of the dataset relative path will be the same through all clusters. (ex.: on beluga,/network/datasets
could be replaced by${HOME}/projects/[rrg-bengioy-ad|or something else]/data/curated
) https://github.com/lebrice/mila_datamodules/blob/master/mila_datamodules/vision/coco.py#L9-L19datsaets/restricted
and a public version indatasets
). If the path could be used to identify the dataset, we could have something likemila_datamodules.dataset.get("inat")
andmila_datamodules.dataset.get("mimiciii")
and still allow users to access the raw data indatasets/restricted
with something likemila_datamodules.dataset.get("restricted/inat")
. The datasets preprocessed by torchvision could also be loaded by default or explicited withmila_datamodules.dataset.get("inat", var="torchvision")
which in this case would load the dataset located indatasets/inat.var/inat_torchvision
. So, datasets path would be used to identify which dataset root the user whats to use and thevar
arg would be used to get theDATASET.var/DATASET_${var}
.Small notes: on big datasets like c4 for which the user might not want to train on the full dataset, a way to glob some files to extract could be interesting. There are some datasets like the scannet for which the different files types could be cumbersome to implement for all possible use case. I wonder if it could be possible to abstract the location and some of the preparation steps while give the freedom to cherry-pick the parts of the datasets that is needed for training (along with annex orders when outputted (something like
(image, label, segmentation1, segmentation2, depth, ...)
)