mila-iqia / mila_datamodules

Efficient Datamodules Customized for the Mila / CC clusters
MIT License
4 stars 3 forks source link

[WIP] Discussion #1

Closed satyaog closed 2 years ago

satyaog commented 2 years ago

Very cool!

Small notes: on big datasets like c4 for which the user might not want to train on the full dataset, a way to glob some files to extract could be interesting. There are some datasets like the scannet for which the different files types could be cumbersome to implement for all possible use case. I wonder if it could be possible to abstract the location and some of the preparation steps while give the freedom to cherry-pick the parts of the datasets that is needed for training (along with annex orders when outputted (something like (image, label, segmentation1, segmentation2, depth, ...))

satyaog commented 2 years ago

@lebrice , @breuleux , @abergeron , I'm also wondering if this should or not be included in the milatools set? I think I would like an open discussion on that?

satyaog commented 2 years ago

We should probably include all those who gave ideas here also right? https://github.com/mila-iqia/milatools/issues/13

lebrice commented 2 years ago

I think it would cleaner and easier to switch from format to format if the selection of, for example, ImagenetDataModule vs ImagenetFfcvDataModule should be automatic and/or selected through an optional argument (ImagenetDataModule(var="ffcv"))

I disagree. They behave differently, and the FFCV version takes quite a few additional configuration options. It's also a good idea IMO to keep this separation of concerns, where the ImageNet datamodule creates the regular imagenet stuff, and the subclass just adds FFCV on top of that. There may come a time, if FFCV works really well, where we could create a generic FFCV "wrapper" that can be applied to other datamodules.

/network/datasets/ should be enclosed in a func and used when needed to replace the root of the datasets dir on the different clusters as the suffix of the dataset relative path will be the same through all clusters

Sure, I agree. This is the idea of the get_dataset_root function (found here: https://github.com/lebrice/mila_datamodules/blob/69dab9fd22fa4a1a256ea987c2c3957c77dea7f6/mila_datamodules/registry.py#L78)

There may be differences in naming. This is a minor detail, it doesn't really matter imo. I just think that it's a good idea to make the least amount of assumptions possible about the naming and location of the datasets for each cluster. That way, we might have the opportunity to simplify / unify stuff later. Otherwise, we'd have to add special cases for newer clusters / dataset, and it would become a bit of a mess.