breuleux opened this issue 2 years ago:
Here is my opening suggestion for how to do it. Don't hesitate to criticize it :)
```
milatools
    torch
        datasets
            class CachedDataset
            class Split
            ...
            ImageFolder = milatools.env.get_current().get_builder("torchvision.datasets.ImageFolder")
            MNIST = milatools.env.get_current().get_builder("torchvision.datasets.MNIST")
            ...
    env
        def get_current
        class ComputeEnvironment
        mila
            class MilaEnvironment
                get_builder("torchvision.datasets.MNIST") = partial(milatools.torch.datasets.CachedDataset, src=self.dataroot / "MNIST", dst=self.scratch / "MNIST")
        cc
            class ComputeCanadaEnvironment
        ...
```
Basically, `milatools.torch.datasets` would just be shims that ask the current environment for the appropriate dataset constructor, plus some helpers like `CachedDataset`. On our cluster, `MilaEnvironment` would have the responsibility of mapping constructors such as `torchvision.datasets.MNIST` to an adapted or optimized version. Ultimately, `milatools.env.get_current` would be in charge of figuring out what the environment is.

Figuring out the environment: I believe the most robust way to determine whether we are running on the Mila cluster or not is to use the fully qualified domain name: `socket.getfqdn().endswith(".server.mila.quebec")`.
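To illustrate the detection, here is a minimal sketch of what `get_current` could look like; `LocalEnvironment` and the caching are assumptions on my part, not settled API:

```python
import socket
from functools import lru_cache


class ComputeEnvironment:
    """Base class for compute environments (sketch)."""


class MilaEnvironment(ComputeEnvironment):
    pass


class LocalEnvironment(ComputeEnvironment):
    pass


@lru_cache(maxsize=None)
def get_current() -> ComputeEnvironment:
    # The FQDN check is the heuristic suggested above; checks for other
    # clusters (e.g. Compute Canada) would be added here over time.
    if socket.getfqdn().endswith(".server.mila.quebec"):
        return MilaEnvironment()
    return LocalEnvironment()
```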
Environments we may want to support, in priority order
Copying the dataset: Someone please confirm, but my understanding is that we ideally want to copy from `/network/datasets/X` (or a user-defined location if the dataset is non-standard) to `$SLURM_TMPDIR/X`; if the dataset has to be untarred or unzipped, do it directly from the source into the destination.
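Something along these lines, as a rough sketch only; the helper name, the `.tar` naming convention, and the exact layout of `/network/datasets` are assumptions here:

```python
import os
import shutil
import tarfile
from pathlib import Path


def stage_dataset(name: str, src_root: str = "/network/datasets") -> Path:
    """Copy a dataset (or untar it directly) into $SLURM_TMPDIR and return the local path."""
    src = Path(src_root) / name
    dst = Path(os.environ["SLURM_TMPDIR"]) / name
    if dst.exists():
        return dst  # already staged earlier in the same job
    archive = src.with_suffix(".tar")
    if archive.exists():
        # Extract directly from the network filesystem into the node-local
        # destination, without materializing an intermediate copy of the archive.
        with tarfile.open(archive) as tar:
            tar.extractall(dst)
    else:
        shutil.copytree(src, dst)
    return dst
```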
Arbitrary datasets: We probably also want a generic system for arbitrary user datasets that are not in the official datasets list. Here's an idea for a possible low-level interface:
```python
path = milatools.datasets.resolve(
    template={"name": "X", "tmp": os.environ["SLURM_TMPDIR"]},
    # Places where the data can be found
    sources=[
        "local://../data/{name}",
        "mila:///network/datasets/{name}",
        "mila:///network/scratch/u/user/{name}",
        "https://download.site",
        "s3://blah",
        "magnet:?xt=abcdef",
    ],
    # Place where the data can be cached (if downloaded)
    cache=["mila:///network/scratch/u/user/{name}"],
    # Places where the data must go
    destinations=["local://../data/{name}", "mila://{tmp}/{name}"],
    # Whether the data should be untarred, unzipped, etc.
    unpack=True,
    # Optionally, md5 hash for verification
    md5="xyz",
)
data = ImageFolder(path)
```
I am sure the interface can be improved, but the basic idea is to let the user specify many paths for the data, separated into "sources" where it can be found, "cache" where the download can be cached, and "destinations" where it must be copied or unpacked for actual processing. In a local setting, the same path can be used for all roles. Each environment could fill in some paths by default, so that e.g. the `mila://` paths would be added automatically, and `local://` could be the assumed default.
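For instance, in a purely local setting the same directory could play all three roles (still using the hypothetical `resolve` interface above, so this is illustrative rather than working code):

```python
# Everything points at the same local folder; bare paths fall back to local://.
path = milatools.datasets.resolve(
    template={"name": "X"},
    sources=["../data/{name}"],
    cache=["../data/{name}"],
    destinations=["../data/{name}"],
    unpack=True,
)
```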
I would add an `env://` location so people can override our `local://` default to something sensible for their system.
I agree with the general ideas described above. Yes, I think we want to use `$SLURM_TMPDIR` as the destination whenever we can. One thing we should keep in mind is datasets that do not fit in `$SLURM_TMPDIR`, in which case the cache location could be used.
While I think it makes sense to return torchvision / PyTorch dataset instances whenever we can, we should not commit to that feature for all datasets. A lot of datasets do not follow PyTorch's standard structure, and some are sufficiently complex that I would not want to have to understand how the different auxiliary files interact with each other and which one is used in which case. So we should also allow users to retrieve the location of the dataset's root in the destination, so that custom dataset classes can be implemented by the users when necessary. We could allow users to then submit those classes as PRs, but this would add another layer of complexity to handle, as it is likely that some dataset classes would not generalize to all use cases, specifically when a dataset contains multiple data types (audio, video, images, sensor data, ...). In those cases, a well-documented collection of dataset classes could be necessary. So I see returning PyTorch datasets as a strong feature, but still something that could be implemented in a second step.
```python
ImageFolder = milatools.env.torchvision.datasets.ImageFolder
MNIST = milatools.env.torchvision.datasets.MNIST
```

Wouldn't the above be cleaner?
Later, users could go as far as:

```python
import milatools.env.torchvision as torchvision
```

and have to change no code (ideally).

`get_current()` is constant anyway, and we can do the `get_builder` lookup behind the scenes the first time.
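A minimal sketch of such a shim module, assuming the `get_builder` interface proposed above (the module path is illustrative):

```python
# milatools/env/torchvision/datasets.py (hypothetical shim)
from milatools.env import get_current

_env = get_current()  # effectively constant for the lifetime of the process

# Resolve the builders once, at import time, so user code can simply do
# `from milatools.env.torchvision.datasets import MNIST`.
ImageFolder = _env.get_builder("torchvision.datasets.ImageFolder")
MNIST = _env.get_builder("torchvision.datasets.MNIST")
```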
Figuring out the environment: I think it should be hidden away like it is here, and then we improve the heuristic as bugs/use cases come up: https://github.com/mila-iqia/milatools/blob/35676c3bdde67188e00dbb550d9e170081d8d383/milatools/utils.py#L4
About multi-framework support, we could have something like this in `milatools/datasets.py`:

```python
import warnings

try:
    import torch
except ImportError:
    warnings.warn(
        "`torch` is not installed. To have PyTorch support in milatools.datasets, "
        "please `pip install torch`."
    )
else:
    from milatools.datasets.torch import *
```

So that we just don't load what is not available; if they want to import the module anyway, they can use the full path, but that's on them.
@Delaunay

> I would add a env:// location so people can override our local:// default to something sensible for their system

I'm not sure what you mean. Also, I may have expressed myself poorly: what I meant by "`local://` could be the assumed default" was that it would be the default protocol if you just write `path/to/data`.
@satyaog

> While I think it makes sense to return torchvision / pytorch datasets instances whenever we can, we should not commit on that feature for all datasets.

I'm thinking that we should support the range of ways people typically use to load datasets. If some people use torchvision, we should have a compatible wrapper; if others do something custom with the data on disk, we should have something to help them ensure the data is in the proper place and return the dataset's final location; and so on.
@manuel-delverme

> Wouldn't the above be cleaner?

Possibly, but it depends on how the environments declare these things, e.g. whether they are attributes or keys in a dictionary. It's an implementation detail.

> Later the users can go as far as: `import milatools.env.torchvision as torchvision` and have to change no code (ideally)

True, that would be ideal. Or just `import milatools.torchvision as torchvision`.

> get_current() is anyway constant and we can do the get_builder behind the scenes the first time.

True, it could be done at the module level.
> I think it should be hidden away like here, and then we improve the heuristic as bugs/usecases come up

It would be hidden away for sure, but it's best to get it right the first time ;)
The main point I want to make about environments is that I would prefer to test for which cluster we are on, and use the result of that test to switch in the appropriate set of dataloaders wholesale from `milatools.env.mila`, `milatools.env.cc`, etc., than to, say, have a single `MNIST` function with a bunch of if statements. Alternatively, environments could offer a common interface that `MNIST` could use.
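As a sketch of the "wholesale" option (module names follow the layout at the top of this issue, but none of this is settled):

```python
# milatools/torch/datasets/__init__.py (hypothetical)
from milatools.env import get_current
from milatools.env.mila import MilaEnvironment

if isinstance(get_current(), MilaEnvironment):
    # On the Mila cluster: swap in the adapted/optimized constructors wholesale.
    from milatools.torch.datasets._mila import *
else:
    # Anywhere else: plain torchvision behaviour.
    from milatools.torch.datasets._default import *
```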
> About multi framework support we can have something of the kind ... `try: import torch; ... from milatools.datasets.torch import *`

That would suggest that the symbols available in `milatools.datasets` would vary depending on what modules you have installed. If there is any overlap in the symbols exported by, say, `milatools.datasets.torch` and `milatools.datasets.tensorflow`, then the mere action of installing TensorFlow in a venv which already has PyTorch (or vice versa) could cause unexpected breakage.
I think it's better to require the user to be explicit about what they want.
I suggest that we first focus our efforts on providing a drop-in replacement for the torchvision.datasets module, as was mentioned above, and make it public as a single, simple package on PyPI.
I suggest the following as design requirements:
I started coding up / brainstorming a few things around this; here's what I have so far: https://github.com/lebrice/milatools/tree/milavision/milavision
Please do let me know what you think.
@lebrice

I am somewhat agnostic about whether we should have a separate package or an omnibus one. I will note that putting everything together still won't amount to a large amount of code, and that we can separate optional dependencies as `pip install milatools[torchvision]`.
Also, torchvision is one thing, but we also have to address the general problem of dataset management: if I want to use data that can be downloaded from URI X (regardless of its format), how do I get it where it needs to be in all compute environments I want to run experiments in? Can I cache it? Where can I cache it? This is more or less what a milavision package has to do with the torchvision datasets, but it is more general than torchvision: someone who is using tf or jax for their project should have access to this functionality without having a torch dependency. But to avoid code duplication, wherever it is that we put this generic functionality, it's going to be a dependency of milavision. So now we have two packages.
> Support for all datasets of torchvision.datasets: If we don't have a dataset stored or there is any error, falls back to the default behaviour of torchvision.datasets

Even if we don't have a given dataset, we should still override the default behavior to download it to `$SLURM_TMPDIR` and possibly cache it in the user's scratch (possibly using a flag).
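A rough sketch of that fallback, using MNIST as an example; the scratch cache path and the `$SCRATCH` variable are assumptions on my part:

```python
import os
import shutil
from pathlib import Path

from torchvision.datasets import MNIST


def mnist_with_fallback(cache_in_scratch: bool = True) -> MNIST:
    """Download into $SLURM_TMPDIR, reusing a copy cached in the user's scratch."""
    tmp = Path(os.environ["SLURM_TMPDIR"])
    cache = Path(os.environ["SCRATCH"]) / "milatools_cache" / "MNIST"
    if cache.exists():
        # Seed the node-local copy from scratch to avoid re-downloading.
        shutil.copytree(cache, tmp / "MNIST", dirs_exist_ok=True)
    dataset = MNIST(root=tmp, download=True)
    if cache_in_scratch and not cache.exists():
        shutil.copytree(tmp / "MNIST", cache)
    return dataset
```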
> Also, torchvision is one thing, but we also have to address the general problem of dataset management

I agree, but I'd argue that we need to nail the first problem before moving on to the second.
@breuleux

> I'm thinking that we should support the range of ways people typically use to load datasets. If some people use torchvision, we should have a compatible wrapper, if others do something custom with the data on disk, we should have something to help them ensure the data is in the proper place and return the dataset's final location, and so on.

I'm more concerned about the level of commitment we aim for, not saying that we should not implement the loaders for the datasets we can. We could maybe have a look at the `ami` dataset on the cluster. That one looks to me like a good example of a dataset we are most likely not fit to implement a loader for, compared to what a user who knows it could do (and provide a PR, or a paper we could use as a reference, to implement a loader that makes sense). Another example would be the restricted ScanNet dataset (made by Stanford U, which, like ImageNet, aims to be the ImageNet of the robotics domain):
```
ortizgas@login-2:~$ ls -lL /network/datasets/restricted/scannet_users/scannet/scans/scene0000_00/
total 4071470
-r--r--r-- 1 1008 2001   61357526 Jul  8 01:37 scene0000_00_2d-instance-filt.zip
-r--r--r-- 1 1008 2001  103278606 Jul  8 01:37 scene0000_00_2d-instance.zip
-r--r--r-- 1 1008 2001   66389949 Jul  8 01:37 scene0000_00_2d-label-filt.zip
-r--r--r-- 1 1008 2001  112937124 Jul  8 01:37 scene0000_00_2d-label.zip
-r--r--r-- 1 1008 2001       8850 Jul  8 01:37 scene0000_00.aggregation.json
-r--r--r-- 1 1008 2001 3699853377 Jul  8 01:37 scene0000_00.sens
-r--r--r-- 1 1008 2001        702 Jul  8 01:37 scene0000_00.txt
-r--r--r-- 1 1008 2001     478809 Jul  8 01:37 scene0000_00_vh_clean_2.0.010000.segs.json
-r--r--r-- 1 1008 2001    3461577 Jul  8 01:37 scene0000_00_vh_clean_2.labels.ply
-r--r--r-- 1 1008 2001    3298819 Jul  8 01:37 scene0000_00_vh_clean_2.ply
-r--r--r-- 1 1008 2001       8850 Jul  8 01:37 scene0000_00_vh_clean.aggregation.json
-r--r--r-- 1 1008 2001  106433894 Jul  8 01:37 scene0000_00_vh_clean.ply
-r--r--r-- 1 1008 2001   11673355 Jul  8 01:37 scene0000_00_vh_clean.segs.json
```
Writing a loader for those that would not be specific to a particular use case looks out of my reach, as I don't even know which files contain what data, or whether they all need to be processed (extracted, copied, loaded, ...) every time.

With this in mind, I would be in favour of implementing a good base focused on providing the dataset in a location that makes sense (with caching and the other features discussed), which is where I see the most gain in value, since the next step of plugging the path into the existing torchvision datasets/loaders is not much more work for the user (or for us, if we implement a layer on top of it). Once the data is in place, writing a dataset class for it should also not be much more work for the user (or us). That is, until we start using a better dataset format where applicable (maybe with something like bcachefs ;)).
@lebrice

> I agree, but I'd argue that we need to nail the first problem before moving on to the second.

Fair point, but in my mind the second problem is sort of a prerequisite to solving the first one. Possibly, though, I might be underestimating its relative complexity. I might try my hand at the general, non-torchvision problem and see how it goes.
@satyaog
You are quite right that making loaders for some datasets is beyond our expertise and/or may fail to match user requirements. In these situations, I agree with your assessment: what we want is pure filesystem/http tooling that merely takes care of moving the "stuff" to the right place. I think that is both the least and the most we can do for these datasets.
One thing apparent in your examples that I hadn't thought of is that the user may only need a subset of the files, in which case we may only want to copy these over.
I have some ideas about this that I might draft as code somewhere.
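For illustration only, here is what a subset filter could look like on the hypothetical `resolve` interface from earlier; the `files=` argument does not exist anywhere yet:

```python
# Only stage the files this job actually needs from the ScanNet scene.
path = milatools.datasets.resolve(
    template={"name": "restricted/scannet_users/scannet", "tmp": os.environ["SLURM_TMPDIR"]},
    sources=["mila:///network/datasets/{name}"],
    destinations=["mila://{tmp}/{name}"],
    files=["scans/scene0000_00/*.ply", "scans/scene0000_00/*.json"],
)
```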
Purpose

Provide an interface, akin to `from milatools.datasets.torch import ImageNet`, to acquire and use certain datasets in a location-aware way. For example, when running on the Mila cluster, it should copy ImageNet from its standard location in `/network/datasets` to the local scratch. Elsewhere, it should use the user-provided path (and download it if necessary).

Given the existing attempts in both #8 and #10 by @manuel-delverme and @delaunay, I'm going to take a step back and create this issue so that we have a central way to discuss this. I want to make sure we are all on the same page.
Requisites

- Mirrors the `torchvision` or `torchtext` interface, so all the user needs to do is swap out the import (see the example after this list).
- Supports important locations: the user's local machine, the Mila cluster, the CC cluster, possibly others in the future. For this reason I believe the code should be modular with respect to the concept of "location" or "environment".
- Supports important frameworks: for now this is arguably just PyTorch, but we need to be careful not to induce unnecessary dependencies on a single framework. If e.g. a large number of researchers move to Jax, we should be able to gracefully support it without clashing with PyTorch support.
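For the first requisite, the intended usage could look like this (a sketch; the exact module path is still up for discussion):

```python
# Before: plain torchvision, with a manually managed root directory
# from torchvision.datasets import ImageNet

# After: location-aware drop-in replacement
from milatools.datasets.torch import ImageNet

dataset = ImageNet(split="train")  # data ends up in the right place for the current environment
```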
Anyone is welcome to contribute to this discussion.