breuleux opened this issue 2 years ago:
Here is my opening suggestion for how to do it. Don't hesitate to criticize it :)
```
milatools
    torch
        datasets
            class CachedDataset
            class Split
            ...
            ImageFolder = milatools.env.get_current().get_builder("torchvision.datasets.ImageFolder")
            MNIST = milatools.env.get_current().get_builder("torchvision.datasets.MNIST")
            ...
    env
        def get_current
        class ComputeEnvironment
        mila
            class MilaEnvironment
                get_builder("torchvision.datasets.MNIST") = partial(milatools.torch.datasets.CachedDataset, src=self.dataroot / "MNIST", dst=self.scratch / "MNIST")
        cc
            class ComputeCanadaEnvironment
        ...
```
Basically, `milatools.torch.datasets` would just be shims that ask the current environment for the appropriate dataset constructor, plus some helpers like `CachedDataset`. On our cluster, `MilaEnvironment` would have the responsibility of mapping constructors such as `torchvision.datasets.MNIST` to an adapted or optimized version. Ultimately, `milatools.env.get_current` would be in charge of figuring out what the environment is.

Figuring out the environment: I believe the most robust way to determine whether we are running on the Mila cluster or not is to use the fully qualified domain name: `socket.getfqdn().endswith(".server.mila.quebec")`.
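To illustrate the detection, here is a minimal sketch of what `get_current` could look like; `LocalEnvironment` and the caching are assumptions on my part, not settled API:

```python
import socket
from functools import lru_cache


class ComputeEnvironment:
    """Base class for compute environments (sketch)."""


class MilaEnvironment(ComputeEnvironment):
    pass


class LocalEnvironment(ComputeEnvironment):
    pass


@lru_cache(maxsize=None)
def get_current() -> ComputeEnvironment:
    # The FQDN check is the heuristic suggested above; checks for other
    # clusters (e.g. Compute Canada) would be added here over time.
    if socket.getfqdn().endswith(".server.mila.quebec"):
        return MilaEnvironment()
    return LocalEnvironment()
```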
Environments we may want to support, in priority order
Copying the dataset: Someone please confirm, but my understanding is that we ideally want to copy from `/network/datasets/X` (or a user-defined location if the dataset is non-standard) to `$SLURM_TMPDIR/X`; if the dataset has to be untarred or unzipped, do it directly from the source into the destination.
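Something along these lines, as a rough sketch only; the helper name, the `.tar` naming convention, and the exact layout of `/network/datasets` are assumptions here:

```python
import os
import shutil
import tarfile
from pathlib import Path


def stage_dataset(name: str, src_root: str = "/network/datasets") -> Path:
    """Copy a dataset (or untar it directly) into $SLURM_TMPDIR and return the local path."""
    src = Path(src_root) / name
    dst = Path(os.environ["SLURM_TMPDIR"]) / name
    if dst.exists():
        return dst  # already staged earlier in the same job
    archive = src.with_suffix(".tar")
    if archive.exists():
        # Extract directly from the network filesystem into the node-local
        # destination, without materializing an intermediate copy of the archive.
        with tarfile.open(archive) as tar:
            tar.extractall(dst)
    else:
        shutil.copytree(src, dst)
    return dst
```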
Arbitrary datasets: We probably also want a generic system for arbitrary user datasets that are not in the official datasets list. Here's an idea for a possible low-level interface:
```python
path = milatools.datasets.resolve(
    template={"name": "X", "tmp": os.environ["SLURM_TMPDIR"]},
    # Places where the data can be found
    sources=[
        "local://../data/{name}",
        "mila:///network/datasets/{name}",
        "mila:///network/scratch/u/user/{name}",
        "https://download.site",
        "s3://blah",
        "magnet:?xt=abcdef",
    ],
    # Place where the data can be cached (if downloaded)
    cache=["mila:///network/scratch/u/user/{name}"],
    # Places where the data must go
    destinations=["local://../data/{name}", "mila://{tmp}/{name}"],
    # Whether the data should be untarred, unzipped, etc.
    unpack=True,
    # Optionally, md5 hash for verification
    md5="xyz",
)
data = ImageFolder(path)
```
I am sure the interface can be improved, but the basic idea is to let the user specify many paths for the data, separated into "sources" where it can be found, "cache" where the download can be cached, and "destinations" where it must be copied or unpacked for actual processing. In a local setting, the same path can be used for all roles. Each environment could fill in some paths by default, so that e.g. the `mila://` paths would be added automatically, and `local://` could be the assumed default.
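For instance, in a purely local setting the same directory could play all three roles (still using the hypothetical `resolve` interface above, so this is illustrative rather than working code):

```python
# Everything points at the same local folder; bare paths fall back to local://.
path = milatools.datasets.resolve(
    template={"name": "X"},
    sources=["../data/{name}"],
    cache=["../data/{name}"],
    destinations=["../data/{name}"],
    unpack=True,
)
```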
I would add an `env://` location so people can override our `local://` default to something sensible for their system.
I agree with the general ideas described above. Yes, I think we want to use `$SLURM_TMPDIR` as the destination whenever we can. One thing we should keep in mind is datasets that do not fit in `$SLURM_TMPDIR`, in which case the cache location could be used.
While I think it makes sense to return torchvision / PyTorch dataset instances whenever we can, we should not commit to that feature for all datasets. A lot of datasets do not follow PyTorch's standard structure, and some are sufficiently complex that I would not want to have to understand how the different auxiliary files interact with each other and which one is used in which case. So we should also allow users to retrieve the location of the dataset's root in the destination, so that custom dataset classes can be implemented by the users when necessary. We could allow users to then submit those classes as PRs, but this would add another layer of complexity to handle, as it is likely that some dataset classes would not generalize to all use cases, specifically when a dataset contains multiple data types (audio, video, images, sensor data, ...). In those cases, a well-documented collection of dataset classes could be necessary. So I see returning PyTorch datasets as a strong feature, but still something that could be implemented in a second step.
```python
ImageFolder = milatools.env.torchvision.datasets.ImageFolder
MNIST = milatools.env.torchvision.datasets.MNIST
```

Wouldn't the above be cleaner?
Later, users could go as far as:

```python
import milatools.env.torchvision as torchvision
```

and have to change no code (ideally).

`get_current()` is constant anyway, and we can do the `get_builder` lookup behind the scenes the first time.
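A minimal sketch of such a shim module, assuming the `get_builder` interface proposed above (the module path is illustrative):

```python
# milatools/env/torchvision/datasets.py (hypothetical shim)
from milatools.env import get_current

_env = get_current()  # effectively constant for the lifetime of the process

# Resolve the builders once, at import time, so user code can simply do
# `from milatools.env.torchvision.datasets import MNIST`.
ImageFolder = _env.get_builder("torchvision.datasets.ImageFolder")
MNIST = _env.get_builder("torchvision.datasets.MNIST")
```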
Figuring out the environment: I think it should be hidden away like it is here, and then we improve the heuristic as bugs/use cases come up: https://github.com/mila-iqia/milatools/blob/35676c3bdde67188e00dbb550d9e170081d8d383/milatools/utils.py#L4
About multi-framework support, we could have something like this in `milatools/datasets.py`:

```python
import warnings

try:
    import torch
except ImportError:
    warnings.warn(
        "`torch` is not installed. To have PyTorch support in milatools.datasets, "
        "please `pip install torch`."
    )
else:
    from milatools.datasets.torch import *
```

So that we just don't load what is not available; if they want to import the module anyway, they can use the full path, but that's on them.
@Delaunay

> I would add a env:// location so people can override our local:// default to something sensible for their system

I'm not sure what you mean. Also, I may have expressed myself poorly: what I meant by "`local://` could be the assumed default" was that it would be the default protocol if you just write `path/to/data`.
@satyaog

> While I think it makes sense to return torchvision / pytorch datasets instances whenever we can, we should not commit on that feature for all datasets.

I'm thinking that we should support the range of ways people typically use to load datasets. If some people use torchvision, we should have a compatible wrapper; if others do something custom with the data on disk, we should have something to help them ensure the data is in the proper place and return the dataset's final location; and so on.
@manuel-delverme

> Wouldn't the above be cleaner?

Possibly, but it depends on how the environments declare these things, e.g. whether they are attributes or keys in a dictionary. It's an implementation detail.

> Later the users can go as far as: `import milatools.env.torchvision as torchvision` and have to change no code (ideally)

True, that would be ideal. Or just `import milatools.torchvision as torchvision`.

> get_current() is anyway constant and we can do the get_builder behind the scenes the first time.

True, it could be done at the module level.
> I think it should be hidden away like here, and then we improve the heuristic as bugs/usecases come up

It would be hidden away for sure, but it's best to get it right the first time ;)
The main point I want to make about environments is that I would prefer to test for which cluster we are on, and use the result of that test to switch in the appropriate set of dataloaders wholesale from `milatools.env.mila`, `milatools.env.cc`, etc., than to, say, have a single `MNIST` function with a bunch of if statements. Alternatively, environments could offer a common interface that `MNIST` could use.
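As a sketch of the "wholesale" option (module names follow the layout at the top of this issue, but none of this is settled):

```python
# milatools/torch/datasets/__init__.py (hypothetical)
from milatools.env import get_current
from milatools.env.mila import MilaEnvironment

if isinstance(get_current(), MilaEnvironment):
    # On the Mila cluster: swap in the adapted/optimized constructors wholesale.
    from milatools.torch.datasets._mila import *
else:
    # Anywhere else: plain torchvision behaviour.
    from milatools.torch.datasets._default import *
```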
> About multi framework support we can have something of the kind ... `try: import torch; ... from milatools.datasets.torch import *`

That would suggest that the symbols available in `milatools.datasets` would vary depending on what modules you have installed. If there is any overlap in the symbols exported by, say, `milatools.datasets.torch` and `milatools.datasets.tensorflow`, then the mere action of installing TensorFlow in a venv which already has PyTorch (or vice versa) could cause unexpected breakage.
I think it's better to require the user to be explicit about what they want.
I suggest that we first focus our efforts on providing a drop-in replacement for the torchvision.datasets module, as was mentioned above, and make it public as a single, simple package on PyPI.
I suggest the following as design requirements:
I started coding up / brainstorming a few things around this; here's what I have so far: https://github.com/lebrice/milatools/tree/milavision/milavision
Please do let me know what you think.
@lebrice

I am somewhat agnostic about whether we should have a separate package or an omnibus one. I will note that putting everything together still won't amount to a large amount of code, and that we can separate optional dependencies as `pip install milatools[torchvision]`.
Also, torchvision is one thing, but we also have to address the general problem of dataset management: if I want to use data that can be downloaded from URI X (regardless of its format), how do I get it where it needs to be in all compute environments I want to run experiments in? Can I cache it? Where can I cache it? This is more or less what a milavision package has to do with the torchvision datasets, but it is more general than torchvision: someone who is using tf or jax for their project should have access to this functionality without having a torch dependency. But to avoid code duplication, wherever it is that we put this generic functionality, it's going to be a dependency of milavision. So now we have two packages.
> Support for all datasets of torchvision.datasets: If we don't have a dataset stored or there is any error, falls back to the default behaviour of torchvision.datasets

Even if we don't have a given dataset, we should still override the default behavior to download it to `$SLURM_TMPDIR` and possibly cache it in the user's scratch (possibly using a flag).
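A rough sketch of that fallback, using MNIST as an example; the scratch cache path and the `$SCRATCH` variable are assumptions on my part:

```python
import os
import shutil
from pathlib import Path

from torchvision.datasets import MNIST


def mnist_with_fallback(cache_in_scratch: bool = True) -> MNIST:
    """Download into $SLURM_TMPDIR, reusing a copy cached in the user's scratch."""
    tmp = Path(os.environ["SLURM_TMPDIR"])
    cache = Path(os.environ["SCRATCH"]) / "milatools_cache" / "MNIST"
    if cache.exists():
        # Seed the node-local copy from scratch to avoid re-downloading.
        shutil.copytree(cache, tmp / "MNIST", dirs_exist_ok=True)
    dataset = MNIST(root=tmp, download=True)
    if cache_in_scratch and not cache.exists():
        shutil.copytree(tmp / "MNIST", cache)
    return dataset
```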
> Also, torchvision is one thing, but we also have to address the general problem of dataset management

I agree, but I'd argue that we need to nail the first problem before moving on to the second.
@breuleux

> I'm thinking that we should support the range of ways people typically use to load datasets. If some people use torchvision, we should have a compatible wrapper, if others do something custom with the data on disk, we should have something to help them ensure the data is in the proper place and return the dataset's final location, and so on.

I'm more concerned about the level of commitment we aim for, not saying that we should not implement the loaders for the datasets we can. We could maybe have a look at the `ami` dataset on the cluster. That one looks to me like a good example of a dataset we are most likely not fit to implement a loader for, compared to what a user who knows it could do (and provide a PR, or a paper we could use as a reference, to implement a loader that makes sense). Another example would be the restricted ScanNet dataset (made by Stanford U, which, like ImageNet, aims to be the ImageNet of the robotics domain):
```
ortizgas@login-2:~$ ls -lL /network/datasets/restricted/scannet_users/scannet/scans/scene0000_00/
total 4071470
-r--r--r-- 1 1008 2001   61357526 Jul  8 01:37 scene0000_00_2d-instance-filt.zip
-r--r--r-- 1 1008 2001  103278606 Jul  8 01:37 scene0000_00_2d-instance.zip
-r--r--r-- 1 1008 2001   66389949 Jul  8 01:37 scene0000_00_2d-label-filt.zip
-r--r--r-- 1 1008 2001  112937124 Jul  8 01:37 scene0000_00_2d-label.zip
-r--r--r-- 1 1008 2001       8850 Jul  8 01:37 scene0000_00.aggregation.json
-r--r--r-- 1 1008 2001 3699853377 Jul  8 01:37 scene0000_00.sens
-r--r--r-- 1 1008 2001        702 Jul  8 01:37 scene0000_00.txt
-r--r--r-- 1 1008 2001     478809 Jul  8 01:37 scene0000_00_vh_clean_2.0.010000.segs.json
-r--r--r-- 1 1008 2001    3461577 Jul  8 01:37 scene0000_00_vh_clean_2.labels.ply
-r--r--r-- 1 1008 2001    3298819 Jul  8 01:37 scene0000_00_vh_clean_2.ply
-r--r--r-- 1 1008 2001       8850 Jul  8 01:37 scene0000_00_vh_clean.aggregation.json
-r--r--r-- 1 1008 2001  106433894 Jul  8 01:37 scene0000_00_vh_clean.ply
-r--r--r-- 1 1008 2001   11673355 Jul  8 01:37 scene0000_00_vh_clean.segs.json
```
Writing a loader for those that would not be specific to a particular use case looks out of my reach, as I don't even know which files contain what data, or whether they all need to be processed (extracted, copied, loaded, ...) every time.

With this in mind, I would be in favour of implementing a good base focused on providing the dataset in a location that makes sense (with caching and the other features discussed), which is where I see the most gain in value, since the next step of plugging the path into the existing torchvision datasets/loaders is not much more work for the user (or for us, if we implement a layer on top of it). Once the data is in place, writing a dataset class for it should also not be much more work for the user (or us). That is, until we start using a better dataset format where applicable (maybe with something like bcachefs ;)).
@lebrice

> I agree, but I'd argue that we need to nail the first problem before moving on to the second.

Fair point, but in my mind the second problem is sort of a prerequisite to solving the first one. Possibly, though, I might be underestimating its relative complexity. I might try my hand at the general, non-torchvision problem and see how it goes.
@satyaog
You are quite right that making loaders for some datasets is beyond our expertise and/or may fail to match user requirements. In these situations, I agree with your assessment: what we want is pure filesystem/http tooling that merely takes care of moving the "stuff" to the right place. I think that is both the least and the most we can do for these datasets.
One thing apparent in your examples that I hadn't thought of is that the user may only need a subset of the files, in which case we may only want to copy these over.
I have some ideas about this that I might draft as code somewhere.
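For illustration only, here is what a subset filter could look like on the hypothetical `resolve` interface from earlier; the `files=` argument does not exist anywhere yet:

```python
# Only stage the files this job actually needs from the ScanNet scene.
path = milatools.datasets.resolve(
    template={"name": "restricted/scannet_users/scannet", "tmp": os.environ["SLURM_TMPDIR"]},
    sources=["mila:///network/datasets/{name}"],
    destinations=["mila://{tmp}/{name}"],
    files=["scans/scene0000_00/*.ply", "scans/scene0000_00/*.json"],
)
```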
Purpose

Provide an interface, akin to `from milatools.datasets.torch import ImageNet`, to acquire and use certain datasets in a location-aware way. For example, when running on the Mila cluster, it should copy ImageNet from its standard location in `/network/datasets` to the local scratch. Elsewhere, it should use the user-provided path (and download it if necessary).

Given the existing attempts in both #8 and #10 by @manuel-delverme and @delaunay, I'm going to take a step back and create this issue so that we have a central way to discuss this. I want to make sure we are all on the same page.
Requisites

- Mirrors the `torchvision` or `torchtext` interface, so all the user needs to do is swap out the import (see the example after this list).
- Supports important locations: the user's local machine, the Mila cluster, the CC cluster, possibly others in the future. For this reason I believe the code should be modular with respect to the concept of "location" or "environment".
- Supports important frameworks: for now this is arguably just PyTorch, but we need to be careful not to induce unnecessary dependencies on a single framework. If e.g. a large number of researchers move to Jax, we should be able to gracefully support it without clashing with PyTorch support.
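For the first requisite, the intended usage could look like this (a sketch; the exact module path is still up for discussion):

```python
# Before: plain torchvision, with a manually managed root directory
# from torchvision.datasets import ImageNet

# After: location-aware drop-in replacement
from milatools.datasets.torch import ImageNet

dataset = ImageNet(split="train")  # data ends up in the right place for the current environment
```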
Anyone is welcome to contribute to this discussion.