mosaicml / composer

Supercharge Your Model Training
http://docs.mosaicml.com
Apache License 2.0

Separate models/datasets from composer #171

Closed Landanjs closed 1 year ago

Landanjs commented 2 years ago

🚀 Feature Request

Move code for models (ResNet, GPT, U-Net, etc.) and datasets (ImageNet, BraTS, etc.) to a separate repository.

Motivation

Cory mentioned this to me a week ago and it has been bothering me, so I am curious to hear others' thoughts!

Currently, any new model + dataset pair Mosaic would like to test is added directly to the composer repository. Then, most internal researchers use python run_mosaic_trainer.py -f model_dataset.yaml for experimentation. Two points from Cory made me think this may not be best:

  1. Potentially different interfaces between internal and external users. External users will likely have their own datasets and models, and will set up a script using composer to fit their model. Internal users rely on the example script, creating a divergence in experience.
  2. When adding a new model / dataset, there is some discussion about whether we should use a model from another library (HuggingFace, TIMM, mmsegmentation, etc.). One concern is adding another dependency to composer that will need to be maintained. Separating models from composer would make it easier to use other libraries for models and would reduce composer's dependencies.

To me, composer is the best training loop and the best way to compose training algorithms to further improve the training loop. I'm not sure (at least for now) if we also want to try to build the best models and datasets while there are already several libraries providing extensive lists of models.

I think everyone may have an opinion on this, so feel free to contribute to the discussion! The people with the strongest opinion may be @hanlint, @ravi-mosaicml, @jbloxham, @moinnadeem, @A-Jacobson, @florescl, @abhi-mosaic, @coryMosaicML

growlix commented 2 years ago

I don't have a ton to add atm, other than that vissl might be a decent example of how to make it easy for users to bring their own dataset (also see their code structure)

ravi-mosaicml commented 2 years ago

Really like this idea to support models and datasets not defined in the composer registry. I think this gets at the fact that we need better registries (ModelRegistry, DatasetRegistry, AlgorithmRegistry).
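To make the idea concrete, here is a minimal sketch of what such a registry could look like. All names here (Registry, ModelRegistry, ResNet56Hparams) are assumptions for illustration, not existing composer APIs:

```python
# Hypothetical registry: a mapping from string names to hparams classes,
# with a decorator for registration. Not an existing composer API.
class Registry:
    def __init__(self):
        self._entries = {}

    def register(self, name):
        def decorator(cls):
            self._entries[name] = cls
            return cls
        return decorator

    def get(self, name):
        if name not in self._entries:
            raise KeyError(f"Unknown entry: {name!r}")
        return self._entries[name]


ModelRegistry = Registry()

@ModelRegistry.register("resnet56")
class ResNet56Hparams:
    def __init__(self, num_classes=10):
        self.num_classes = num_classes
```

A separate Registry instance per kind of object (models, datasets, algorithms) would keep lookups namespaced while sharing one implementation.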

From a CI/CD POV, it would be much easier if we could keep everything in one repo. So, for common models like ResNets and datasets like ImageNet, I would be for leaving those in composer. However, I don't think the trainer should treat models and datasets defined in the models and datasets folders differently than models and datasets defined somewhere else. More specifically, composer.trainer should only import base classes from composer.models and composer.datasets by default, not specific models or datasets.
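One way to read "import only base classes" is that the trainer should depend on an abstract interface rather than any concrete model. A sketch with hypothetical names (BaseComposerModel and Trainer here are illustrative, not the actual composer classes):

```python
from abc import ABC, abstractmethod

# Hypothetical base class: the trainer depends only on this interface,
# never on a concrete model such as a specific ResNet.
class BaseComposerModel(ABC):
    @abstractmethod
    def forward(self, batch):
        ...

class Trainer:
    def __init__(self, model: BaseComposerModel):
        # Accepts anything implementing the base interface, regardless of
        # whether it lives in composer's repo or in user code.
        self.model = model
```

Under this design, a model defined in composer's own models folder and one defined in a user's script are indistinguishable to the trainer.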

One design that comes to mind is that the user specifies a path to a python file to import (whether that be in the composer repo, someplace else on disk, an S3 URL, a remote git repo, etc...). This file would contain a ModelHparams class. We can dynamically import these classes at runtime. For example:

hparams.yaml

model:
    module: path/to/resnet56.py
    parameters:  # <- passed to path/to/resnet56.py:ModelHparams
        num_classes: 1000

trainer.py

import importlib.util

def create_from_hparams(trainer_hparams):
    # Load the user-specified python file. import_module expects a module
    # name, so use importlib.util to import from an arbitrary file path.
    spec = importlib.util.spec_from_file_location("user_model", trainer_hparams.model.module)
    model_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(model_module)
    ModelHparams = model_module.ModelHparams
    model_hparams = ModelHparams.create_from_dict(trainer_hparams.model.parameters)
    model = model_hparams.initialize_object()
    return model

And something similar for datasets.
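The dataset analog could reuse the same dynamic-import mechanism. A sketch, where DatasetHparams is a hypothetical class name mirroring the model example above:

```python
import importlib.util

def load_dataset_hparams(path, parameters):
    # Dynamically import a python file expected to define a DatasetHparams
    # class (hypothetical name), then construct it from the yaml parameters.
    spec = importlib.util.spec_from_file_location("user_dataset", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.DatasetHparams(**parameters)
```

The corresponding yaml would mirror the model section: a module path plus a parameters mapping passed to the class constructor.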

mvpatel2000 commented 1 year ago

Closing for now; we will eventually do this :)