Torchx integration

vwxyzjn commented 1 year ago

Description

Our current cloud integration is pretty hacky. I haven't seen anyone use it, and it has been a maintenance burden for us. A more managed utility for launching experiments in the cloud is desirable. There are two primary contenders; their pros and cons are:

torchx
- ✅ support for slurm
- ✅ support for running tasks locally
- ✅ the docker image is automatically pushed with a hash for AWS Batch
- ❌ still need to spin up cloud resources (e.g., AWS Batch), which is complicated but can be mitigated by using Terraform

skypilot
- ✅ support for managing spot instances and auto-resuming them
- ✅ pricing comparison
- ✅ debuggability via sky ssh mycluster
  - ✅ good for folks who don't always have a GPU machine
- ❌ need to wait for the clusters to be spun up

Both of them:

- ✅ support for AWS, GCP, Azure

This PR

This PR provides a better cloud integration utility by leveraging torchx. It should be a much cleaner solution for us, with the following benefits:

- we can deprecate our cloud utilities and release ourselves from their maintenance burden
- support for slurm, kubernetes, AWS Batch, GCP (https://github.com/pytorch/torchx/issues/410#issuecomment-1301186265) and others

Give it a try by running:

poetry run torchx run --scheduler local_docker utils.python --gpu 1 --script cleanrl/cleanrl.py
poetry run torchx run --scheduler aws_batch --scheduler_args queue=c5a-large,image_repo=vwxyzjn/cleanrl utils.python --script cleanrl/ppo.py
poetry run torchx status aws_batch://torchx/c5a-large:torchx_utils_python-pn9sx3wzq0qcwd
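
For anyone who wants to script these launches, here is a minimal sketch of how the commands above could be assembled in Python. The helper names `build_torchx_run` and `split_app_handle` are hypothetical (they are not part of this PR or of torchx), and the code only uses flags that appear in the commands above. The handle layout `scheduler://session/app_id` is an assumption read off the example `torchx status` line, not something this PR documents.

```python
import shlex


def build_torchx_run(scheduler, script, scheduler_args=None, gpu=None,
                     prefix=("poetry", "run")):
    """Assemble the argv for `torchx run ... utils.python --script <script>`.

    Hypothetical helper: it only mirrors the flags shown in the PR description.
    """
    cmd = [*prefix, "torchx", "run", "--scheduler", scheduler]
    if scheduler_args:
        # Scheduler options are passed as one comma-separated key=value string,
        # as in `--scheduler_args queue=c5a-large,image_repo=vwxyzjn/cleanrl`.
        joined = ",".join(f"{key}={value}" for key, value in scheduler_args.items())
        cmd += ["--scheduler_args", joined]
    cmd.append("utils.python")
    if gpu is not None:
        cmd += ["--gpu", str(gpu)]
    cmd += ["--script", script]
    return cmd


def split_app_handle(handle):
    """Split a handle such as `aws_batch://torchx/<app_id>` into three parts.

    Assumes the layout `scheduler://session/app_id` seen in the status example.
    """
    scheduler, rest = handle.split("://", 1)
    session, app_id = rest.split("/", 1)
    return scheduler, session, app_id


if __name__ == "__main__":
    local_cmd = build_torchx_run("local_docker", "cleanrl/ppo.py", gpu=1)
    batch_cmd = build_torchx_run(
        "aws_batch",
        "cleanrl/ppo.py",
        scheduler_args={"queue": "c5a-large", "image_repo": "vwxyzjn/cleanrl"},
    )
    # Print the shell-quoted commands; pass the lists to subprocess.run to launch.
    print(shlex.join(local_cmd))
    print(shlex.join(batch_cmd))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting issues; `shlex.join` is used only for display.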

Types of changes

[ ] Bug fix
[x] New feature
[ ] New algorithm
[ ] Documentation

Checklist:

[ ] I've read the CONTRIBUTION guide (required).
[ ] I have ensured pre-commit run --all-files passes (required).
[ ] I have updated the documentation and previewed the changes via mkdocs serve.
[ ] I have updated the tests accordingly (if applicable).

If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

[ ] I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
[ ] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
[ ] I have added additional documentation and previewed the changes via mkdocs serve.
- [ ] I have explained note-worthy implementation details.
- [ ] I have explained the logged metrics.
- [ ] I have added links to the original paper and related papers (if applicable).
- [ ] I have added links to the PR related to the algorithm variant.
- [ ] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
- [ ] I have added the learning curves (in PNG format).
- [ ] I have added links to the tracked experiments.
- [ ] I have updated the overview sections at the docs and the repo
[ ] I have updated the tests accordingly (if applicable).

vercel[bot] commented 1 year ago

The latest updates on your projects.

Name	Status	Preview	Updated
cleanrl	✅ Ready (Inspect)	Visit Preview	Jan 1, 2023 at 3:14PM (UTC)

vwxyzjn commented 1 year ago

Closing this for now. We will likely go with a slurm integration in the future, such as https://github.com/vwxyzjn/cleanba/blob/a61c51214d44cbfcc055c77676c351fdeeb5e6cc/benchmark.sh#L3-L13

vwxyzjn / cleanrl

Torchx integration #321
