pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Enable OnDemand for Open Source CI #92838

Open drisspg opened 1 year ago

drisspg commented 1 year ago

🚀 The feature, motivation and pitch

A Common Scenario

You submit a PR to PyTorch and a test fails on a CI machine that is significantly different from your local coding environment.

Instead of using with-ssh, you can quickly fire up a GitHub Codespaces or other OnDemand instance that has your code checked out, and you are free to debug without the time limit.

Alternatives

An alternative solution is to tag the failing pull request with the label with-ssh. This allows a user to SSH into the CI machine and gives them 2 hours (I think more time can be requested) to try to debug the error. This is far better than nothing, but for non-Vim users, and in terms of general UX, it can be hard to complete the required debugging work.

Additional context

Besides debugging, this could also be very helpful for first-time contributors, who would get a ready-made environment that has been streamlined for PyTorch development.

ZainRizvi commented 1 year ago

Next step: Talk to GitHub to see if our current hardware can be plugged into this system.

ZainRizvi commented 1 year ago

GitHub Codespaces officially doesn't support this, per their FAQ, but I reached out to them to see if there's something that can be done.

@drisspg, is this the OnDemand you were suggesting as a potential alternative?

drisspg commented 1 year ago

Ohh I actually wasn't sure of another alternative

malfet commented 1 year ago

Should we start small and figure out how to do this for CPU runners (Linux, Windows, macOS)?

ZainRizvi commented 1 year ago

Q for @drisspg: Would you find a CPU-only Linux runner valuable?

Trying to see how minimal we can go while still providing value. For example, if we offered a Linux Docker container with [your-choice-of-env-config] set up and with simple instructions for how to build/test PyTorch locally inside it, would that take care of a significant chunk of your needs?

drisspg commented 1 year ago

For my particular use case I don't think that would be very valuable. At least for me, the two most common hard-to-debug CI/CD issues come from Windows builds or GPU runners. I do my development on a Linux machine, so easy access to a Windows dev environment would be great, although this might just be particular to my case. I do think that the combinatorics of different GPU hardware make those failures the hardest to reproduce locally if your code is failing only on CUDA device X.

BUT depending on access rights, I could see this being very helpful. Specifically, it could help first-time contributors and, depending on how easy it is to use, maybe people fixing smallish issues.

drisspg commented 1 year ago

Update: as part of my BE project, I plan to pick this up more fully and explore the different options available. I will post regular updates here.