pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Use torch generic workflow for CI, add ssh, artifacts #325

Closed wconstab closed 1 month ago

wconstab commented 1 month ago

Stack from ghstack (oldest at bottom):

This moves our CI over to the standard PyTorch CI job template (doc).

The general advantage is that we can more easily add features or options in a maintained way. A specific reason is that I was not able to ssh-debug on our old CI, and @seemethere mentioned that the 'generic workflow' is where the CI SSH support lives.

SSH

Use ssh just like pytorch/pytorch CI:

image

Artifacts Uploading

The job.dump_folder for each test is uniquely named, and the folders are bundled into an outputs.zip which can be downloaded from the GitHub Actions UI:

image
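As a rough illustration of the idea only (this is not the code from test_runner.py), the sketch below gives each test its own dump folder under a shared output directory and then zips that directory into a single archive. The helper names `dump_folder_for` and `bundle_outputs`, and the choice of naming each folder after the test, are assumptions made for the example. In the real CI the bundling may well be done by the workflow itself rather than by Python.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def dump_folder_for(output_dir: Path, test_name: str) -> Path:
    """Return a per-test dump folder under output_dir, creating it if needed.

    Hypothetical helper: naming the folder after the test keeps the outputs
    of different tests from overwriting each other.
    """
    folder = output_dir / test_name
    folder.mkdir(parents=True, exist_ok=True)
    return folder


def bundle_outputs(output_dir: Path, archive_base: Path) -> Path:
    """Zip everything under output_dir into <archive_base>.zip.

    archive_base should live outside output_dir so the archive does not
    end up trying to include itself.
    """
    archive = shutil.make_archive(str(archive_base), "zip", root_dir=output_dir)
    return Path(archive)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        outputs = Path(tmp) / "outputs"
        # Two made-up test names, each writing into its own dump folder.
        for name in ("test_a", "test_b"):
            (dump_folder_for(outputs, name) / "log.txt").write_text(f"{name} ok\n")
        archive = bundle_outputs(outputs, Path(tmp) / "outputs")
        with zipfile.ZipFile(archive) as zf:
            print(sorted(zf.namelist()))
```

The per-test value returned by `dump_folder_for` is what would be passed as job.dump_folder when launching each test, so every test's files land in a distinct subfolder of the final archive.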

To implement the artifacts upload, the following changes are made to test_runner.py