When should we update dvc.lock and dvc push in GitHub Actions?

daavoo commented 3 years ago

Discussed in https://github.com/iterative/dvc/discussions/6542

^{Originally posted by **Hongbo-Miao** September 6, 2021} Currently, my GitHub Actions workflow looks like this: when I open a pull request (change some model codes / params), CML creates a AWS EC2 instance, and DVC pull the data. Here is my [current GitHub Actions workflow](https://github.com/Hongbo-Miao/hongbomiao.com/blob/078e097d71213f9a0e24b884c4d65fd78bf0ccfd/.github/workflows/cml.yaml#L53-L129):

Click to expand!

```yaml cml-cloud-set-up-cloud: name: CML (Cloud) - Set up cloud runs-on: ubuntu-20.04 steps: - name: Cancel previous runs uses: styfle/cancel-workflow-action@0.9.1 with: access_token: ${{ github.token }} - name: Checkout uses: actions/checkout@v2 - name: Set up CML uses: iterative/setup-cml@v1 - name: Set up cloud shell: bash env: REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }} AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} run: | cml-runner \ --cloud=aws \ --cloud-region=us-west-2 \ --cloud-type=t2.small \ --labels=cml-runner cml-cloud-train: name: CML (Cloud) - Train needs: cml-cloud-set-up-cloud runs-on: [self-hosted, cml-runner] # container: docker://iterativeai/cml:0-dvc2-base1-gpu container: docker://iterativeai/cml:0-dvc2-base1 steps: - name: Cancel previous runs uses: styfle/cancel-workflow-action@0.9.1 with: access_token: ${{ github.token }} - name: Checkout uses: actions/checkout@v2 - name: Set up Miniconda uses: conda-incubator/setup-miniconda@v2 with: miniconda-version: "latest" activate-environment: hm-cnn - name: Install requirements working-directory: convolutional-neural-network shell: bash -l {0} run: | conda install pytorch torchvision torchaudio --channel=pytorch conda install pandas conda install tabulate pip install -r requirements.txt - name: Pull Data working-directory: convolutional-neural-network env: AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} run: | dvc pull - name: Train model working-directory: convolutional-neural-network shell: bash -l {0} env: WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }} run: | dvc repro - name: Write CML report working-directory: convolutional-neural-network shell: bash -l {0} env: REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }} run: | echo "# CML (Cloud) Report" >> report.md echo "## Params" >> report.md cat output/reports/params.txt >> report.md cml-send-comment report.md ```

My [dvc.yaml](https://github.com/Hongbo-Miao/hongbomiao.com/blob/078e097d71213f9a0e24b884c4d65fd78bf0ccfd/convolutional-neural-network/dvc.yaml) looks like this: ```yaml stages: prepare: cmd: tar -xf data/raw/cifar-10-python.tar.gz --dir=data/processed deps: - data/raw/cifar-10-python.tar.gz outs: - data/processed/cifar-10-batches-py/ main: cmd: python main.py deps: - data/processed/cifar-10-batches-py/ - evaluate.py - main.py - model/ - train.py params: - lr - train.epochs outs: - output/models/model.pt ``` After training, if I think the change is good because the performance is better based on the reports, - the [dvc.lock](https://github.com/Hongbo-Miao/hongbomiao.com/blob/078e097d71213f9a0e24b884c4d65fd78bf0ccfd/convolutional-neural-network/dvc.lock) I feel needs to get update. - the new model `model.pt` needs to be uploaded to AWS S3 in my case. My question is, after `dvc repro`, am I supposed to add `dvc push` and then commit? Something like ```yaml - name: Train model working-directory: convolutional-neural-network shell: bash -l {0} env: WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }} run: | dvc repro dvc push # New added git add . # New added git commit -m "update dvc.lock, etc." # New added git push origin current_pr # New added, need somehow get the current pull request name ``` This above method will apply when the pull request is open. However, I kind of feeling the best moment adding would be when I decide merging because I think this is a good pull request that actually improves the machine learning performance. But at this moment, the EC2 instance has been destroyed. Any suggestion? Thanks!

daavoo commented 3 years ago

See previous discussion comments:

https://github.com/iterative/dvc/discussions/6542#discussioncomment-1287600

hongbo-miao commented 3 years ago

Just copy another comment to here:

I am more curious about the recommended way for this demo https://github.com/iterative/cml_cloud_case/blob/master/.github/workflows/cml.yaml as it is using AWS EC2 instance. However, in current GitHub action workflow, it does not do something like

dvc push
cml-pr .

In the experiment pull request, it also not update the dvc.lock, etc. files, which is why I came up this question. (If we use AWS EC2 instance all time based on this approach, it will end with dvc.lock never got update.

Would be great to have some recommendations! 😃

To be more specific, it would be great to cover these scenarios and still not mess up the DVC:

Opened a pull request, and the experiment result looks good, then merge.
Opened a pull request, and the experiment result does not look good, do a force push to update. In second time, the experiment looks good, then merge.
Opened a pull request, and the experiment result does not look good, add another commit to update. In second time, the experiment looks good, then squash merge.

mattlbeck commented 3 years ago

Another option that I was taught to do was use DVC's run-cache.

In the CML runner:

dvc repro
dvc push --run-cache

On local machine

dvc pull --run-cache
dvc repro --pull
# any further checks or analysis of results
git add .
git commit -m "commit experiment"

This is less automated than the new cml-pr but has the benefit that the developers are making the commits, if that is important.

sisp commented 2 years ago

I've been wondering about all these questions myself and haven't found a satisfying answer yet.

@mattlbeck Wouldn't dvc push --run-cache pollute the DVC remote storage? The amount of run-cache data will increase quickly over time. Perhaps some kind of garbage collection is needed.

dacbd commented 2 years ago

Relevant links for any future readers:

Why? I'm assuming the use case to be for rerunning or inspecting dvc exp/repro conducted from CI/CD (user can preform commit / or / or discard changes instead of CI)

iterative / cml.dev

When should we update dvc.lock and dvc push in GitHub Actions? #393

Discussed in https://github.com/iterative/dvc/discussions/6542