iterative / cml

♾️ CML - Continuous Machine Learning | CI/CD for ML
http://cml.dev
Apache License 2.0

Failed to dvc pull on self-hosted Github runner #1351

Closed haimat closed 1 year ago

haimat commented 1 year ago

Recently we had an issue with our DVC remote storage that prevented us from pulling some files to a local machine. That issue has been solved, and we can run dvc pull locally without errors again. However, the errors persist in our GitHub action. When dvc pull runs there, we see a bunch of these messages on GitHub:

WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files
WARNING: No file hash info found for ...

However, when I connect to the very same machine that executes the GitHub action and run dvc pull manually, everything works fine. So it seems there must be some "leftover" state (perhaps in the Docker container?) that we use to run DVC/CML. How can I find the cause of these errors on the self-hosted GitHub runner?

This is what our GitHub workflow looks like:

name: Run DVC Experiment
on:
  push:
    paths:
      - ".github/workflows/cml-pipeline.yaml"
      - "data/**"
      - "src/dvc/**"
      - "params.yaml"
      - "dvc.*"
permissions:
  contents: write
  id-token: write
  pull-requests: write
jobs:
  train-and-report:
    runs-on: [self-hosted, gpu, cml]
    container:
      image: docker://iterativeai/cml:0-dvc2-base1-gpu
      options: --gpus all --ipc host
    steps:
      - uses: actions/checkout@v3
      - name: Train YOLO model
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          dvc remote add -d -f aime /data/dvc
          dvc pull
          dvc repro

haimat commented 1 year ago

OK, for the record: the problem was the dvc remote add command. The folder it points to, /data/dvc, was not accessible inside the Docker container; I needed to mount it via the GitHub workflow.
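
For anyone hitting the same warnings, here is a minimal sketch of what such a mount might look like. The exact change made to the workflow is not shown in this thread, so this is an assumption: it uses the standard `volumes` key of a GitHub Actions job container and assumes the host directory /data/dvc should appear at the same path inside the container. The `test -d` guard is an illustrative addition, not part of the original workflow.

```yaml
jobs:
  train-and-report:
    runs-on: [self-hosted, gpu, cml]
    container:
      image: docker://iterativeai/cml:0-dvc2-base1-gpu
      options: --gpus all --ipc host
      # Assumption: bind-mount the host's DVC remote directory into the
      # container at the same path, so the local remote below resolves.
      volumes:
        - /data/dvc:/data/dvc
    steps:
      - uses: actions/checkout@v3
      - name: Train YOLO model
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Illustrative guard: fail early if the mount is missing, instead
          # of letting dvc pull report missing cache files.
          test -d /data/dvc || { echo "/data/dvc is not mounted"; exit 1; }
          dvc remote add -d -f aime /data/dvc
          dvc pull
          dvc repro
```

Passing `-v /data/dvc:/data/dvc` in `options` should have the same effect. Either way, the point is that a local-path DVC remote must exist inside the container, not just on the runner host, which would explain why a manual dvc pull on the host succeeded while the action did not.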