Closed legendof-selda closed 1 month ago
Are you able to provide a reproducible example? I'm not able to reproduce with https://github.com/iterative/example-get-started:
```
$ dvc repro --pull
WARNING: Failed to pull run cache: run-cache is not supported for http filesystem: https://remote.dvc.org/get-started
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data/data.xml'
Stage 'prepare' is cached - skipping run, checking out outputs
Stage 'featurize' is cached - skipping run, checking out outputs
Stage 'train' is cached - skipping run, checking out outputs
Stage 'evaluate' is cached - skipping run, checking out outputs
Use `dvc push` to send your updates to remote storage.
```
I forgot to upload the `dvc.yaml` file used:
```yaml
vars:
  - BLOB_URL: "<URL>"
  - data-version.toml

stages:
  download_full_parquet:
    desc: Download the exported parquet files
    params:
      - data/data-version.toml:
          - DATA_DIR
    cmd: >2
      <script to download from $BLOB_URL>
    wdir: ..
    outs:
      - data/full_output:
          persist: true
          cache: true

  download_dbg_parquet:
    desc: Download the exported dbg parquet files
    params:
      - data/data-version.toml:
          - DEBUG_DATA_DIR
    cmd: >2
      <script to download from $BLOB_URL>
    wdir: ..
    outs:
      - data/dbg_output:
          persist: true
          cache: true

  generate_full_json:
    desc: Generate the json intermediaries for the full dataset.
    cmd: python <script to generate json data> data/full_output data/full_json
    wdir: ../
    deps:
      - data/full_output
      - data/cached
      - .hashes/code.hash
      - poetry.lock
    outs:
      - data/full_json:
          cache: true
          persist: false

  generate_dbg_json:
    desc: Generate the json intermediaries for the debug dataset.
    cmd: python <script to generate json data> data/dbg_output data/dbg_json
    wdir: ../
    deps:
      - data/dbg_output
      - data/cached
      - .hashes/code.hash
      - poetry.lock
    outs:
      - data/dbg_json:
          cache: true
          persist: false

  generate_full_hdf5:
    desc: Transform the raw json files into hdf5 files the model can understand.
    cmd: python <script to transform json to hdf5> data/full_json data/full_wholesalers
    wdir: ../
    deps:
      - poetry.lock
      - .hashes/code.hash
      - data/full_json
      - data/fred/state_economics.json
      - data/fred/fred_signals_included.json
      - data/fred/national_economics.json
      - data/gas_price/output/state_gas_price.json
      - data/industry_trend/processed_trend.csv
      - data/state_region_mapping/state_region_mapping.json
    outs:
      - data/full_wholesalers:
          cache: true
          persist: true

  generate_dbg_hdf5:
    desc: Transform the raw debug json files into hdf5 files the model can understand.
    cmd: python <script to transform json to hdf5> data/dbg_json data/dbg_wholesalers
    wdir: ../
    deps:
      - poetry.lock
      - .hashes/code.hash
      - data/dbg_json
      - data/fred/state_economics.json
      - data/fred/fred_signals_included.json
      - data/fred/national_economics.json
      - data/gas_price/output/state_gas_price.json
      - data/industry_trend/processed_trend.csv
      - data/state_region_mapping/state_region_mapping.json
    outs:
      - data/dbg_wholesalers:
          cache: true
          persist: true
```
I think the issue is perhaps due to `persist: true`?

I assume that on pull, persisted outputs are downloaded. I have seen that stages set to `persist: true` require pulling the data first: if the output path's contents have changed and persist is set to true, dvc doesn't skip the stage, like when I forget to run `dvc checkout` while switching branches.

With `--pull`, does it pull the stages first, especially the persisted ones?
Ah, you figured it out. `persist: true` is the issue. When `persist: true`, it creates a circular dependency, where the output depends not only on the listed dependencies but also on the existing output. DVC has no way of tracking the contents of the previous iteration of that output, so it does not save that stage to the run-cache. If anything, I think the bug may be that DVC ever skips a stage with `persist: true`.
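The circular dependency above can be made concrete with a small sketch. This is not DVC's actual code; the key derivation and names are illustrative assumptions. The point is that a run-cache key derived only from the command and its declared dependencies cannot identify the result of a stage whose output also depends on the previous contents of that output:

```python
# Hypothetical sketch (not DVC internals) of why a persisted output
# cannot be keyed in the run-cache.
import hashlib

def run_cache_key(cmd: str, dep_hashes: dict) -> str:
    # Key depends only on the command and the declared dependency hashes.
    payload = cmd + "".join(f"{p}:{h}" for p, h in sorted(dep_hashes.items()))
    return hashlib.sha256(payload.encode()).hexdigest()

deps = {"data/full_output": "abc123", "poetry.lock": "def456"}
key = run_cache_key("python generate.py", deps)

# Two runs with identical cmd + deps produce the same key...
assert key == run_cache_key("python generate.py", deps)

# ...but with persist: true, the stage also reads whatever was already in
# the output directory, so the same key can correspond to different
# results. Restoring "the" output for this key would be ambiguous, which
# is why such a stage is never saved to the run-cache.
previous_output = {"run1": "old contents", "run2": "new contents"}
results = {run: f"derived from {prev}" for run, prev in previous_output.items()}
assert results["run1"] != results["run2"]  # same key, different outputs
```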
So is this by design? I assume that for `dvc repro --pull` we need to pull the stages that the workflow depends on? If persist is true, it should check out the cache that is currently in `dvc.lock`. This issue doesn't pop up if I do a normal `dvc pull` and then `dvc repro`.
> so is this by design?

I think so, although I agree it's confusing. There are two ways that dvc can skip a stage:

1. The stage and its outputs in the workspace are unchanged compared to `dvc.lock`.
2. The stage is found in the run-cache.

In `dvc repro --pull`, 1 fails for all your stages because there is nothing in your workspace. This is the difference from doing `dvc pull` first. At the time dvc checks whether the stage has changed, there is nothing in the workspace yet, so it looks like the data has been deleted.

However, 2 succeeds for all your stages except the persisted one, since that one was never saved to the run-cache for the reasons mentioned above.
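The two skip paths can be sketched as a small decision function. This is an illustrative model of the behaviour described above, not DVC's implementation; the dictionaries standing in for the workspace, `dvc.lock`, and run-cache are assumptions:

```python
# Illustrative model of the two ways a stage can be skipped, and why
# `dvc repro --pull` differs from `dvc pull` followed by `dvc repro`.

def should_skip(workspace: dict, lock: dict, run_cache: set, key: str) -> bool:
    # 1. Workspace matches dvc.lock -> stage is unchanged, skip.
    if workspace == lock:
        return True
    # 2. Stage found in the run-cache -> check out cached result, skip.
    if key in run_cache:
        return True
    return False

lock = {"data/full_json": "hash1"}
run_cache = {"stage-key"}  # persisted stages are never added here

# After `dvc pull`, the workspace matches dvc.lock: check 1 succeeds.
assert should_skip({"data/full_json": "hash1"}, lock, run_cache, "stage-key")

# With `dvc repro --pull` on an empty workspace, check 1 fails (the data
# looks deleted), but check 2 still rescues stages in the run-cache...
assert should_skip({}, lock, run_cache, "stage-key")

# ...while a persisted stage, absent from the run-cache, reruns.
assert not should_skip({}, lock, run_cache, "persist-stage-key")
```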
I think what you want is `dvc repro --pull --allow-missing`. That will make 1 succeed when the data is missing. Maybe it makes sense for `--pull` to imply `--allow-missing`.
Isn't `--allow-missing` for ignoring the "data doesn't exist in remote cache" error?

When it comes to the 2nd case: if the workspace is empty, then pull the data for the given hash key. If the workspace has different data, either rerun the stage or give an error saying the data is different, and we could provide a command flag or prompt input to force-pull the persisted stages. Similar to `dvc pull` when we have data that isn't tracked by dvc (like the times when we kill the process in the middle and have dvc changes that aren't tracked).
> isn't `--allow-missing` for ignoring the "data doesn't exist in remote cache" error?

Unfortunately, it is used to mean different things under different commands today. This is not the meaning for `repro`.
> if the workspace is empty then pull the data for the given hash key

This is what `--allow-missing` does, except that it skips the stage entirely and doesn't waste time pulling the data.
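A minimal sketch of how `--allow-missing` changes the first check, under the assumption (from the discussion above) that a missing path is treated as unchanged rather than as deleted. Function and parameter names are illustrative, not DVC's API:

```python
# Hedged sketch: how `--allow-missing` affects the "has the stage
# changed?" check when data is absent from the workspace.

def stage_changed(workspace: dict, lock: dict, allow_missing: bool) -> bool:
    for path, locked_hash in lock.items():
        current = workspace.get(path)
        if current is None:
            if allow_missing:
                continue  # missing data: assume unchanged, don't invalidate
            return True   # missing data counts as a change -> rerun
        if current != locked_hash:
            return True   # data present but different -> rerun
    return False

lock = {"data/full_output": "abc123"}

assert stage_changed({}, lock, allow_missing=False)      # default: rerun
assert not stage_changed({}, lock, allow_missing=True)   # skipped entirely
assert stage_changed({"data/full_output": "zzz"}, lock, allow_missing=True)
```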
Bug Report

repro: `dvc repro --pull` invalidates cache. Doing `dvc pull` and then `dvc repro` works.

Description

When I use `dvc repro --pull` on my dvc pipelines, it invalidates the cache for certain stages and runs them again. However, if we do `dvc pull` and then do `dvc repro`, it respects the cache and skips the stages. This is the behaviour I expect for `dvc repro --pull` as well. Also, for some reason the hash key of the stage outputs keeps changing on every run.

Reproduce

`dvc repro --pull`

Expected

`dvc repro --pull` must behave the same way as doing `dvc pull` and then `dvc repro`.

Environment information

Output of `dvc doctor`:

Additional Information (if any):

`dvc repro --pull data/dvc.yaml -v`: repro_pull_verbose.log