iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

Pipeline is not executed for parameter with name `size` or `nfiles` #10296

Open shcheklein opened 8 months ago

shcheklein commented 8 months ago

Bug Report

Description

See this Stack Overflow question: https://stackoverflow.com/questions/77962532/dvc-using-cached-run-although-parameter-changed

Reproduce

Use this repo: https://github.com/shcheklein/test-dvc-so-77962532

Run the pipeline with `size` set to 30, then change it to 40, run `dvc status`, and run `dvc repro` again. DVC does not re-run the pipeline and prints this instead:

Stage 'data_ingestion' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'

To track the changes with git, run:

    git add dvc.lock

To enable auto staging, run:

    dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.

The output file's size stays the same, so the stage was not actually re-executed with the new value.

Logs

2024-02-08 20:48:29,997 DEBUG: v3.44.0 (pip), CPython 3.11.4 on macOS-13.3.1-arm64-arm-64bit
2024-02-08 20:48:29,998 DEBUG: command: /Users/ivan/Projects/test-dvc-so/.venv/bin/dvc repro -v
2024-02-08 20:48:30,158 DEBUG: Dependency 'params.yaml' of stage: 'data_ingestion' changed because it is '{'size': 'modified'}'.
2024-02-08 20:48:30,159 DEBUG: stage: 'data_ingestion' changed.
2024-02-08 20:48:30,159 DEBUG: Removing output 'artifacts/data_ingestion' of stage: 'data_ingestion'.
2024-02-08 20:48:30,160 DEBUG: Removing '/Users/ivan/Projects/test-dvc-so/artifacts/data_ingestion'
2024-02-08 20:48:30,161 DEBUG: {}
2024-02-08 20:48:30,161 DEBUG: defaultdict(<class 'dict'>, {'params.yaml': {'size': 'modified'}})
Stage 'data_ingestion' is cached - skipping run, checking out outputs
2024-02-08 20:48:30,163 DEBUG: Removing '/Users/ivan/Projects/test-dvc-so/artifacts/.COXZdYuRz3gn4oeArpSdWQ.tmp'
2024-02-08 20:48:30,164 DEBUG: Removing '/Users/ivan/Projects/test-dvc-so/artifacts/.COXZdYuRz3gn4oeArpSdWQ.tmp'
2024-02-08 20:48:30,164 DEBUG: Removing '/Users/ivan/Projects/test-dvc-so/.dvc/cache/files/md5/.wnNey-IUBNjTwgwkjkVJoQ.tmp'
2024-02-08 20:48:30,170 DEBUG: built tree 'object 3d7dd9c155ee06ec6ff8fa04e49f49fe.dir'
2024-02-08 20:48:30,170 DEBUG: Computed stage: 'data_ingestion' md5: '91baabba76b22d5f1480db2cfe105d8b'
2024-02-08 20:48:30,173 DEBUG: built tree 'object 3d7dd9c155ee06ec6ff8fa04e49f49fe.dir'
2024-02-08 20:48:30,173 DEBUG: Preparing to transfer data from 'memory://dvc-staging-md5/2b21226c06eec22f3477afe4c6de75a80828635723b703713230e4c3c4c39626' to '/Users/ivan/Projects/test-dvc-so/.dvc/cache/files/md5'
2024-02-08 20:48:30,173 DEBUG: Preparing to collect status from '/Users/ivan/Projects/test-dvc-so/.dvc/cache/files/md5'
2024-02-08 20:48:30,173 DEBUG: Collecting status from '/Users/ivan/Projects/test-dvc-so/.dvc/cache/files/md5'
2024-02-08 20:48:30,174 DEBUG: built tree 'object 3d7dd9c155ee06ec6ff8fa04e49f49fe.dir'
2024-02-08 20:48:30,174 DEBUG: Removing '/Users/ivan/Projects/test-dvc-so/artifacts/.z_523r89dhvz_hXD3vW61g.tmp'
2024-02-08 20:48:30,174 DEBUG: Removing '/Users/ivan/Projects/test-dvc-so/artifacts/.z_523r89dhvz_hXD3vW61g.tmp'
2024-02-08 20:48:30,174 DEBUG: Removing '/Users/ivan/Projects/test-dvc-so/.dvc/cache/files/md5/.OiTk5AM8wHoSOkDtcj45sA.tmp'
2024-02-08 20:48:30,175 DEBUG: Removing '/Users/ivan/Projects/test-dvc-so/artifacts/data_ingestion/test_data.csv'
2024-02-08 20:48:30,177 DEBUG: stage: 'data_ingestion' was reproduced
Updating lock file 'dvc.lock'

To track the changes with git, run:

    git add dvc.lock

To enable auto staging, run:

    dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
2024-02-08 20:48:30,182 DEBUG: Analytics is enabled.
2024-02-08 20:48:30,222 DEBUG: Trying to spawn ['daemon', 'analytics', '/var/folders/8f/fbysfztx1mb953_gpwl477p80000gn/T/tmpf_cyrru9', '-v']
2024-02-08 20:48:30,226 DEBUG: Spawned ['daemon', 'analytics', '/var/folders/8f/fbysfztx1mb953_gpwl477p80000gn/T/tmpf_cyrru9', '-v'] with pid 6119

Expected

Running the stage.

Environment information

(.venv) √ Projects/test-dvc-so % dvc version
DVC version: 3.44.0 (pip)
-------------------------
Platform: Python 3.11.4 on macOS-13.3.1-arm64-arm-64bit
Subprojects:
    dvc_data = 3.11.0
    dvc_objects = 5.0.0
    dvc_render = 1.0.1
    dvc_task = 0.3.0
    scmrepo = 3.1.0
Supports:
    http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3)
Config:
    Global: /Users/ivan/Library/Application Support/dvc
    System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/4883da32ce8435ea352f10b710b4a968

skshetry commented 8 months ago

Note that this only happens if the param name is `size` or `nfiles`. 😅

dberenbaum commented 8 months ago

@skshetry Do you have an idea for a fix? Or do we need to document these as reserved parameter names?

skshetry commented 8 months ago

We are recursively excluding `nfiles` and `size` before "hashing" for the stage cache, which is incorrect. But I have to think through what impact changing this could have. Most likely, we'll be able to remove `size` and `nfiles` only from outputs that are not parameter dependencies.

https://github.com/iterative/dvc/blob/953ae56536f03d915f396cd6cafd89aaa54fafc5/dvc/stage/cache.py#L33
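
To illustrate the collision described above, here is a minimal, self-contained sketch. It is not DVC's actual code: the stage dictionary layout, the `ingest.py` command, the placeholder hash value, and every function name below are assumptions made up for the example. It only shows how stripping `size` and `nfiles` at every nesting level makes two stages with different `size` parameters produce the same cache key, and how leaving the `params` subtree untouched (one possible direction, in the spirit of the comment above) avoids that.

```python
import hashlib
import json

# Assumption: the two key names come from this thread; everything else is hypothetical.
EXCLUDED_KEYS = {"size", "nfiles"}


def strip_recursively(value, exclude=EXCLUDED_KEYS):
    """Drop excluded keys at every nesting level (the problematic behaviour)."""
    if isinstance(value, dict):
        return {
            k: strip_recursively(v, exclude)
            for k, v in value.items()
            if k not in exclude
        }
    if isinstance(value, list):
        return [strip_recursively(v, exclude) for v in value]
    return value


def strip_except_params(value, exclude=EXCLUDED_KEYS):
    """Drop excluded keys, but copy anything under a 'params' key unchanged."""
    if isinstance(value, dict):
        return {
            k: (v if k == "params" else strip_except_params(v, exclude))
            for k, v in value.items()
            if k not in exclude
        }
    if isinstance(value, list):
        return [strip_except_params(v, exclude) for v in value]
    return value


def cache_key(stage, strip):
    """Hash the stripped stage description into a hex digest."""
    payload = json.dumps(strip(stage), sort_keys=True).encode()
    return hashlib.md5(payload).hexdigest()


def make_stage(size):
    """Build a made-up stage description with a parameter named 'size'."""
    return {
        "cmd": "python ingest.py",  # hypothetical command
        "params": {"params.yaml": {"size": size}},
        "outs": [
            {
                "path": "artifacts/data_ingestion",
                "md5": "placeholder.dir",  # placeholder, not a real hash
                "size": 123,
                "nfiles": 1,
            }
        ],
    }


# Recursive stripping also removes the user's 'size' parameter, so both keys match.
same = cache_key(make_stage(30), strip_recursively) == cache_key(
    make_stage(40), strip_recursively
)
# Leaving 'params' alone keeps the parameter value in the hash input.
different = cache_key(make_stage(30), strip_except_params) != cache_key(
    make_stage(40), strip_except_params
)
print(same, different)
```

In this sketch the first comparison is equal because the parameter value never reaches the hash, which matches the "Stage 'data_ingestion' is cached" message in the report. Whether the real fix can simply skip parameter dependencies is exactly the open question raised in the comment.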