Closed legendof-selda closed 1 month ago
Are you able to provide a reproducible example? I'm not able to reproduce with https://github.com/iterative/example-get-started:
```
$ dvc repro --pull
WARNING: Failed to pull run cache: run-cache is not supported for http filesystem: https://remote.dvc.org/get-started
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data/data.xml'
Stage 'prepare' is cached - skipping run, checking out outputs
Stage 'featurize' is cached - skipping run, checking out outputs
Stage 'train' is cached - skipping run, checking out outputs
Stage 'evaluate' is cached - skipping run, checking out outputs
Use `dvc push` to send your updates to remote storage.
```
I forgot to upload the `dvc.yaml` file used:
```yaml
vars:
  - BLOB_URL: "<URL>"
  - data-version.toml

stages:
  download_full_parquet:
    desc: Download the exported parquet files
    params:
      - data/data-version.toml:
          - DATA_DIR
    cmd: >2
      <script to download from $BLOB_URL>
    wdir: ..
    outs:
      - data/full_output:
          persist: true
          cache: true

  download_dbg_parquet:
    desc: Download the exported dbg parquet files
    params:
      - data/data-version.toml:
          - DEBUG_DATA_DIR
    cmd: >2
      <script to download from $BLOB_URL>
    wdir: ..
    outs:
      - data/dbg_output:
          persist: true
          cache: true

  generate_full_json:
    desc: Generate the json intermediaries for the full dataset.
    cmd: python <script to generate json data> data/full_output data/full_json
    wdir: ../
    deps:
      - data/full_output
      - data/cached
      - .hashes/code.hash
      - poetry.lock
    outs:
      - data/full_json:
          cache: true
          persist: false

  generate_dbg_json:
    desc: Generate the json intermediaries for the debug dataset.
    cmd: python <script to generate json data> data/dbg_output data/dbg_json
    wdir: ../
    deps:
      - data/dbg_output
      - data/cached
      - .hashes/code.hash
      - poetry.lock
    outs:
      - data/dbg_json:
          cache: true
          persist: false

  generate_full_hdf5:
    desc: Transform the raw json files into hdf5 files the model can understand.
    cmd: python <script to transform json to hdf5> data/full_json data/full_wholesalers
    wdir: ../
    deps:
      - poetry.lock
      - .hashes/code.hash
      - data/full_json
      - data/fred/state_economics.json
      - data/fred/fred_signals_included.json
      - data/fred/national_economics.json
      - data/gas_price/output/state_gas_price.json
      - data/industry_trend/processed_trend.csv
      - data/state_region_mapping/state_region_mapping.json
    outs:
      - data/full_wholesalers:
          cache: true
          persist: true

  generate_dbg_hdf5:
    desc: Transform the raw debug json files into hdf5 files the model can understand.
    cmd: python <script to transform json to hdf5> data/dbg_json data/dbg_wholesalers
    wdir: ../
    deps:
      - poetry.lock
      - .hashes/code.hash
      - data/dbg_json
      - data/fred/state_economics.json
      - data/fred/fred_signals_included.json
      - data/fred/national_economics.json
      - data/gas_price/output/state_gas_price.json
      - data/industry_trend/processed_trend.csv
      - data/state_region_mapping/state_region_mapping.json
    outs:
      - data/dbg_wholesalers:
          cache: true
          persist: true
```
I think the issue is perhaps due to `persist: true`?

I assume that on pull, persisted outputs are downloaded. I have seen that stages set to `persist: true` require pulling the data first: if the output path's contents have changed and persist is set to true, dvc doesn't skip the stage, like when I forget to run `dvc checkout` while switching branches.

With `--pull`, does it pull the stages first, especially the persisted ones?
Ah, you figured it out. `persist: true` is the issue. When `persist: true`, it creates a circular dependency, where the output depends not only on the listed dependencies but also on the existing output. DVC has no way of tracking the contents of the previous iteration of that output, so it does not save that stage to the run-cache. If anything, I think the bug may be that DVC ever skips a stage with `persist: true`.
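The circular dependency above can be made concrete with a small sketch. This is not DVC's actual code; the key derivation and names are illustrative assumptions. The point is that a run-cache key derived only from the command and its declared dependencies cannot identify the result of a stage whose output also depends on the previous contents of that output:

```python
# Hypothetical sketch (not DVC internals) of why a persisted output
# cannot be keyed in the run-cache.
import hashlib

def run_cache_key(cmd: str, dep_hashes: dict) -> str:
    # Key depends only on the command and the declared dependency hashes.
    payload = cmd + "".join(f"{p}:{h}" for p, h in sorted(dep_hashes.items()))
    return hashlib.sha256(payload.encode()).hexdigest()

deps = {"data/full_output": "abc123", "poetry.lock": "def456"}
key = run_cache_key("python generate.py", deps)

# Two runs with identical cmd + deps produce the same key...
assert key == run_cache_key("python generate.py", deps)

# ...but with persist: true, the stage also reads whatever was already in
# the output directory, so the same key can correspond to different
# results. Restoring "the" output for this key would be ambiguous, which
# is why such a stage is never saved to the run-cache.
previous_output = {"run1": "old contents", "run2": "new contents"}
results = {run: f"derived from {prev}" for run, prev in previous_output.items()}
assert results["run1"] != results["run2"]  # same key, different outputs
```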
So is this by design? I assume that for `dvc repro --pull` we need to pull the stages that the workflow depends on? If persist is true, it should check out the cache that is currently in `dvc.lock`. This issue doesn't pop up if I do a normal `dvc pull` and then `dvc repro`.
> so is this by design?

I think so, although I agree it's confusing. There are two ways that dvc can skip a stage:

1. The stage and its outputs in the workspace are unchanged compared to `dvc.lock`.
2. The stage is found in the run-cache.

In `dvc repro --pull`, 1 fails for all your stages because there is nothing in your workspace. This is the difference from doing `dvc pull` first. At the time dvc checks whether the stage has changed, there is nothing in the workspace yet, so it looks like the data has been deleted.

However, 2 succeeds for all your stages except the persisted one, since that one was never saved to the run-cache for the reasons mentioned above.
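The two skip paths can be sketched as a small decision function. This is an illustrative model of the behaviour described above, not DVC's implementation; the dictionaries standing in for the workspace, `dvc.lock`, and run-cache are assumptions:

```python
# Illustrative model of the two ways a stage can be skipped, and why
# `dvc repro --pull` differs from `dvc pull` followed by `dvc repro`.

def should_skip(workspace: dict, lock: dict, run_cache: set, key: str) -> bool:
    # 1. Workspace matches dvc.lock -> stage is unchanged, skip.
    if workspace == lock:
        return True
    # 2. Stage found in the run-cache -> check out cached result, skip.
    if key in run_cache:
        return True
    return False

lock = {"data/full_json": "hash1"}
run_cache = {"stage-key"}  # persisted stages are never added here

# After `dvc pull`, the workspace matches dvc.lock: check 1 succeeds.
assert should_skip({"data/full_json": "hash1"}, lock, run_cache, "stage-key")

# With `dvc repro --pull` on an empty workspace, check 1 fails (the data
# looks deleted), but check 2 still rescues stages in the run-cache...
assert should_skip({}, lock, run_cache, "stage-key")

# ...while a persisted stage, absent from the run-cache, reruns.
assert not should_skip({}, lock, run_cache, "persist-stage-key")
```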
I think what you want is `dvc repro --pull --allow-missing`. That will make 1 succeed when the data is missing. Maybe it makes sense for `--pull` to imply `--allow-missing`.
Isn't `--allow-missing` for ignoring the "data doesn't exist in remote cache" error?

When it comes to the 2nd case: if the workspace is empty, then pull the data for the given hash key. If the workspace has different data, either rerun the stage or give an error saying the data is different, and we could provide a command flag or prompt input to force-pull the persisted stages. Similar to `dvc pull` when we have data that isn't tracked by dvc (like the times when we kill the process in the middle and have dvc changes that aren't tracked).
> isn't `--allow-missing` for ignoring the "data doesn't exist in remote cache" error?

Unfortunately, it is used to mean different things under different commands today. This is not the meaning for `repro`.
> if the workspace is empty then pull the data for the given hash key

This is what `--allow-missing` does, except that it skips the stage entirely and doesn't waste time pulling the data.
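A minimal sketch of how `--allow-missing` changes the first check, under the assumption (from the discussion above) that a missing path is treated as unchanged rather than as deleted. Function and parameter names are illustrative, not DVC's API:

```python
# Hedged sketch: how `--allow-missing` affects the "has the stage
# changed?" check when data is absent from the workspace.

def stage_changed(workspace: dict, lock: dict, allow_missing: bool) -> bool:
    for path, locked_hash in lock.items():
        current = workspace.get(path)
        if current is None:
            if allow_missing:
                continue  # missing data: assume unchanged, don't invalidate
            return True   # missing data counts as a change -> rerun
        if current != locked_hash:
            return True   # data present but different -> rerun
    return False

lock = {"data/full_output": "abc123"}

assert stage_changed({}, lock, allow_missing=False)      # default: rerun
assert not stage_changed({}, lock, allow_missing=True)   # skipped entirely
assert stage_changed({"data/full_output": "zzz"}, lock, allow_missing=True)
```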
Bug Report

repro: `dvc repro --pull` invalidates cache. Doing `dvc pull` and then `dvc repro` works.

Description

When I use `dvc repro --pull` on my dvc pipelines, it invalidates the cache for certain stages and runs them again. However, if we do `dvc pull` and then do `dvc repro`, it respects the cache and skips the stages. This is the behaviour I expect for `dvc repro --pull` as well. Also, for some reason the hash key of the stage outputs keeps changing on every run.

Reproduce

`dvc repro --pull`

Expected

`dvc repro --pull` must behave the same way as doing `dvc pull` and then `dvc repro`.

Environment information

Output of `dvc doctor`:

Additional Information (if any):

`dvc repro --pull data/dvc.yaml -v`: repro_pull_verbose.log