iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

`dvc pull -r "remote_a_data"` / `dvc.api.repo.pull(remote="remote_a_data")`: both also try to pull from another remote, "remote_a_model" #10458

Open spaghevin opened 2 weeks ago

spaghevin commented 2 weeks ago

Bug Report


Description

I want to preface by saying that I followed (and posted on) issues #10365, #2825, and #8298. I really appreciated the strong feedback/advice on those issues! I am now back with a similar error, and I am unsure whether this is a bug report, a feature request, or a gap in my own understanding, but I am leaning towards the latter.

Reproduce

I have multiple online AWS S3 buckets that I save data and models to. Let's say I have 5 model/data pairings in total, with remotes named `remote_{a-e}_{data|model}`, where 'a' through 'e' identifies which model/data pairing it is, and the data/model suffix identifies the data or model folder respectively.

I am setting up a training script that takes a parameter a-e and downloads the corresponding dataset using DVC before making a training run. This folder is also moved into a Docker image, so I set it up with no SCM/git tracking. An example of my folder setup is:

```
training-run
├── .dvc
│   ├── .gitignore
│   └── .config
├── data_a
│   ├── .gitignore
│   └── data_folder.dvc
├── model_a
│   ├── .gitignore
│   └── model.pt.dvc
├── data_b
└── ...and so on
```

The dvc config file is:

```ini
['remote "remote_a_data"']
    url = s3://a/data
['remote "remote_a_model"']
    url = s3://a/model
```

data_folder.dvc is:

```yaml
outs:
- md5: ~.dir
  size: ~
  nfiles: ~
  hash: md5
  path: data_folder
  remote: remote_a_data
```

model.pt.dvc is:

```yaml
outs:
- md5: ~    # (no .dir(?))
  size: ~
  hash: md5
  path: model.pt
  remote: remote_a_model
```

I then use a Python script to pull from these remotes:

```python
from dvc.repo import Repo

repo = Repo(".")
repo.pull(remote="remote_a_data")
repo.pull(remote="remote_a_model")
```

and when that didn't work I tried

```shell
dvc pull -r "remote_a_data"
```

Every time, no matter what order I ran the two commands in, the first one would fail. It would download its respective data, but for some reason it would still try to check out the other remote's data, and when that data wasn't found in the cache, it would fail.

So if I ran `repo.pull(remote="remote_a_model")`, it would give me:

```
Collecting
Fetching
Building workspace index
Comparing Indexes
Applying Changes
Traceback (most recent call last):
  File ".../python3.10/site-packages/dvc/repo/checkout.py", line 184, in checkout
    raise CheckoutError([relpath(out_path) for out_path in failed], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets: data/data_folder
Is your cache up to date?
```

It would still download the model, but then fail while, for some reason, trying to check out the data folder. If I do the same thing using the CLI instead of the API, with `dvc pull -r "remote_a_model"`, it gives me:

```
Collecting
Fetching
Building workspace index
Comparing Indexes
Applying Changes
A       model_a/model.pt
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets: data_a/data_folder
Is your cache up to date?
  File ".../python3.10/site-packages/dvc/repo/checkout.py", line 184, in checkout
    raise CheckoutError([relpath(out_path) for out_path in failed], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets: data/data_folder
Is your cache up to date?
```

Expected

I expect DVC to pull JUST from the remote I am calling, without checking for the other remote's folder.

Environment information

I am manually starting in a new container every time, so the cache and tmp folders are never initialized and always empty!

Output of `dvc doctor`:

    Platform: Python 3.10.12 on Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

Subprojects:

    dvc_data = 3.15.1

    dvc_objects = 5.1.0

    dvc_render = 1.0.2

    dvc_task = 0.4.0

    scmrepo = 3.3.5

Supports:

    http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),

    https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),

    s3 (s3fs = 2024.6.0, boto3 = 1.34.106)

Config:

    Global: /home/{user}/.config/dvc

    System: /etc/xdg/dvc

Cache types: hardlink, symlink

Cache directory: 9p on C:\

Caches: local

Remotes: s3, s3

Workspace directory: 9p on C:\

Repo: dvc (subdir), git

Thanks a lot!

dberenbaum commented 1 week ago

When you run `repo.pull(remote="remote_a_data")`, DVC will only pull from that remote, but it will still fail for any data that is set to another remote. To get around this, you can do `repo.pull(remote="remote_a_data", allow_missing=True)`. You might also want to keep an eye on the idea here to add an option so that data is skipped by default unless it is explicitly passed as a target. I think it's worth revisiting this whole behavior in a future major release so that DVC doesn't fail for data that is set to a different remote than the one being pulled.
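The workaround above can be sketched as a small helper. A hedged sketch: the helper name and `repo_dir` parameter are my own, and it assumes a DVC version whose `Repo.pull` accepts `allow_missing`, as shown in the comment.

```python
def pull_one_remote(remote_name, repo_dir="."):
    """Pull only the data stored on `remote_name`, skipping outputs whose
    data lives on a different remote instead of failing on them.

    Sketch based on the comment above; `pull_one_remote` and `repo_dir`
    are hypothetical names, not part of the DVC API.
    """
    # Import inside the function so the sketch can be loaded without dvc.
    from dvc.repo import Repo

    repo = Repo(repo_dir)
    # allow_missing=True keeps checkout from raising CheckoutError for
    # hashes that were never fetched because they belong to another remote.
    repo.pull(remote=remote_name, allow_missing=True)

# Usage (from inside the training-run repo):
# pull_one_remote("remote_a_data")
# pull_one_remote("remote_a_model")
```

Calling this once per remote should leave each pull responsible only for its own bucket, rather than one pull failing on the other remote's outputs.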