iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

"dvc pull -r REMOTE_A --allow-missing" fails out with multiple remotes #10561

Open spaghevin opened 2 days ago

spaghevin commented 2 days ago

Bug

Description

Hi all, I wanted to preface by thanking you for your time! I appreciate DVC and have been using it extensively for work. I am aware of this issue being fixed via PR #10365; however, I believe I am still seeing the unfixed behavior.

I want to expand my DVC setup so that instead of just one remote dataset, I now have two, and I want to be able to push and pull each remote dataset as specified, without operating on the other. For example, I want to pull one specified remote dataset without pulling the other.

Reproduce

1) Initializing DVC. Initialize DVC in the parent directory. The layout currently looks like this:

testing-pipeline
    - testing-data
        - dataset_1
            - data.txt
        - dataset_2
            - data.txt
        - dataset_1.dvc
        - dataset_2.dvc
    - config

Config looks like:

['remote "dataset_1"']
    url = s3://uri_1
['remote "dataset_2"']
    url = s3://uri_2
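
For reference, I believe each dataset's .dvc file looks roughly like this (trimmed sketch; the .dir hash shown is the one for dataset_1 reported in the error output below):

outs:
- md5: 2fe5d0783fc946c99f67d195c23f1894.dir
  path: dataset_1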

2) Initializing the datasets. I have tried adding and uploading all of dataset_1, then clearing the cache, then adding and uploading all of dataset_2 (so that only dataset_2's data is in the local cache). So:

dvc add dataset_1
dvc remote add dataset_1 s3://uri_1
dvc push -r dataset_1

*clear cache*

dvc add dataset_2
dvc remote add dataset_2 s3://uri_2
dvc push -r dataset_2
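
Presumably the same applies if the pushes are scoped with explicit targets (the paths here assume the commands are run from the testing-data directory, where the adds were run), e.g.:

dvc push -r dataset_1 dataset_1.dvc
dvc push -r dataset_2 dataset_2.dvc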

3) Pulling data (fail point). I want to pull down dataset_2 without pulling dataset_1.

dvc pull -r dataset_2 --allow-missing

Error(s):

1 file modified
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: 2fe5d0783fc946c99f67d195c23f1894.dir [MD5 hash for dataset_1]                                                                                                                                               
Fetching
Building workspace index                                                                                                                                       |6.00 [00:00,  820entry/s]
Comparing indexes                                                                                                                                              |6.00 [00:00, 7.25entry/s]
Applying changes                                                                                                                                               |1.00 [00:00,   183file/s]

M       dataset_2

1 file modified
ERROR: failed to pull data from the cloud - Checkout failed for following targets:

Dataset_1

Is your cache up to date?

and sometimes

ERROR: failed to pull data from the cloud - Can't remove the following unsaved files without confirmation. Use `--force` to force.

Dataset_1

Expected

I want it so that whenever I do

dvc pull -r dataset_2 --allow-missing

it will search for and download just dataset_2 and not dataset_1, and vice versa. Here, it is still searching for the other dataset's md5 hash and data.
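
To make the expectation concrete, the targeted form of what I am after would be something like

dvc pull -r dataset_2 --allow-missing dataset_2.dvc

but I would expect the plain -r form above to scope the pull the same way.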

Environment information

DVC version: 3.48.4 (pip)
-------------------------
Platform: Python 3.10.13 on macOS-14.6.1-arm64-arm-64bit
Subprojects:
        dvc_data = 3.14.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.7
Supports:
        http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.9.0, boto3 = 1.35.16)
Config:
        Global: /Users/kevin.a.liu/Library/Application Support/dvc
        System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s3s1
Caches: local
Remotes: s3, s3
Workspace directory: apfs on /dev/disk3s3s1
Repo: dvc (subdir), git
Repo.site_cache_dir: /Library/Caches/dvc/repo/e0ca0e896e992f768511dc35b6549560

Additional Information (if any):

Is there a way for me to do this? Or do I need to set up a separate DVC subdirectory for each dataset? Or are there implementations with dvc.yaml allowing me to do this? I am assuming it is going through each remote repo and doing a pull, so how do I isolate it to just the remote repo I have specified? Thanks again!
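
For example, I was imagining something along the lines of a per-output remote mapping in the .dvc file, roughly (a hypothetical sketch; I don't know if such a remote field exists):

outs:
- md5: 2fe5d0783fc946c99f67d195c23f1894.dir
  path: dataset_1
  remote: dataset_1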

spaghevin commented 2 days ago

Is it possible my version (DVC 3.48.4) didn't have the data sync updates from #10365, or am I going about multiple remotes the wrong way?

shcheklein commented 2 days ago

@spaghevin could you try the latest version?

spaghevin commented 1 day ago

@shcheklein Sure thing, and thanks for the response and time! It is greatly appreciated.

Unfortunately, I still get unexpected behavior, and also a new error. The unexpected behavior is that it seems to pull but gives a warning about missing md5 hashes for the other dataset, and it still merges in the other dataset's data.

The new error was obtained by emptying out the dataset folders (deleting the single file I was tracking with DVC inside each folder), then running:

dvc pull -r remote_dataset_2

Collecting                                                                                                                              |5.00 [00:01, 4.32entry/s]
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: 194577a7e20bdcc7afbb718f502c134c
md5: 7ab69e49282ec876429d9d76f90b00ae.dir [DVC hash for dataset_1]
md5: 603d3d7a58f314ffc93f5a92371b14e5
Fetching
Building workspace index                                                                                                                |6.00 [00:00,  494entry/s]
Comparing indexes                                                                                                                      |6.00 [00:00, 1.17kentry/s]
ERROR: failed to pull data from the cloud - Can't remove the following unsaved files without confirmation. Use `--force` to force.
~path/parent_folder/dataset_1/.DS_Store

I think the actual fetching part of dvc pull -r still works, as it only fetches the appropriate data into the cache. However, once the data is local in the cache, the pull then seems to want to merge in all data.

For example, if my cache is up to date with both dataset_1 and dataset_2 downloaded, but their respective local folders are empty, then dvc pull -r dataset_1 will automatically insert both dataset_1 and dataset_2. When my cache is empty instead (manual delete/fresh start) and the data folders are empty, dvc pull -r dataset_1 throws a warning that I am missing md5 hash data for dataset_2 but successfully pulls in dataset_1. (It was after that manual delete that I got the new error above; that is the first time I have seen it, and it is possibly from me not deleting the data properly.)

I don't know if this is the expected result. My thought would be that dvc pull -r dataset_1 would only fetch and then merge in dataset_1's data. Currently it seems to fetch only dataset_1's data but then tries to look for and merge all data.
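
In other words, my expectation is that dvc pull -r dataset_1 would be roughly equivalent to (a sketch, assuming targets work the way I think they do):

dvc fetch -r dataset_1 dataset_1.dvc    # download only dataset_1's objects from its remote
dvc checkout dataset_1.dvc              # link only dataset_1 into the workspace

rather than the checkout step also trying to restore (or complain about) dataset_2.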