spaghevin opened 2 days ago
Is it possible my version ("DVC version: 3.48.4") doesn't include the data sync updates from #10365, or am I going about multiple remotes the wrong way?
@spaghevin could you try the latest version?
@shcheklein Sure thing, and thanks for the response and time! It is greatly appreciated.
Unfortunately, I still get the unexpected behavior, and also a new error. The unexpected behavior is that the pull seems to succeed but warns about missing md5 hashes for the other dataset, and it still merges in the other dataset's data.
The new error was obtained by:

* Emptying out the dataset folders (deleting the single file I was tracking with DVC inside each folder).
* Then running `dvc pull -r remote_dataset_2`, which gave:
```
Collecting |5.00 [00:01, 4.32entry/s]
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: 194577a7e20bdcc7afbb718f502c134c
md5: 7ab69e49282ec876429d9d76f90b00ae.dir  [DVC hash for dataset_1]
md5: 603d3d7a58f314ffc93f5a92371b14e5
Fetching
Building workspace index |6.00 [00:00, 494entry/s]
Comparing indexes |6.00 [00:00, 1.17kentry/s]
ERROR: failed to pull data from the cloud - Can't remove the following unsaved files without confirmation. Use `--force` to force.
~path/parent_folder/dataset_1/.DS_Store
```
I think the actual fetching part of `dvc pull -r <remote>` is scoped correctly, but the merge/checkout step is not.

For example, if my cache is up to date with both the dataset_1 and dataset_2 downloads, but their respective local folders are empty, then `dvc pull -r remote_dataset_1` will automatically restore both dataset_1 and dataset_2. Then, when my cache is empty (manual delete / fresh start) and the data folders are also empty, `dvc pull -r remote_dataset_1` throws a warning that the md5 hash data for dataset_2 is missing, but successfully pulls in dataset_1. (After the manual delete I instead get the new error above; I don't really know what that error means, it's the first time I've seen it, and it may be from me not deleting the data properly?)

I don't know if this is the expected result. My thought would be that `dvc pull -r remote_dataset_1` would fetch and then merge in only dataset_1's data. Currently it seems to fetch only dataset_1's data, but then to look for and merge in all data.
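One thing I plan to try, in case the scoping has to be explicit, is passing a target along with the remote, since `dvc pull` accepts targets. A sketch (it assumes `dvc add` created `dataset_1.dvc` and `dataset_2.dvc` next to the folders, and that it is run from parent_folder):

```
# Restrict both the fetch and the checkout to one dataset's .dvc target
dvc pull -r remote_dataset_1 dataset_1.dvc

# Likewise for the other dataset
dvc pull -r remote_dataset_2 dataset_2.dvc
```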
Bug
Description
Hi all! I want to preface by thanking you for your time. I appreciate DVC and have been using it extensively for work. I am aware of this issue being fixed via PR #10365; however, I believe I am still seeing the unfixed behavior.
I want to expand my DVC implementation so that instead of just one remote dataset, I now have two, and I want to be able to push and pull each remote dataset as specified, without operating on the other. For example, I want to pull one specified remote dataset without pulling the other.
Reproduce
1) Initializing DVC. Initialize DVC in the parent directory. It currently looks like this:
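Roughly (names simplified; each dataset folder was added with `dvc add`, which created the .dvc files):

```
parent_folder/
├── .dvc/
├── dataset_1/        # contains the single tracked file
├── dataset_1.dvc
├── dataset_2/        # contains the single tracked file
└── dataset_2.dvc
```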
Config looks like
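In shape it is two remotes, one per dataset (the bucket URLs here are placeholders):

```
['remote "remote_dataset_1"']
    url = s3://my-bucket/dataset_1
['remote "remote_dataset_2"']
    url = s3://my-bucket/dataset_2
```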
2) Initializing the datasets. I have tried uploading all of dataset_1, then clearing the cache, then adding and uploading all of dataset_2 (so that only dataset_2's data is added).
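A sketch of that sequence (paths assumed from the layout above; `rm -rf .dvc/cache` stands in for however the cache was cleared):

```
dvc add dataset_1
dvc push -r remote_dataset_1

rm -rf .dvc/cache              # clear the local cache between uploads

dvc add dataset_2
dvc push -r remote_dataset_2
```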
3) Pulling data (fail point). I want to pull down dataset_2 without pulling dataset_1.
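That is, the pull pinned to the second remote, as in the transcript earlier in this thread:

```
dvc pull -r remote_dataset_2
```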
Error(s):
and sometimes
Expected
I want it so that whenever I do a `dvc pull -r remote_dataset_2`, it will search for and download just dataset_2 and not dataset_1, and the same thing vice versa. Here, it is still searching for the other dataset's md5 hash and data.
Environment information
Additional Information (if any):
Is there a way for me to do this? Or do I need to set up a subdir DVC project for each dataset? Or is there a dvc.yaml arrangement that allows me to do this? I am assuming the pull is going through each remote repo, so how do I isolate it to just the remote I have specified? Thanks again!
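For completeness, the subdir idea above would be something like this (a sketch only; I have not tried it, and the URL is a placeholder):

```
# Give each dataset its own DVC project inside the same git repo
cd parent_folder/dataset_1
dvc init --subdir
dvc remote add -d remote_dataset_1 s3://my-bucket/dataset_1

# Pulls run from inside dataset_1/ would then only touch this project's data
dvc pull
```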