iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

dvc pull --glob tries to pull dvcignored files and glob pattern is only applied to files that already exist #5864

Open kevinhaybach opened 3 years ago

kevinhaybach commented 3 years ago

As mentioned in the title, I have found two issues:

  1. dvc pull --glob "*.json" tries to pull dvcignored files
  2. dvc pull --glob "*.json" ignores the pattern and pulls everything

Explanation 1: Let's assume I have a folder with two JSON files, file_1.json and file_2.json, and file_1.json is listed in .dvcignore.

I run the command: dvc pull --glob "*.json"

The result is: "ERROR: failed to pull data from the cloud - 'file_1.json' does not exist as an output or a stage name in 'dvc.yaml': 'dvc.yaml' does not exist"

Explanation 2: Let's assume I have a folder with one JSON file and one PNG file, but the files themselves are NOT in the current directory yet; only their .dvc files are.

I run the command: dvc pull --glob "*.json"

I would expect that only the .json file is pulled, but the .png file is pulled as well.

I got the following explanation: "The problem is that the glob pattern is only applied to files that are already in your local workspace (so it works for pulling updated versions but not for pulling new files). If you haven't pulled anything yet, it will return an empty list of pattern matches, and then DVC falls back to the default "pull everything" behavior."
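
To make the quoted explanation concrete, here is a minimal Python sketch of the behavior it describes. It is only a model and not DVC's actual code: `resolve_targets` and `files_to_pull` are hypothetical names, and the workspace and tracked outputs are passed in as plain lists and sets.

```python
from fnmatch import fnmatchcase


def resolve_targets(pattern, workspace_files):
    """Hypothetical model: the pattern is matched only against files
    that already exist in the local workspace."""
    return [name for name in workspace_files if fnmatchcase(name, pattern)]


def files_to_pull(pattern, workspace_files, tracked_outputs):
    """Hypothetical model of the fallback described in the quote."""
    targets = resolve_targets(pattern, workspace_files)
    if not targets:
        # No matches looks the same as "no targets given",
        # so everything that is tracked gets pulled.
        return sorted(tracked_outputs)
    return sorted(targets)


tracked = {"data.json", "image.png"}

# Fresh clone: only the .dvc files exist, so "*.json" matches nothing.
fresh_clone = ["data.json.dvc", "image.png.dvc"]
print(files_to_pull("*.json", fresh_clone, tracked))

# After a first pull the data files exist, so the pattern has something to match.
after_pull = ["data.json", "data.json.dvc", "image.png", "image.png.dvc"]
print(files_to_pull("*.json", after_pull, tracked))
```

In this model the fresh clone ends up pulling both files, which matches what I see in Explanation 2.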

pmrowla commented 3 years ago

discord context: https://discord.com/channels/485586884165107732/485596304961962003/834702940912222238 https://discord.com/channels/485586884165107732/485596304961962003/834707472572481586

skshetry commented 3 years ago

After https://github.com/iterative/dvc/pull/5273 is merged, we'll be able to glob actual outputs instead of the workspace.
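
As a rough illustration of what globbing outputs instead of the workspace could look like, here is a Python sketch. It is not taken from the linked PR: `declared_outputs` and `glob_outputs` are hypothetical helpers, and the sketch assumes each `name.dvc` file tracks a single output called `name`.

```python
from fnmatch import fnmatchcase

DVC_SUFFIX = ".dvc"


def declared_outputs(workspace_files):
    """Assumption: each 'name.dvc' file tracks one output called 'name'."""
    return [
        name[: -len(DVC_SUFFIX)]
        for name in workspace_files
        if name.endswith(DVC_SUFFIX) and len(name) > len(DVC_SUFFIX)
    ]


def glob_outputs(pattern, workspace_files):
    """Match the pattern against declared outputs, whether or not
    the data files themselves are present in the workspace."""
    outputs = declared_outputs(workspace_files)
    return sorted(out for out in outputs if fnmatchcase(out, pattern))


# Fresh clone: only the .dvc files exist, yet the pattern still selects data.json.
print(glob_outputs("*.json", ["data.json.dvc", "image.png.dvc"]))
```

Because the pattern is matched against what is tracked rather than what is on disk, a fresh clone would select only the JSON output in this sketch.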

pared commented 3 years ago

#5273 was ultimately not merged, so it seems this issue remains.