iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

add support for wildcard/patterns #4816

Open dmpetrov opened 3 years ago

dmpetrov commented 3 years ago

Sometimes the user needs only a subset of files when importing or pulling data from a data directory. It would be convenient to be able to define a file pattern for an import or pull.

From https://discuss.dvc.org/t/working-with-a-small-subset-of-remote-data/541 Related: https://github.com/iterative/dvc/issues/4705, https://github.com/iterative/dvc/issues/4815

Patterns to implement:

The first three patterns should use regular Unix file syntax, while the last two require a special language to define the pattern. We need to find good examples for those.

dmpetrov commented 3 years ago

Based on my experience I'd assign the priorities like this:

  1. simple wildcard *
  2. globstar/recursive **
  3. data path/file-%Y-%m-%d.txt
  4. iterator/count %C
  5. whole wildcard - ?, ., {}

But we need to agree on the common pattern format (how to reflect the pattern in dvc-files) before implementing even the first step.
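To make the glob-style items in the list above concrete (simple wildcard, globstar, and the `?` part of the whole wildcard), here is a minimal sketch using only Python's standard library. It is not DVC code: `glob_to_regex`, `select`, and the `FILES` list are hypothetical names made up for illustration, and only `*`, `**/`, and `?` are handled.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small glob subset into a regex (illustrative only)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:[^/]+/)*")  # globstar: zero or more whole directories
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")  # simple wildcard: stays inside one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")  # exactly one character, never a separator
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def select(paths, pattern):
    """Return the paths that match the pattern, preserving input order."""
    rx = glob_to_regex(pattern)
    return [p for p in paths if rx.match(p)]


# Hypothetical directory listing, loosely modeled on the cats-dogs example.
FILES = [
    "data/train/dogs/1.img",
    "data/train/dogs/2.img",
    "data/train/dogs/notes.txt",
    "data/train/cats/1.img",
    "data/val/dogs/10.img",
]

print(select(FILES, "data/train/dogs/*.img"))  # simple wildcard
print(select(FILES, "data/**/*.img"))          # globstar
print(select(FILES, "data/val/dogs/?.img"))    # single-character wildcard
```

The sketch keeps `*` from crossing `/`, which is the distinction between items 1 and 2. Python's own `fnmatch` module does not make that distinction, so it is avoided here.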

efiop commented 3 years ago

Regarding the first step

simple wildcard dvc pull cats-dogs/data/train/dogs/*.img

support for dir entries will simply require treating existing filter_info in https://github.com/iterative/dvc/blob/6a9ab9cdfbf8ddd5ccb647b072cc36955a69a0e1/dvc/output/base.py#L403 appropriately. Right now we only check if filter equals or contains other files.
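As a rough illustration of the difference between the existing "equals or contains" check and a pattern-aware one, consider the sketch below. It does not reproduce the code in `dvc/output/base.py`; `matches_prefix`, `matches_pattern`, and `filter_entries` are hypothetical helpers, and directory entries are modeled as plain POSIX path strings.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches_prefix(entry: str, filter_path: str) -> bool:
    """Equality/containment check: entry is filter_path or lies under it."""
    e, f = PurePosixPath(entry), PurePosixPath(filter_path)
    return e == f or f in e.parents


def matches_pattern(entry: str, pattern: str) -> bool:
    """Wildcard check, compared segment by segment so `*` never crosses `/`."""
    e_parts = PurePosixPath(entry).parts
    p_parts = PurePosixPath(pattern).parts
    return len(e_parts) == len(p_parts) and all(
        fnmatchcase(e, p) for e, p in zip(e_parts, p_parts)
    )


def filter_entries(entries, filter_info: str):
    """Pick the pattern check only when the filter contains glob characters."""
    is_glob = any(ch in filter_info for ch in "*?[")
    check = matches_pattern if is_glob else matches_prefix
    return [e for e in entries if check(e, filter_info)]


# Hypothetical contents of a tracked directory.
ENTRIES = [
    "cats-dogs/data/train/dogs/1.img",
    "cats-dogs/data/train/dogs/readme.txt",
    "cats-dogs/data/train/cats/1.img",
]

print(filter_entries(ENTRIES, "cats-dogs/data/train/dogs"))
print(filter_entries(ENTRIES, "cats-dogs/data/train/dogs/*.img"))
```

A plain path keeps the current behavior (everything under the directory), while a filter containing a wildcard narrows the result to matching entries.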

Regular glob patterns are clearer than the proposed date/counter selectors; those need some research on existing solutions. So this is a multilayer ticket with a lot of special cases.
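One existing solution worth looking at for the date selector is the `strptime` format syntax, which already uses directives such as `%Y-%m-%d`. The sketch below is only an assumption about how such a selector might behave: the thread does not define its semantics, and `parse_dated_name`, `select_by_date`, and the inclusive date range are hypothetical.

```python
from datetime import date, datetime


def parse_dated_name(name: str, template: str):
    """Return the date encoded in name, or None if it does not fit the template."""
    try:
        return datetime.strptime(name, template).date()
    except ValueError:
        return None


def select_by_date(names, template: str, start: date, end: date):
    """Keep names whose embedded date falls within [start, end] (inclusive)."""
    selected = []
    for name in names:
        parsed = parse_dated_name(name, template)
        if parsed is not None and start <= parsed <= end:
            selected.append(name)
    return selected


# Hypothetical file names following the file-%Y-%m-%d.txt pattern.
NAMES = [
    "file-2020-10-30.txt",
    "file-2020-11-02.txt",
    "file-2020-12-01.txt",
    "README.md",
]

print(select_by_date(NAMES, "file-%Y-%m-%d.txt", date(2020, 11, 1), date(2020, 11, 30)))
```

Note that `strptime` also accepts months and days without zero padding, so a real implementation would need to decide whether that leniency is acceptable.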

karajan1001 commented 3 years ago

Related #4419.

ju0gri commented 3 years ago

I will be taking a stab at implementing the first step for this issue.

jorgeorpinel commented 3 years ago

  • [ ] simple wildcard dvc pull cats-dogs/data/train/dogs/*.img

Sounds like at least this check box could be marked, per #4864?

efiop commented 3 years ago

@jorgeorpinel No, only dvc add supports it right now.

efiop commented 3 years ago

Related https://github.com/iterative/dvc/issues/4912

dmpetrov commented 3 years ago

> Sounds like at least this check box could be marked, per #4864?

@jorgeorpinel #4864 is only about dvc add. Support in pull/push/import is still missing, so the first checkbox cannot be checked yet.

ju0gri commented 3 years ago

I can continue adding this functionality for all commands, if that's alright.

efiop commented 3 years ago

@ju0gri Thanks for looking into it! :pray:

ju0gri commented 3 years ago

In the case of dvc import, what is the desired behaviour for example when importing something like dir/subdir/foo* - should dir/subdir contain one individual entry for each file matching the pattern? Also, when importing only files such as foo* in a passed output folder foos_imported - should this be a folder containing the individual files e.g. foos_imported/foo.dvc, foos_imported/foo123.dvc or should there be an entry for each foo file prefixed with the output value: e.g. foos_imported_foo.dvc, foos_imported_foo123.dvc?

efiop commented 3 years ago

@ju0gri Good question! We could start simple: dvc import and its signature only supports one target, so it would be safe to just error-out if after globbing you get more than one target.
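The "error out on more than one match" rule could look roughly like the sketch below. This is not how dvc import is implemented: `resolve_single_target` and `AmbiguousTargetError` are hypothetical names, and the candidate paths are passed in as a list rather than read from a repository.

```python
from fnmatch import fnmatchcase


class AmbiguousTargetError(ValueError):
    """Raised when a pattern expands to more than one target (hypothetical)."""


def resolve_single_target(pattern: str, candidates):
    """Expand pattern against candidates and insist on exactly one match.

    fnmatchcase lets `*` match `/` as well, which is acceptable for a sketch
    but would need tightening in a real implementation.
    """
    matches = sorted(c for c in candidates if fnmatchcase(c, pattern))
    if not matches:
        raise FileNotFoundError(f"no path matches {pattern!r}")
    if len(matches) > 1:
        raise AmbiguousTargetError(
            f"{pattern!r} matches {len(matches)} paths; import expects one target"
        )
    return matches[0]


# Hypothetical listing of the source repository.
CANDIDATES = [
    "dir/subdir/foo234783478432hjhfjdfd",
    "dir/subdir/bar",
    "dir/subdir/baz",
]

print(resolve_single_target("dir/subdir/foo*", CANDIDATES))
```

With this rule, a pattern is still useful as shorthand for a long filename, while `dir/subdir/ba*` would be rejected because it matches two paths.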

ju0gri commented 3 years ago

> @ju0gri Good question! We could start simple: dvc import and its signature only supports one target, so it would be safe to just error-out if after globbing you get more than one target.

Ok, so I was going down the complicated route with the solution for this. Does it still make sense to add the functionality to import in this case? The only benefits I see are simplifying the typing of a long, complex filename (e.g. foo234783478432hjhfjdfd) and maybe serving as a building block for future work, where import might return a list of stages similar to add.

efiop commented 3 years ago

@ju0gri Yep, still useful.

jorgeorpinel commented 3 years ago

Question:

We've introduced the --glob option to a few commands to implement some of the patterns above (the ones covered by glob, i.e. 1, 2, and 5 from https://github.com/iterative/dvc/issues/4816#issuecomment-719996406).

Is the option temporary, with the expectation of making this the default behavior at some point? Otherwise I think we may need a better term, as discussed in https://github.com/iterative/dvc/pull/4976#issuecomment-736701953, and even more so now that I see patterns 3 (date) and 4 (iterator), which I think aren't covered by "glob".

Thanks

jorgeorpinel commented 3 years ago

Hi, can we include the discussion about wildcards in stage output and dependency definitions (in dvc.yaml and maybe also run/stage add -od)? It's not listed in the check boxes of this issue's description, but it's mentioned in https://github.com/iterative/dvc/issues/1462#issuecomment-450821953 (2.A and B). Or I can make a separate issue for outs/deps.

A couple of users have brought up the need for this in https://discuss.dvc.org/t/managing-pipelines-operating-per-dataset-element/613

tibor-mach commented 1 year ago

Seconding @jorgeorpinel on this: there is some new demand for wildcards on dvc stage outputs.