Open dmpetrov opened 3 years ago
Based on my experience I'd assign the priorities like this:
1. *
2. **
3. path/file-%Y-%m-%d.txt
4. %C
5. ?, ., {}
But we need to agree on the common pattern format (how to reflect the pattern in dvc-files) before implementing even the first step.
Regarding the first step (simple wildcard: dvc pull cats-dogs/data/train/dogs/*.img), support for dir entries will simply require treating the existing filter_info in https://github.com/iterative/dvc/blob/6a9ab9cdfbf8ddd5ccb647b072cc36955a69a0e1/dvc/output/base.py#L403 appropriately. Right now we only check whether the filter equals or contains other files.
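A minimal sketch of what treating the filter as a pattern could look like, assuming the directory's entries are available as a flat list of POSIX-style relative paths. `filter_entries` and `_matches` are hypothetical helpers for illustration, not DVC's actual code; the "equals or contains" check is kept as the fallback when the filter has no wildcard.

```python
from fnmatch import fnmatchcase

GLOB_CHARS = set("*?[")


def _matches(path, pattern):
    """Match component by component so that '*' never crosses a '/'."""
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(p, pat) for p, pat in zip(path_parts, pattern_parts))


def filter_entries(entries, filter_info):
    """Return the dir entries selected by filter_info (hypothetical helper)."""
    if filter_info is None:
        return list(entries)
    if GLOB_CHARS & set(filter_info):
        # Wildcard filter: keep entries whose path matches the pattern.
        return [e for e in entries if _matches(e, filter_info)]
    # Plain filter: the entry equals the filter or lies inside it.
    prefix = filter_info.rstrip("/") + "/"
    return [e for e in entries if e == filter_info or e.startswith(prefix)]


entries = [
    "data/train/dogs/1.img",
    "data/train/dogs/2.img",
    "data/train/dogs/notes.txt",
    "data/train/dogs/sub/3.img",
    "data/train/cats/1.img",
]
selected = filter_entries(entries, "data/train/dogs/*.img")
```

Matching per path component is a design choice here: it keeps `*` from descending into subdirectories, which leaves room for `**` to be added as a separate step.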
Regular glob patterns are clearer than the proposed date/counter selectors; those need some research on existing solutions. So this is a multilayer ticket that has a lot of special cases.
Related #4419.
I will be taking a stab at implementing the first step for this issue.
- [ ] simple wildcard dvc pull cats-dogs/data/train/dogs/*.img
Sounds like at least this check box could be marked, per #4864?
@jorgeorpinel No, only dvc add supports it right now.
@jorgeorpinel #4864 is only about dvc add. pull/push/import are still missing before the first checkbox can be checked.
I can continue adding this functionality for all commands, if that's alright.
@ju0gri Thanks for looking into it! :pray:
In the case of dvc import, what is the desired behaviour when importing something like dir/subdir/foo*: should dir/subdir contain one individual entry for each file matching the pattern?
Also, when importing only files such as foo* into a passed output folder foos_imported, should this be a folder containing the individual files, e.g. foos_imported/foo.dvc, foos_imported/foo123.dvc, or should there be an entry for each foo file prefixed with the output value, e.g. foos_imported_foo.dvc, foos_imported_foo123.dvc?
@ju0gri Good question! We could start simple: dvc import and its signature only support one target, so it would be safe to just error out if, after globbing, you get more than one target.
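A rough sketch of that "start simple" behaviour, assuming the list of paths available in the source repo has already been collected. `resolve_single_target` and `GlobTargetError` are hypothetical names used for illustration only and are not part of DVC.

```python
from fnmatch import fnmatchcase


class GlobTargetError(Exception):
    """Raised when a pattern does not resolve to exactly one target (hypothetical)."""


def resolve_single_target(pattern, available_paths):
    """Expand pattern against available_paths and insist on a single match."""
    # Note: fnmatch's '*' also matches '/', which is acceptable for this sketch.
    matches = sorted(p for p in available_paths if fnmatchcase(p, pattern))
    if not matches:
        raise GlobTargetError(f"no target matches {pattern!r}")
    if len(matches) > 1:
        raise GlobTargetError(
            f"{pattern!r} matches {len(matches)} targets, but import supports only one"
        )
    return matches[0]


paths = ["dir/subdir/foo234783478432hjhfjdfd", "dir/subdir/bar"]
target = resolve_single_target("dir/subdir/foo2*", paths)
```

With this shape, a pattern such as dir/subdir/* would raise instead of silently picking one of the two files.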
Ok, so I was going down the complicated route with the solution for this. Does it still make sense to add the functionality to import in this case? The only benefits I see are simplifying the typing of a long, complex filename, e.g. foo234783478432hjhfjdfd, and maybe serving as a building block for future work where import might return a list of stages, similar to add.
@ju0gri Yep, still useful.
Question:
We've introduced the --glob option to a few commands to implement some of the patterns above (the ones covered by glob, i.e. 1, 2, and 5 from https://github.com/iterative/dvc/issues/4816#issuecomment-719996406).
Is the option temporary, with the expectation of making this the default behavior at some point? Otherwise I think we may need a better term, as discussed in https://github.com/iterative/dvc/pull/4976#issuecomment-736701953, and even more so now that I see patterns 3 (iterator) and 4 (date), which I think aren't covered by "glob".
Thanks
Hi, can we include the discussion about wildcards in stage output and dependency definitions (in dvc.yaml and maybe also run/stage add -od)? It's not listed in the check boxes of this issue's description, but it's mentioned in https://github.com/iterative/dvc/issues/1462#issuecomment-450821953 (2.A and B). Or I can make a separate issue for outs/deps.
A couple users have brought up the need for this in https://discuss.dvc.org/t/managing-pipelines-operating-per-dataset-element/613
Seconding @jorgeorpinel on this: there is some new demand for wildcards on dvc stage outputs.
Sometimes only a subset of files is needed when the user runs import or pull to get data from a data directory. It is convenient to define a file pattern for an import.
From https://discuss.dvc.org/t/working-with-a-small-subset-of-remote-data/541
Related: https://github.com/iterative/dvc/issues/4705, https://github.com/iterative/dvc/issues/4815
Patterns to implement:
dvc pull cats-dogs/data/train/dogs/*.img
dvc pull cats-dogs/data/train/{dogs,cats}/???.img
dvc pull cats-dogs/data/train/**/*.img
dvc pull cats-dogs/data/train/dogs/%C.img?counter=1:100
dvc pull users/%Y/%m/%d/users.csv?startdata=2020-09-01,enddate=now,ignoremissing
The first three patterns should use regular Unix file syntax, while the last two require a special language to define the pattern; we need to find good examples.
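To make the last two patterns more concrete, here is a small sketch of how the counter and date selectors might expand into plain paths. Everything in it is an assumption: `expand_counter` and `expand_dates` are hypothetical helpers, both ranges are treated as inclusive, and parsing of the query-string part (counter=..., enddate=now, ignoremissing) is left out because that syntax is not settled.

```python
from datetime import date, timedelta


def expand_counter(pattern, start, stop):
    """Replace %C with each integer from start to stop, inclusive (assumed semantics)."""
    return [pattern.replace("%C", str(i)) for i in range(start, stop + 1)]


def expand_dates(pattern, start, end):
    """Fill strftime fields such as %Y/%m/%d, one path per day from start to end."""
    days = (end - start).days
    return [(start + timedelta(days=n)).strftime(pattern) for n in range(days + 1)]


counter_paths = expand_counter("cats-dogs/data/train/dogs/%C.img", 1, 3)
date_paths = expand_dates("users/%Y/%m/%d/users.csv", date(2020, 9, 1), date(2020, 9, 3))
```

Reusing strftime fields for the date selector would avoid inventing a new mini-language, though it would need an escaping rule for paths that contain a literal %.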