iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

get: show url to cache #2994

Closed dmpetrov closed 4 years ago

dmpetrov commented 4 years ago

In some cases, users need a direct link to a data file in the cloud. It might be just a matter of convenience or when DVC is used from automation tools like CD4ML scenarios. We need to provide a way to get a direct link.

I suggest creating a new command, dvc resolve. Other options can also be considered. For example, it might be part of dvc status (once it starts supporting data files as arguments).

$ dvc resolve file.txt df.csv
file.txt s3://dvc-temp/dataset1/2d10f00a05f1fe70eaa3c42aa9f44b95
df.csv s3://dvc-temp/dataset1/99370a5386cac00c93c9fdc836076a7d

It should support revisions (checksum, tag, branch or HEAD^^):

$ dvc resolve file.txt df.csv --rev ada9d973
file.txt s3://dvc-temp/dataset1/d30abac6950583e6a5f1dcd31cb1043b
df.csv gs://project42/cache/2290babda371e52eeca2a2065a358783

Note, the commands with revision options should resolve old remotes as well (see s3:// and gs:// above).

efiop commented 4 years ago

@dmpetrov Sounds like some automation helper, that can be worked around by simply reading the dvc file, extracting the checksum and then accessing the remote. Sounds like that wouldn't be a problem for anyone needing this for some automation script, not sure it is useful enough for many users though.

dmpetrov commented 4 years ago

Sounds like some automation helper, that can be worked around by simply reading the dvc file

This feature is required in automation scenarios where people prefer to write bash code. Also, in the case of using revisions, it won't be as simple to implement. A single command would be really great to have.

In terms of convenience, asking a user to do this manually requires quite deep DVC knowledge, and it is easy to make a mistake. The user has to perform 4 operations: find the corresponding dvc-file, find the right checksum, find the default remote in the config, and combine them. It is 2x more complex for old revisions: it involves file copies and requires knowing the get command.

This command will hide a lot of internals and will simplify the usage. It is about lowering the bar.
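The four manual steps listed above can be sketched roughly as follows. This is only an illustration, not DVC's actual implementation: it assumes the dvc-file has already been located (step 1) and that both it and the config have been parsed into dictionaries. The function name `resolve_manually`, the dictionary shapes, and the flat `<remote url>/<checksum>` layout (copied from the example output in the proposal) are all assumptions.

```python
def resolve_manually(path, dvc_file, config):
    """Combine a checksum from a parsed dvc-file with the default remote URL.

    Assumed (hypothetical) input shapes:
      dvc_file = {"outs": [{"path": "file.txt", "md5": "2d10..."}]}
      config   = {"core": {"remote": "storage"},
                  'remote "storage"': {"url": "s3://dvc-temp/dataset1"}}
    """
    # Step 2: find the right checksum for the requested output.
    checksums = {out["path"]: out["md5"] for out in dvc_file["outs"]}
    if path not in checksums:
        raise KeyError(f"{path} is not an output of this dvc-file")

    # Step 3: find the default remote in the config.
    remote_name = config["core"]["remote"]
    remote_url = config[f'remote "{remote_name}"']["url"]

    # Step 4: combine them, using the flat layout shown in the proposal.
    return f"{remote_url.rstrip('/')}/{checksums[path]}"
```

Even in this simplified form, the caller has to know where the checksum and the remote URL live, which is the knowledge the proposed command would hide.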

Suor commented 4 years ago

This issue has a remote config complication - do we use a remote from config from specified rev or current one? Both might be useful:

jorgeorpinel commented 4 years ago

I think this already exists in dvc.api.get_url so may be a matter of wrapping it in a command 🙂 (and figuring the command arguments and their logic, etc.)
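A thin wrapper along these lines could produce one line per path in the format proposed at the top of the thread. It is a sketch only: `resolve` and its `get_url` parameter are hypothetical names, and the callable is passed in so that the snippet runs without DVC installed. In practice one would pass `dvc.api.get_url`, assuming it accepts `(path, repo=..., rev=...)`.

```python
def resolve(paths, get_url, repo=None, rev=None):
    """Return "path url" lines, one per requested path.

    `get_url` is any callable with a dvc.api.get_url-like signature
    (assumed here to be get_url(path, repo=None, rev=None)).
    """
    lines = []
    for path in paths:
        url = get_url(path, repo=repo, rev=rev)
        lines.append(f"{path} {url}")
    return lines


if __name__ == "__main__":
    # Stand-in for dvc.api.get_url so the example runs without DVC.
    def fake_get_url(path, repo=None, rev=None):
        return f"s3://dvc-temp/dataset1/{path}"

    print("\n".join(resolve(["file.txt", "df.csv"], fake_get_url)))
```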

efiop commented 4 years ago

Change of plans. Talked with @dmpetrov and @shcheklein, and they suggested that --show-url (or something similar) for dvc get would be better than introducing a separate command. Let's consider that instead.

Also, what to show for directories?

skshetry commented 4 years ago

@efiop, maybe I am missing something, but as per the docs, the get command downloads a file/directory from a DVC and Git repository. Adding another responsibility to get, i.e. getting the path/URL to the files in the cache, is, in my view, completely different from its existing responsibility.

shcheklein commented 4 years ago

@skshetry good point! Let me clarify the way I see it. The reason behind the decision to piggy-back on dvc get for this feature is to avoid introducing a new global command (dvc resolve) + it feels like a "dry-run" dvc get (show a URL it would download, but do not actually download it), especially considering that they have a common interface - URL + path.

Btw, do you have a better place in mind to implement this?

skshetry commented 4 years ago

@shcheklein, my argument is for Unix philosophy, and have a better interface and not become something similar to git checkout.

git-checkout - Switch branches or restore working tree files

But, my opinion is based on the discussions here on the issue. I'll check the implementation to see what get_url() really does and will make up my mind.

shcheklein commented 4 years ago

my argument is for Unix philosophy, and have a better interface and not become something similar to git checkout

💯, but we should be reasonable about adding new top level commands. Otherwise we will become like git from a different angle - https://git-man-page-generator.lokaltog.net/

I'll check the implementation to see what get_url() really does and will make up my mind.

yes, please take a look. Open to other options.

jorgeorpinel commented 4 years ago

Will dvc import also have the same flag? I also think these commands already do too many things tbh. And won't it be confusing that dvc list will show paths inside the repo url but when you actually run dvc get/import --show-url on them, the URL is to the remote instead?

What about adding the option to the new dvc list command instead? See #2509. This would give people the alternative of downloading with their tools of choice, e.g. wget or aws s3 cp.

shcheklein commented 4 years ago

@jorgeorpinel dvc list is a good candidate! (we can consider migrating this option from get when it's ready).

jorgeorpinel commented 4 years ago

OK. Another Q: Will this new option also say whether the file actually exists at that URL? Related to #3155

shcheklein commented 4 years ago

@jorgeorpinel no, that's out of scope for now, I think.

jorgeorpinel commented 4 years ago

Yeah makes sense. I'm going to add a note in the dvc.api.get_url docs to emphasize that the existence of the file is not implied by having the URL and that users should keep that in mind when they use the URL (wrap in try/catch statement). So the same note would apply to this option when we write its docs.
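As a rough illustration of that advice, a caller might guard the download like this. The helper name `fetch_if_present` is hypothetical, and the sketch only covers schemes that Python's `urllib` understands (http, https, file); an s3:// or gs:// URL would need the corresponding cloud client and its own error handling.

```python
import urllib.error
import urllib.request


def fetch_if_present(url, timeout=10):
    """Return the bytes behind `url`, or None if it cannot be fetched.

    Only schemes urllib understands (http, https, file) are handled here;
    s3:// or gs:// URLs would need the corresponding cloud client instead.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, ValueError):
        # URLError covers missing files, HTTP errors and unknown schemes;
        # ValueError is raised for malformed URLs.
        return None
```

Returning `None` keeps the example short; a real script may prefer to log or re-raise so that a missing file is not silently ignored.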

efiop commented 4 years ago

Also, what to show for directories?

Since we are not able to access URLs, we can't really parse and show directories. We will only show the URL of the .dir file, which can then be downloaded and parsed by the user.
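For a user who has already downloaded such a .dir file, the parsing step might look roughly like the sketch below. It assumes the .dir file is a JSON list of objects with "relpath" and "md5" keys, and it reuses the flat `<remote url>/<checksum>` layout from the examples in this thread; the function name `dir_entry_urls` is hypothetical and both format assumptions should be checked against the actual remote.

```python
import json


def dir_entry_urls(dir_file_text, remote_url):
    """Map each relative path in a .dir listing to a URL on the remote.

    Assumes the .dir file is a JSON list of objects with "relpath" and
    "md5" keys, and the flat `<remote url>/<checksum>` layout used in the
    examples of this thread.
    """
    entries = json.loads(dir_file_text)
    base = remote_url.rstrip("/")
    return {entry["relpath"]: f"{base}/{entry['md5']}" for entry in entries}
```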

Also, git files will be handled separately when we add support for them to api itself.

skshetry commented 4 years ago

@efiop, opened #3182 for further discussion regarding directories.

Suor commented 4 years ago

I agree with @skshetry here - an option that completely hijacks what the command does is confusing. Are we trying to beat git in bad command line UI? ;)

shcheklein commented 4 years ago

@Suor mention alternatives? what do you think about other options mentioned above? (please, let's be constructive a little bit 🙏).

Suor commented 4 years ago

@shcheklein dvc show-url <repo-url> <path>

efiop commented 4 years ago

@Suor Suffers from the same issue of dedicating a full command to this functionality, same as dvc resolve. Let's keep as is for now.

Suor commented 4 years ago

I don't understand why that is the issue.

shcheklein commented 4 years ago

@Suor every command has a lot of boilerplate (code, docs, etc) that we would need to support - this is not nice. This approach is not scalable in terms of even reading the list of commands (we can't do 100 commands, for example).

ghost commented 4 years ago

@shcheklein, why not trying to reduce the boilerplate instead) ?

shcheklein commented 4 years ago

why not trying to reduce the boilerplate instead

sure, but it's never zero anyway right? )

skshetry commented 4 years ago

I'd like to add another point. If it's just to show URL, it does not make much sense to introduce a new command if it will be going to a new home anyway (see: https://github.com/iterative/dvc/issues/2994#issuecomment-574803781) and it makes more sense there imo.

And, i think, it'll be problematic to deprecate a global command compared to an option?

EDIT: Simplified a bit.

shcheklein commented 4 years ago

And, i think, it'll be problematic to deprecate a global command compared to an option?

💯