Closed dmpetrov closed 4 years ago
@dmpetrov Sounds like some automation helper, that can be worked around by simply reading the dvc file, extracting the checksum and then accessing the remote. Sounds like that wouldn't be a problem for anyone needing this for some automation script, not sure it is useful enough for many users though.
Sounds like some automation helper, that can be worked around by simply reading the dvc file
This feature is required in automation scenarios where people prefer to write bash code. Also, in the case of using revisions, it won't be as simple to implement. A single command - would be really great to have.
In terms of convenience, if you ask a user to do that in a manual way - it requires quite deep DVC knowledge and it is easy to make a mistake. The user should do 4 operations: find the corresponding dvc-file, find a right checksum, find default remote in config, combine them. 2x more complex for old revisions - file copies, need to know get
command.
This command will hide a lot of internals and will simplify the usage. It is about lowering the bar.
This issue has a remote config complication - do we use a remote from config from specified rev or current one? Both might be useful:
I think this already exists in dvc.api.get_url
so may be a matter of wrapping it in a command 🙂 (and figuring the command arguments and their logic, etc.)
Change of plans. Talked with @dmpetrov @shcheklein , and they suggested that --show-url
(or something similar) for dvc get
would be better than introducing a separate command. Let's consider that instead.
Also, what to show for directories?
@efiop, maybe I am missing something, but as per docs, get
command downloads a file/directory from a DVC and Git repository. And, adding another responsibility to get
, i.e. getting path/URL to the files in the cache, which in my view, is completely different from previous responsibility.
@skshetry good point! Let me clarify the way I see it. The reasons behind this decision to piggy-back on dvc get
for this feature is to avoid introducing a new global command (dvc resolve
) + it feels like a "dry-run" dvc get
(show an URL it would download, but do not actually download it), especially considering that they have a common interface - URL + path.
Btw, do you have a better place in mind to implement this?
@shcheklein, my argument is for Unix philosophy, and have a better interface and not become something similar to git checkout
.
git-checkout - Switch branches or restore working tree files
But, my opinion is based on the discussions here on the issue. I'll check the implementation to see what get_url()
really does and will make up my mind.
my argument is for Unix philosophy, and have a better interface and not become something similar to
git checkout
💯, but we should be reasonable about adding new top level commands. Otherwise we will become like git from a different angle - https://git-man-page-generator.lokaltog.net/
I'll check the implementation to see what get_url() really does and will make up my mind.
yes, please take a look. Open to other options.
Will dvc import
also have the same flag? I also think these commands already do too many things tbh. And won't it be confusing that dvc list
will show paths inside the repo url
but when you actually run dvc get/import --show-url
on them, the URL is to the remote instead?
What about adding the option in the new dvc list
command instead? See #2509
This would give people the alternative to download their tools of choice e.g. wget
or aws s3 cp
@jorgeorpinel dvc list
is a good candidate! (we can consider migrating this option from get when it's ready).
OK. Another Q: Will this new option also say whether the file actually exists in that URL? Related to #3155
@jorgeorpinel no, that's out of scope for now, I think.
Yeah makes sense. I'm going to add a note in the dvc.api.get_url
docs to emphasize that the existence of the file is not implied by having the URL and that users should keep that in mind when they use the URL (wrap in try/catch statement). So the same note would apply to this option when we write its docs.
Also, what to show for directories?
Since we are not able to access urls, we can't really parse and show directories. We will only show url to .dir
file that can then be downloaded and parsed by the user.
Also, git files will be handled separately when we add support for them to api itself.
@efiop, opened #3182 for further discussion regarding directories.
I agree with @skshetry here - an option that completely hijacks what the command does is confusing. Are we trying to beat git in bad command line UI? ;)
@Suor mention alternatives? what do you think about other options mentioned above? (please, let's be constructive a little bit 🙏).
@shcheklein dvc show-url <repo-url> <path>
@Suor Suffers from the same issue of dedicating a full command to this functionality, same as dvc resolve
. Let's keep as is for now.
I don't understand why is that the issue.
@Suor every command has a lot of boilerplate (code, docs, etc) that we would need to support - this is not nice. This approach is not scalable in terms of even reading the list of commands (we can't do 100 commands, for example).
@shcheklein, why not trying to reduce the boilerplate instead) ?
why not trying to reduce the boilerplate instead
sure, but it's never zero anyway right? )
I'd like to add another point. If it's just to show URL, it does not make much sense to introduce a new command if it will be going to a new home anyway (see: https://github.com/iterative/dvc/issues/2994#issuecomment-574803781) and it makes more sense there imo.
And, i think, it'll be problematic to deprecate a global command compared to an option?
EDIT: Simplified a bit.
And, i think, it'll be problematic to deprecate a global command compared to an option?
💯
In some cases, users need a direct link to a data file in the cloud. It might be just a matter of convenience or when DVC is used from automation tools like CD4ML scenarios. We need to provide a way to get a direct link.
I suggest creating a new command
dvc resolve
. Other options can be also be considered. For example, it might be a part ofdvc status
(when it start supporting data files as argument).It should support revisions (checksum, tag, branch or HEAD^^):
Note, the commands with
revision
options should resolve old remotes as well (sees3://
andgs://
above).