iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

new command: put-url OR rsync/rclone #6485

Open casperdcl opened 3 years ago

casperdcl commented 3 years ago

Summary

An upload equivalent of dvc get-url.

We currently use get-url as a cross-platform replacement for wget. Adding put-url alongside it would turn DVC into a replacement for rsync/rclone.
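
For context, the existing download-only counterpart works like this (URL and filename are illustrative):

$ dvc get-url https://data.example.com/dataset.csv dataset.csv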

Motivation

Detailed Design

usage: dvc put-url [-h] [-q | -v] [-j <number>] url targets [targets ...]

Upload or copy files to URL.
Documentation: <https://man.dvc.org/put-url>

positional arguments:
  url                   Destination path to put data to.
                        See `dvc import-url -h` for full list of supported
                        URLs.
  targets               Files/directories to upload.

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           Be quiet.
  -v, --verbose         Be verbose.
  -j <number>, --jobs <number>
                        Number of jobs to run simultaneously. The default
                        value is 4 * cpu_count(). For SSH remotes, the default
                        is 4.
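
A hypothetical invocation, following the usage above (bucket and file names are illustrative):

$ dvc put-url s3://mybucket/backups/ data/train.csv data/test.csv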

How We Teach This

Drawbacks

Alternatives

Unresolved Questions

Please do assign me if happy with the proposal.

(dvc get-url + put-url = dvc rsync :))

dberenbaum commented 3 years ago

(dvc get-url + put-url = dvc rsync :))

Does it make sense to have a separate put-url command strictly for uploads, or would it be better to have a combined transfer command? This has both UI and usage implications, and a combined transfer command could enable:

* Python API (`dvc.api.put_url()`)?

dvc get-url and dvc.api.get_url() don't really do the same thing unfortunately, so I'm unclear what dvc.api.put_url() would do? That confusion might be another reason to prefer a two-way transfer command over separate download and upload commands.

casperdcl commented 3 years ago

dvc get-url and dvc.api.get_url() don't really do the same thing

:eyes: well that sounds like a problem

would it be better to have a combined transfer command?

Maybe this wouldn't be too difficult to implement. Summarising what I think the most useful rsync options may be:

dvc rsync [-h] [-q | -v] [-j <number>] [--recursive] [--ignore-existing]
          [--remove-source-files] [--include <pattern>] [--exclude <pattern>]
          [--list-only] src [src...] dst

We've endeavoured to provide buffered read/write methods for all remote types so that we can show progress... So simply gluing said BufferedIO methods together should provide the desired functionality.
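
A minimal sketch of that gluing idea, using fsspec (which DVC's filesystem layer builds on) rather than DVC's actual internal API; the function name and parameters below are illustrative:

import os
import fsspec
from tqdm import tqdm

def put_url(local_path, url, chunk_size=1024 * 1024):
    """Stream a local file to any fsspec-supported URL, showing progress."""
    fs, remote_path = fsspec.core.url_to_fs(url)
    total = os.path.getsize(local_path)
    with open(local_path, "rb") as src, fs.open(remote_path, "wb") as dst, \
            tqdm(total=total, unit="B", unit_scale=True, desc=remote_path) as pbar:
        while chunk := src.read(chunk_size):
            dst.write(chunk)  # the remote's buffered writer handles the actual upload
            pbar.update(len(chunk))

# e.g. put_url("out/model.h5", "s3://mybucket/ml/prod/my-model.h5")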

dberenbaum commented 3 years ago

👀 well that sounds like a problem

😬 Yup, I only realized this pretty recently myself. Extracted to #6494.

dmpetrov commented 2 years ago

The initial motivation for this command was to upload data to storage and preserve a pointer file (*.dvc file) to it for future downloads. This command (and its download equivalent) should work from non-DVC repositories, i.e. with no dependency on the .dvc/ dir.

It seems like, we need an "upload equivalent" of import-url, not get-url.

Proposed names: dvc export url [--no-exec] FILE URL and renaming some of the existing commands:

The --no-exec option is needed for cases when storage credentials are not set. It means not uploading/downloading (and not checking for file existence), only generating a pointer file. In the case of downloading, the pointer file should have empty checksums.

New commands:

PS: it is related to the idea of lightweight model and data management

casperdcl commented 2 years ago

The initial motivation for this command was to upload data to storage and preserve a pointer file (*.dvc file) [...] It seems like, we need an "upload equivalent" of import-url, not get-url.

export url sounds like a different feature request & use case to me. put url meanwhile is meant to work like get url, i.e. no pointer files/metadata.

I also think put url is a prerequisite to export url just like get url is a prerequisite to import url.

dmpetrov commented 2 years ago

put url is a good idea and I hope it will be implemented as part of this initiative. But we should understand that it is not part of the requirements.

I'd suggest focusing on the core scenario - export url - that is required for CML and MLEM.

shcheklein commented 2 years ago

the simplest way to upload a file and preserve a pointer file to it

does it mean "aws s3 cp local_path_in_repo s3://remote_path && dvc import-url s3://remote_path -o local_path_in_repo"?

It sounds then that this should be an extension of import itself (move a file before import, or something). I don't feel that this deserves the full name "export" since it won't be doing anything different from import-url.

casperdcl commented 2 years ago

My original intention with this feature request was just the first bit (aws s3 cp local_path_in_repo s3://remote_path, or even just aws s3 cp local_path_NOT_in_repo s3://remote_path). I don't have any strong opinions about the other import/export functionality & API that can be built on top.

shcheklein commented 2 years ago

My original intention with this feature request was just the first bit

Yep, and name put makes sense for that.

dmpetrov commented 2 years ago

My original intention with this feature request was just the first bit

Yep, and name put makes sense for that.

I thought that we are solving CML, MLEM problems with this proposal, aren't we? If not, I'm creating a separate issue for the integrations and we can keep the current one as is.

shcheklein commented 2 years ago

If not, I'm creating a separate issue for the integrations and we can keep the current one as is.

I'm fine with this one :)

My question remains - it doesn't feel like it's export. It does copy + import under the hood, right? So why export then? Why not an option for the import command (to copy the artifact to the cloud first)?

dmpetrov commented 2 years ago

@shcheklein yes, it can be just an option of import like dvc import url/model/data --upload.

import/export naming is definitely not perfect. So, an option might be a safer choice in the short term.

dberenbaum commented 2 years ago

renaming some of the existing commands

See https://github.com/iterative/dvc/issues/6494#issuecomment-906600254 for a proposed syntax to cover both dvc urls and arbitrary urls with one command.

The --no-exec option is needed for cases when storage credentials are not set. It means not uploading/downloading (and not checking for file existence), only generating a pointer file.

--no-exec exists today in import and import-url to create the .dvc file (with the hash) without downloading. Seems like we need separate flags to skip the check for existence (and adding the hash) and the download.

I'd suggest focusing on the core scenario - export url - that is required for CML and MLEM.

My original intention with this feature request was just the first bit (aws s3 cp local_path_in_repo s3://remote_path, or even just aws s3 cp local_path_NOT_in_repo s3://remote_path). I don't have any strong opinions about the other import/export functionality & API that can be built on top.

Seems like it's unclear whether CML needs put, export, or both. What are the CML use cases for each?

does it mean "aws s3 cp local_path_in_repo s3://remote_path && dvc import-url s3://remote_path -o local_path_in_repo"?

Hm, I thought export would differ from import in that updates would always be uploaded from local to remote (instead of downloading from remote to local). Example workflow:

shcheklein commented 2 years ago

Hm, I thought export would differ from import in that updates would always be uploaded from local to remote

It might make sense to name it export then. That's why I was asking about the semantics of it. From the previous discussions (Dmitry's and Casper's) I didn't get it exactly.

dmpetrov commented 2 years ago

I'm trying to aggregate our discussions here and in person into action points:

  1. [Must-have] dvc export, which should upload a local file to the cloud and preserve a link (.dvc file) similar to the result of dvc import-url.
  2. [Nice-to-have] dvc put-url. It is not part of the use cases (see below), but something like this needs to work under the hood of dvc export anyway, and it might be handy for other scenarios.
  3. [Nice-to-have] dvc import-url --etags-only (like --no-exec, but it gets etags from the cloud) and/or dvc update --etags-only. This is needed to track file status when the file is not downloaded locally.

Important:

Below are user use cases that should help to understand the scenarios.

From local to Cloud/S3

A model out/model.h5 is saved in a local directory (on a local machine, or in the cloud via TPI or CML); it might be a DVC/Git repo or just a directory like ~/. The model needs to be uploaded to a specified place/URL in a cloud/S3. The user needs to keep the pointer file (.dvc) for future use.

Why the user needs the pointer file:

Uploading

$ dvc export out/model.h5 s3://mybucket/ml/prod/my-model.h5
To track the changes with git, run:

    git add out/model.h5.dvc .gitignore
$ git add out/model.h5.dvc
$ git commit -m 'exporting a file'

Note: this command is equivalent to aws s3 cp file s3://path && dvc import-url s3://path file. We could consider introducing a separate command to cover the copy part in a cross-cloud way - dvc put-url. However, its priority is not high in the context of this scenario.
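
Spelled out with the paths from the Uploading example, the manual equivalent today would be roughly (untested sketch; assumes AWS credentials are configured and that import-url may require removing the existing local copy first):

$ aws s3 cp out/model.h5 s3://mybucket/ml/prod/my-model.h5
$ dvc import-url s3://mybucket/ml/prod/my-model.h5 out/model.h5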

Updating

A model file was changed (as a result of re-training) for example:

$ dvc update out/model.h5.dvc # It should work now if the Uploading part is based on `import-url`
To track the changes with git, run:

    git add out/model.h5.dvc .gitignore
$ git add out/model.h5.dvc
$ git commit -m 'File was changed in S3'

From cloud to workspace

Users write models/data to the cloud from their own code (or it is already updated by an external tool). Saving a pointer to the model file might still be useful. Why:

Tracking a cloud file

After training is done and a file saved to s3://mybucket/ml/prod/2022-03-07-model.h5:

$ dvc import-url s3://mybucket/ml/prod/2022-03-07-model.h5 my-model.h5
To track the changes with git, run:

    git add my-model.h5.dvc .gitignore
$ git add my-model.h5.dvc
$ git commit -m 'importing a file'

Tracking a cloud file without a local copy

In some cases, the user writes a file to storage and does not need a copy in the workspace. dvc import-url --no-exec seems like a good option to cover this case.

$ dvc import-url --no-exec s3://mybucket/ml/prod/2022-03-07-model.h5 my-model.h5
To track the changes with git, run:

    git add my-model.h5.dvc .gitignore
$ git add my-model.h5.dvc
$ git commit -m 'tracking a cloud file'

Technically, the file will still have a virtual representation in the workspace as my-model.h5. However, it won't be materialized until dvc update my-model.h5.dvc is called.

Pros/Cons:

To cover the latter cons, we can consider introducing dvc import-url --etags-only (like --no-exec, but getting etags from the cloud) and/or dvc update --etags-only.
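
For illustration, the pointer file produced by import-url records the remote dependency's ETag; with plain --no-exec those fields stay empty, whereas a hypothetical --etags-only would fill in the dep fields without downloading the output. Roughly (all values are placeholders):

md5: 4f1a...
frozen: true
deps:
- etag: 1a2b3c4d5e6f...
  size: 104857600
  path: s3://mybucket/ml/prod/2022-03-07-model.h5
outs:
- md5: 7e9f0c...
  size: 104857600
  path: my-model.h5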

shcheklein commented 2 years ago

@dmpetrov

Could you please clarify this:

"It should work now if the Uploading part is based on import-url" - just expand on this a bit. I'm not sure I understand what direction files go when you do dvc update


My initial reaction is that aws s3 cp file s3://path && dvc import-url s3://path file semantics doesn't deserve a global dvc export command to be honest. It still feels very much like import, not export, since we'll have pretty much an import .dvc file in the repo that detects outside changes and imports the file from the cloud.

External outputs remind me of export: dvc run -d model.pkl -o s3://model.pkl aws s3 cp model.pkl s3://model.pkl. It means that every time the model changes in the repo, it gets exported to S3.

dmpetrov commented 2 years ago

Could you please clarify this please:

"It should work now if the Uploading part is based on import-url

dvc update re-downloads the file. What I mean is that a regular dvc update out/model.h5.dvc will work just fine if the result of dvc export is the same as dvc import-url (in contrast to external outputs, where you need to re-run the pipeline).

The logic is:

To be honest I'd rename the first two to download and upload. If we mix up the direction, users will have similar issues.

My initial reaction is that aws s3 cp file s3://path && dvc import-url s3://path file semantics doesn't deserve a global dvc export command to be honest.

aws s3 cp is not an option here because we need to abstract out from clouds. Alternatively, we could consider dvc put-url file s3://path && dvc import-url s3://path file, but having a single command still looks like a better option.

External outputs remind me of export.

Yes, but the internal machinery and logic are very different. You need a pipeline for external outputs which is not compatible with no-DVC requirements and won't be intuitive for users.

shcheklein commented 2 years ago

the result of dvc export is the same as dvc import-url,

that's exactly the sign that we are mixing the semantics

You need a pipeline for external outputs which is not compatible with no-DVC requirements and won't be intuitive for users.

not necessarily btw, dvc add --external s3://mybucket/existing-data works (at least it worked before)

aws s3 cp is not an option here because we need to abstract out from clouds

Yep, I understand. It's not so much about redundancy of a command, it's more about the semantics still. It confuses me a bit that export internally does import.

For example, we could make dvc export create a .dvc file with a single dependency on model.pkl and an external output to s3://model.pkl - something like the result of dvc add --external s3://mybucket/existing-data, but one that also saves information (if needed) about the local file name that was the source.

And dvc update on this file would work the other way around - it would upload the file to s3 (exporting).

but having a single command still looks like a better option.

If we want to keep these semantics (an import link created inside export), I would probably even prefer to have put-url and do import-url manually. It would be less confusing and very explicit, to my mind.


Also, if we go back to the From local to Cloud/S3 workflow: it states that we create the model as a local file, so does that mean the update will also happen locally when we retrain it? That would mean dvc update should upload the new file in this case. At least that's the way I'm reading it.

dmpetrov commented 2 years ago

And dvc update on this file would work the other way around - it would upload the file to s3 (exporting).

It looks like the direction of the upload is your major concern. Is that correct?

Also, if we go back to the From local to Cloud/S3 workflow: it states that we create the model as a local file, so does that mean the update will also happen locally when we retrain it?

It means the upload happens as a result of dvc export. It is decoupled from training, and you are supposed to re-upload the file with dvc commands. In this case, changing the direction of dvc update might be a better choice from a workflow point of view.

dberenbaum commented 2 years ago

From local to Cloud/S3

In this scenario, the user has their own local model.h5 file already. It may or may not be tracked by DVC. If it is tracked by DVC, it might be tracked in model.h5.dvc or within dvc.lock (if it's generated by a DVC stage).

If they want to upload to the cloud and keep a pointer locally, dvc export can be equivalent to dvc run --external -n upload_data -d model.h5 -o s3://testproject/model.h5 aws s3 cp model.h5 s3://testproject/model.h5. This is the inverse of import-url, as shown in the example in https://dvc.org/doc/command-reference/import-url#description.
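
For reference, that dvc run invocation would translate to roughly this stage in dvc.yaml (sketch; the external output is simply listed by its URL):

stages:
  upload_data:
    cmd: aws s3 cp model.h5 s3://testproject/model.h5
    deps:
    - model.h5
    outs:
    - s3://testproject/model.h5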

As @shcheklein noted, the workflow here assumes the user saves updates locally, so it makes sense for update to go in the upload direction and enforce a canonical workflow of save locally -> upload new version.

Similar to how import-url records the external path as a dependency and the local path as an output, export can record the local path as a dependency and the external path as an output. Since a model.h5.dvc file may already exist from a previous dvc add (with model.h5 as an output), it might make more sense to save the export info with some other file extension, like model.h5.export.dvc (this avoids conflicts between the dependencies and outputs of each).
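
A hypothetical model.h5.export.dvc along these lines - purely illustrative, inverting import-url's deps/outs - might look like:

deps:
- md5: 3d1e9a...                  # hash of the local file
  path: model.h5
outs:
- etag: 9f8e7d...                 # remote object's ETag after upload
  path: s3://testproject/model.h5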

I'll follow up on the other scenarios in another comment to keep this from being too convoluted 😅

Edit: On second thought, maybe it's better to resolve this scenario first 😄 . The others might require a separate discussion.

dmpetrov commented 2 years ago

If we go with the bi-directional dvc update, then we are splitting it into two major cases:

  1. Local to Storage. It should be based on external outputs, similar to dvc run --external -o s3://.... dvc update file.dvc uploads the file to the cloud.
    • a. No-DVC file. It is straightforward: the export command just creates a .dvc file.
    • b. DVC file.
      • Pipeline file (.lock). Q: Should the pipeline do an automatic upload if a result of dvc export is in a pipeline? To my mind it is not necessary, since it would require quite strong assumptions about productization and performance.
      • Data file (.dvc). dvc export should generate a slightly different file.external.dvc in addition to file.dvc. Q: it does not seem like a default use case; can we assume that the user will do the renaming manually with dvc export -f file.export.dvc?
  2. Storage to local. It should be based on external dependencies, similar to dvc import-url s3://.... dvc update downloads files from the cloud.
    • a. With a local copy. Just a regular dvc import-url s3://... file.
    • b. Without a local copy. Similar to dvc import-url --no-exec, but better to introduce dvc import-url --etags-only (see above).

@shcheklein @dberenbaum WDYT?

dberenbaum commented 2 years ago

@dmpetrov I think it makes sense. However, I think the "Storage to local" scenarios are a little convoluted.

If model updates are happening external to user code and saved in the cloud, or the user already has a model in the cloud saved previously by their code, import-url makes sense.

If instead they are using dvc to track a model that their user code saves in the cloud, import-url seems awkward because they probably never need a local copy. Even if they use --etags-only, if they use the file downstream, it will need to be downloaded. It's also unintuitive because import-url is intended for downloading dependencies instead of tracking outputs.

An alternative is to change how external outputs work. I put a mini-proposal at the bottom of https://www.notion.so/iterative/Model-Management-in-DVC-af279e36b8be4e929b08df7a491e1a4c. It's still a work in progress, but if you have time, PTAL and see if the direction makes sense.

dmpetrov commented 2 years ago

If instead they are using dvc to track a model that their user code saves in the cloud, import-url seems awkward because they probably never need a local copy.

Right. This does not look like a core scenario.

Just to make it clear - Storage to local covers use cases when a model was created from outside of a repository. Examples: user imports an external model to use GitOps/Model-Registry functionality, importing a pre-trained model or existing dataset.

dberenbaum commented 2 years ago

In your earlier comment, you seemed to indicate that the scope was broader:

From cloud to workspace

Users write models/data to cloud from user's code (or it is already updated by an external tool).

Are we now limiting it to cases where the model was updated by an external tool?

Edit: Or maybe writing models to cloud from user's code is part of "Local to Storage." Either way, I think there's a core scenario for writing models directly to cloud from user's code that isn't covered by export or import-url.

dmpetrov commented 2 years ago

It was described as a broader scenario but the major goal was to cover Lightweight Model Management use case (see "user imports an external model to use GitOps/Model-Registry functionality"). It can be useful in some other scenarios (see "importing a pre-trained model or existing dataset.").

However, importing a model trained in the same repo back into that repo does not make sense. We are introducing From local to Cloud for this.

dmpetrov commented 2 years ago

In the context of model management we nailed down the scope - see #7918

bhack commented 1 year ago

Any news on this? I really want to "materialize" a specific commit to a remote cloud bucket without directly using cloud-specific CLI tools.

efiop commented 1 year ago

@bhack No progress here, sorry.

dberenbaum commented 1 year ago

@bhack Could you explain more about your scenario? One option might be to push to a cloud-versioned remote, which would show files in a more typical directory structure.
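
For reference, a cloud-versioned remote can be set up roughly like this (bucket name is illustrative; the bucket needs object versioning enabled on the cloud side):

$ dvc remote add -d versioned s3://mybucket/dvcstore
$ dvc remote modify versioned version_aware true
$ dvc push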

bhack commented 1 year ago

@dberenbaum In some cases I need to use gcsfuse or similar. As we don't currently have a pull --to-remote option, we need to locally materialize the requested commit on the host filesystem with pull and then sync it to the cloud bucket using native CLIs or libraries. Materializing multiple commits in parallel is also not data efficient.
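
The current workaround looks something like this (sketch; bucket, prefix, and paths are illustrative):

$ git checkout <commit>
$ dvc pull
$ gsutil -m rsync -r data/ gs://mybucket/materialized/<commit>/data/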

dberenbaum commented 1 year ago

Do you mind pulling up to a higher level? Once you sync with the cloud bucket, how are you using gcsfuse and why do you need this setup? When you materialize multiple commits, do you sync each to a different location?

bhack commented 1 year ago

how are you using gcsfuse and why do you need this setup

This is a fairly common (emerging?) ML setup. See: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/tree/main/examples/pytorch https://cloud.google.com/blog/products/containers-kubernetes/announcing-cloud-storage-fuse-and-gke-csi-driver-for-aiml-workloads

When you materialize multiple commits, do you sync each to a different location?

Yes, I need to use at least a different prefix in the destination bucket to replicate the whole data (e.g. keyed by the related dvc/git commit) and handle garbage collection of the materialized commits when they are not needed anymore.

At least I can't find any off-the-shelf dvc solution to efficiently manage this.

dberenbaum commented 1 year ago

Hi @bhack, sorry for the lack of response here. Would you have time to discuss in more depth on a call sometime?

bhack commented 1 year ago

@dberenbaum Yes, but I think it would be more useful to open a new discussion thread on GitHub so that it can also help other users.