Open · casperdcl opened 3 years ago

(`dvc get-url` + `put-url` = `dvc rsync` :))
Does it make sense to have a separate `put-url` command strictly for uploads, or would it be better to have a combined transfer command? This has both UI and usage implications, and a combined transfer command could enable:

* `dvc get`-equivalent functionality, where remote data in dvc/git (for example, a data/model registry) can be transferred to cloud destinations without cloning the repo or pulling the data.
* Python API (`dvc.api.put_url()`)?
`dvc get-url` and `dvc.api.get_url()` don't really do the same thing unfortunately, so I'm unclear what `dvc.api.put_url()` would do? That confusion might be another reason to prefer a two-way transfer command over separate download and upload commands.
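For context, a minimal illustration of the mismatch (the URL, file names, and printed output below are made up; the exact result of `dvc.api.get_url()` depends on your remote and cache layout):

```console
# dvc get-url downloads from an arbitrary URL (no DVC repo needed)
$ dvc get-url https://example.com/data.csv data.csv

# dvc.api.get_url() resolves a *tracked* path to its location on the project's DVC remote
$ python -c "import dvc.api; print(dvc.api.get_url('data.csv'))"
s3://my-dvc-remote/md5/3a/0b5e...
```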
> `dvc get-url` and `dvc.api.get_url()` don't really do the same thing

:eyes: well that sounds like a problem
> would it be better to have a combined transfer command?

Maybe this wouldn't be too difficult to implement. Summarising what I think the most useful `rsync` options may be:

    dvc rsync [-h] [-q | -v] [-j <number>] [--recursive] [--ignore-existing]
              [--remove-source-files] [--include <pattern>] [--exclude <pattern>]
              [--list-only] src [src...] dst

We've endeavoured to provide buffered read/write methods for all remote types so that we can show progress... So simply gluing said `BufferedIO` methods together should provide the desired functionality.
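For illustration only, a couple of hypothetical invocations based on the synopsis above (neither the command nor the flags exist today, and the bucket paths are made up):

```console
# Mirror a local directory to S3, skipping files that already exist remotely
$ dvc rsync --recursive --ignore-existing -j 8 data/ s3://mybucket/data/

# List what would be transferred between two clouds without copying anything
$ dvc rsync --list-only gs://source-bucket/models/ s3://dest-bucket/models/
```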
> 👀 well that sounds like a problem

😬 Yup, I only realized this pretty recently myself. Extracted to #6494.
The initial motivation for this command was to upload data to storage and preserve a pointer file (`*.dvc` file) to it for future downloads. This command (and its download equivalent) is supposed to work from no-DVC repositories - that is, no dependencies on the `.dvc/` dir.

It seems like we need an "upload equivalent" of `import-url`, not `get-url`.

Proposed name: `dvc export url [--no-exec] FILE URL`, plus renaming some of the existing commands:

* `dvc import-url` - rename to `dvc import url`
* `dvc import` - rename to `dvc import dvc`

The `--no-exec` option is needed for the cases when storage credentials are not set. It means not to upload/download (and not to check for the file's existence), only generating a pointer file. In the case of downloading, the pointer file should have empty checksums.

New commands (usage sketch below):

* `dvc export url FILE URL` - the simplest way to upload a file and preserve a pointer file to it
* `dvc export model FILE URL` - with a set of options to specify meta info for the model (description, model type, input/output type, etc.)
* `dvc export data FILE URL` - with a set of options to specify meta info for the data (description, data type, column names for structured data, class distribution for unstructured, etc.)
* `dvc export dvc FILE URL` (do we need this one?)
* `dvc import model URL FILE`
* `dvc import data URL FILE`

PS: it is related to the idea of lightweight model and data management
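A rough usage sketch of the proposal above (all subcommands and options are proposals, none of them exist in DVC today, and the file/bucket names are made up):

```console
# Upload a local file and keep a pointer (.dvc) file for future downloads
$ dvc export url model.h5 s3://mybucket/ml/prod/model.h5

# Only generate the pointer file, e.g. when storage credentials are not set
$ dvc export url --no-exec model.h5 s3://mybucket/ml/prod/model.h5

# Upload with model metadata attached
$ dvc export model model.h5 s3://mybucket/ml/prod/model.h5

# Renamed download-side equivalent
$ dvc import url s3://mybucket/ml/prod/model.h5 model.h5
```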
> The initial motivation for this command was to upload data to storage and preserve a pointer file (*.dvc file) [...] It seems like, we need an "upload equivalent" of import-url, not get-url.

`export url` sounds like a different feature request & use case to me. `put url` meanwhile is meant to work like `get url`, i.e. no pointer files/metadata.

I also think `put url` is a prerequisite to `export url`, just like `get url` is a prerequisite to `import url`.
`put url` is a good idea and I hope it will be implemented as a part of this initiative. But we should understand that this is not a part of the requirements.

I'd suggest focusing on the core scenario - `export url` - that is required for CML and MLEM.
> the simplest way to upload a file and preserve a pointer file to it

Does it mean `aws s3 cp local_path_in_repo s3://remote_path && dvc import-url s3://remote_path -o local_path_in_repo`?

It sounds then like this should be an extension to `import` itself (move a file before import, or something). I don't feel that this deserves the full name "export" since it won't be doing anything different from `import-url`.
My original intention with this feature request was just the first bit (`aws s3 cp local_path_in_repo s3://remote_path`, or even just `aws s3 cp local_path_NOT_in_repo s3://remote_path`). I don't have any strong opinions about the other import/export functionality & API that can be built on top.
> My original intention with this feature request was just the first bit

Yep, and name `put` makes sense for that.
> > My original intention with this feature request was just the first bit
>
> Yep, and name `put` makes sense for that.

I thought that we are solving CML, MLEM problems with this proposal, aren't we? If not, I'm creating a separate issue for the integrations and we can keep the current one as is.
> If not, I'm creating a separate issue for the integrations and we can keep the current one as is.

I'm fine with this one :)
My question stays - it doesn't feel like it's export. It does copy + import under the hood, right? So why export then? Why not an option for the import command (to copy the artifact to the cloud first)?
@shcheklein yes, it can be just an option of import, like `dvc import url/model/data --upload`.

import/export naming is definitely not perfect. So, an option might be a safer choice in the short term.
> renaming some of the existing commands

See https://github.com/iterative/dvc/issues/6494#issuecomment-906600254 for a proposed syntax to cover both dvc urls and arbitrary urls with one command.
> `--no-exec` option is needed for the cases when storage credentials are not set. It means not to upload/download (and not checking for the file existence), only generating a pointer file.

`--no-exec` exists today in `import` and `import-url` to create the `.dvc` file (with the hash) without downloading. Seems like we need separate flags to skip the check for existence (and adding the hash) and the download.
> I'd suggest focusing on the core scenario - `export url` - that is required for CML and MLEM.

> My original intention with this feature request was just the first bit (`aws s3 cp local_path_in_repo s3://remote_path`, or even just `aws s3 cp local_path_NOT_in_repo s3://remote_path`). I don't have any strong opinions about the other import/export functionality & API that can be built on top.

Seems like it's unclear whether CML needs `put`, `export`, or both. What are the CML use cases for each?
> does it mean "aws s3 cp local_path_in_repo s3://remote_path && dvc import-url s3://remote_path -o local_path_in_repo"?

Hm, I thought `export` would differ from `import` in that updates would always be uploaded from local to remote (instead of downloading from remote to local). Example workflow (sketched below):

1. Save `local/repo/model.h5` as part of model development.
2. Run `dvc export model.h5 s3://delivery_bucket/model.h5` and tell engineers, who consume it for deployment without any DVC knowledge.
3. Update `local/repo/model.h5`.
4. Run `dvc update` and notify engineers that a new model is available for deployment at `s3://delivery_bucket/model.h5`.
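A shell sketch of that workflow, assuming the proposed `dvc export` semantics discussed above (the command, the upload direction of `dvc update`, and the bucket/file names are all hypothetical):

```console
$ python train.py                                    # produces model.h5 locally
$ dvc export model.h5 s3://delivery_bucket/model.h5  # upload + create model.h5.dvc
$ git add model.h5.dvc && git commit -m "deliver model v1"

# ...retrain, model.h5 changes locally...
$ dvc update model.h5.dvc                            # re-upload the changed local file
$ git commit -am "deliver model v2"
```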
> Hm, I thought export would differ from import in that updates would always be uploaded from local to remote

It might make sense to name it `export` then. That's why I was asking about the semantics of it. From the previous discussions (Dmitry's and Casper's) I didn't get it exactly.
I'm trying to aggregate our discussions here and in person into action points:

* `dvc export`, which should upload a local file to a cloud and preserve a link (.dvc file) similar to the result of `dvc import-url`.
* `dvc put-url`. It is not a part of the use cases (see below) but something like this needs to work under the hood of `dvc export` anyway. And it might be handy for other scenarios.
* `dvc import-url --etags-only` (`--no-exec` but it gets etags from the cloud) and/or `dvc update --etags-only`. This is needed to track file statuses when the file is not downloaded locally.

Important: below are user use cases that should help to understand the scenarios.

**From local to Cloud/S3**

A model `out/model.h5` is saved in a local directory: local machine or cloud/TPI or CML; it might be DVC/Git or just a directory like `~/`. The model needs to be uploaded to a specified place/url in a cloud/S3. The user needs to keep the pointer file (.dvc) for future use.

Why the user needs the pointer file: to use `dvc get` to download the file.

$ dvc export out/model.h5 s3://mybucket/ml/prod/my-model.h5
To track the changes with git, run:

    git add out/model.h5.dvc .gitignore

$ git add out/model.h5.dvc
$ git commit -m 'exporting a file'

Note: this command is equivalent to `aws s3 cp file s3://path && dvc import-url s3://path file`. We can consider introducing a separate command to cover the copy part in a cross-cloud way - `dvc put-url`. However, the priority is not high in the context of this scenario.
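For reference, if `dvc export` records the same pointer as `dvc import-url` (as stated above), the resulting `.dvc` file would look roughly like this (a sketch only; the exact fields depend on the DVC version, and the hashes/etag are placeholders):

```console
$ cat out/model.h5.dvc
md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
frozen: true
deps:
- etag: '"0123456789abcdef0123456789abcdef-1"'
  path: s3://mybucket/ml/prod/my-model.h5
outs:
- md5: fedcba9876543210fedcba9876543210
  path: out/model.h5
```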
A model file was changed (as a result of re-training), for example:

$ dvc update out/model.h5.dvc  # It should work now if the Uploading part is based on `import-url`
To track the changes with git, run:

    git add out/model.h5.dvc .gitignore

$ git add out/model.h5.dvc
$ git commit -m 'File was changed in S3'
**From cloud to workspace**

Users write models/data to the cloud from the user's code (or it is already updated by an external tool). Saving a pointer to a model file still might be useful. Why: to use `dvc get` to download the file.

After training is done and a file is saved to s3://mybucket/ml/prod/2022-03-07-model.h5:

$ dvc import-url s3://mybucket/ml/prod/2022-03-07-model.h5 my-model.h5.dvc
To track the changes with git, run:

    git add out/model.h5.dvc .gitignore

$ git add out/model.h5.dvc
$ git commit -m 'exporting a file'
In some cases, the user writes a file to storage and does not need a copy in the workspace. `dvc import-url --no-exec` seems like a good option to cover this case.

$ dvc import-url --no-exec s3://mybucket/ml/prod/2022-03-07-model.h5 my-model.h5.dvc
To track the changes with git, run:

    git add out/model.h5.dvc .gitignore

$ git add out/model.h5.dvc
$ git commit -m 'exporting a file'

Technically, the file will still have a virtual representation in the workspace as `my-model.h5`. However, it won't be materialized until `dvc update my-model.h5.dvc` is called.
Pros/Cons: ([...] `import-url` was called).

To cover the latter cons, we can consider introducing `dvc import-url --etags-only` (`--no-exec` but getting etags from the cloud) and/or `dvc update --etags-only`.
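Hypothetical usage of the proposed flag (it does not exist today; the name and behavior are taken from the comment above):

```console
# Record the cloud file's etag in the .dvc file without downloading the data
$ dvc import-url --etags-only s3://mybucket/ml/prod/2022-03-07-model.h5 my-model.h5

# Later, refresh the recorded etag so that status checks can detect upstream changes
$ dvc update --etags-only my-model.h5.dvc
```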
@dmpetrov Could you clarify this, please: "It should work now if the Uploading part is based on `import-url`" - just expand on this a bit. I'm not sure I understand in which direction files go when you do `dvc update`.

My initial reaction is that `aws s3 cp file s3://path && dvc import-url s3://path file` semantics doesn't deserve a global `dvc export` command, to be honest. It still feels very much like import, not export, since we'll have pretty much an import .dvc file in the repo that detects changes outside and imports a file from a cloud.

External outputs remind me of export: `dvc run -d model.pkl -o s3://model.pkl aws s3 cp model.pkl s3://model.pkl`. It means that every time the model changes in the repo it's being exported to S3.
> Could you clarify this, please: "It should work now if the Uploading part is based on `import-url`"

`dvc update` re-downloads the file. What I mean is, a regular `dvc update out/model.h5.dvc` will work just fine if the result of `dvc export` is the same as `dvc import-url` (in contrast to external outputs, where you need to re-run the pipeline).

The logic is:

* `dvc import-url` - downloading an external file
* `dvc export` - uploading a file to an external storage
* `dvc update` - updating/re-downloading an external file
* `dvc status` - check if a local file is synchronized with its external source

To be honest, I'd rename the first two to `download`, `upload`. If we are mixing up the direction then users will have similar issues.
> My initial reaction is that `aws s3 cp file s3://path && dvc import-url s3://path file` semantics doesn't deserve a global `dvc export` command, to be honest.

`aws s3 cp` is not an option here because we need to abstract out from clouds. Alternatively, we can consider `dvc put-url file s3://path && dvc import-url s3://path file`, but having a single command still looks like a better option.

> External outputs remind me of export.

Yes, but the internal machinery and logic are very different. You need a pipeline for external outputs, which is not compatible with the no-DVC requirements and won't be intuitive for users.
> the result of dvc export is the same as dvc import-url

That's exactly the sign that we are mixing the semantics.

> You need a pipeline for external outputs which is not compatible with no-DVC requirements and won't be intuitive for users.

Not necessarily, btw: `dvc add --external s3://mybucket/existing-data` works (at least it worked before).
> aws s3 cp is not an option here because we need to abstract out from clouds

Yep, I understand. It's not so much about the redundancy of a command, it's more about the semantics still. It confuses me a bit that export internally does import.

For example, we can make `dvc export` create a .dvc file with a single dependency on model.pkl and an external output to s3://model.pkl - something like the result of `dvc add --external s3://mybucket/existing-data`, but that also saves information (if it's needed) about the local file name that was the source.

And `dvc update` on this file would work the other way around - it would be uploading the file to s3 (exporting).

> but having a single command still looks a better option.

If we want to keep this semantics (an import link created inside export), I would probably even prefer to have `put-url` and do `import-url` manually. It would be less confusing and very explicit, to my mind.

Also, if we go back to the "From local to Cloud/S3" workflow: it states that we create the model as a local file, so does it mean that updates will also be happening locally when we retrain it? That means `dvc update` should be uploading the new file in this case. At least that's the way I'm reading it.
> And `dvc update` on this file would work the other way around - it would be uploading the file to s3 (exporting).

This looks like the direction of the upload is your major concern. Is that correct?

> Also, if we go back to the "From local to Cloud/S3" workflow: it states that we create the model as a local file, so does it mean that updates will also be happening locally when we retrain it?

It means the upload happens as a result of `dvc export`. It is decoupled from training, and you are supposed to re-upload the file with dvc commands. In this case, changing the direction of `dvc update` might be a better choice from a workflow point of view.
In this scenario, the user has their own local `model.h5` file already. It may or may not be tracked by DVC. If it is tracked by DVC, it might be tracked in `model.h5.dvc` or within `dvc.lock` (if it's generated by a DVC stage).

If they want to upload to the cloud and keep a pointer locally, `dvc export` can be equivalent to `dvc run --external -n upload_data -d model.h5 -o s3://testproject/model.h5 aws s3 cp model.h5 s3://testproject/model.h5`. This is the inverse of `import-url`, as shown in the example in https://dvc.org/doc/command-reference/import-url#description.

As @shcheklein noted, the workflow here assumes the user saves updates locally, so it makes sense for `update` to go in the upload direction and enforce a canonical workflow of save locally -> upload new version.

Similar to how `import-url` records the external path as a dependency and the local path as an output, `export` can record the local path as a dependency and the external path as an output. Since a `model.h5.dvc` file may already exist from a previous `dvc add` (with `model.h5` as an output), it might make more sense to save the export info with some other file extension, like `model.h5.export.dvc` (this avoids conflicts between the dependencies and outputs of each).
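A sketch of what such a hypothetical `model.h5.export.dvc` could contain, following the "local path as dependency, external path as output" idea above (the file name, field layout, and hash/etag values are all illustrative):

```console
$ cat model.h5.export.dvc
deps:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  path: model.h5
outs:
- etag: '"0123456789abcdef0123456789abcdef-1"'
  path: s3://testproject/model.h5
```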
I'll follow up on the other scenarios in another comment to keep this from being too convoluted 😅
Edit: On second thought, maybe it's better to resolve this scenario first 😄 . The others might require a separate discussion.
If we go to the bi-directional `dvc upload`/`dvc update`, then we are splitting it into two major cases (sketched below):

1. Local to storage: similar to `dvc run --external -o s3://...`. `dvc update file.dvc` uploads the file to the cloud.
   * The `export` command just creates a .dvc file.
   * Q: should `dvc export` be in a pipeline? To my mind, it is not necessary, since we'd need to make quite a strong assumption about productization and performance.
   * `dvc export` should generate a bit different `file.external.dvc` in addition to `file.dvc`. Q: it does not seem like a default use case; can we assume that the user will do the renaming manually via `dvc export -f file.export.dvc`?
2. Storage to local: similar to `dvc import-url s3://...`. `dvc update` downloads files from the cloud.
   * `dvc import-url file s3://...`
   * `dvc import-url --no-exec`, but better to introduce `dvc import-url --etags-only` (see above).

@shcheklein @dberenbaum WDYT?
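A side-by-side sketch of the two directions described above (the `dvc export` command and the upload behavior of `dvc update` are proposals, not existing functionality; paths are made up):

```console
# 1. Local to storage: export creates the pointer, update re-uploads the local file
$ dvc export model.h5 s3://mybucket/ml/prod/model.h5
$ dvc update model.h5.dvc        # pushes the changed local file to the cloud

# 2. Storage to local: import-url creates the pointer, update re-downloads
$ dvc import-url s3://mybucket/ml/prod/model.h5 model.h5
$ dvc update model.h5.dvc        # pulls the changed cloud file into the workspace
```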
@dmpetrov I think it makes sense. However, I think the "Storage to local" scenarios are a little convoluted.
If model updates are happening external to user code and saved in the cloud, or the user already has a model in the cloud saved previously by their code, `import-url` makes sense.

If instead they are using dvc to track a model that their user code saves in the cloud, `import-url` seems awkward because they probably never need a local copy. Even if they use `--etags-only`, if they use the file downstream, it will need to be downloaded. It's also unintuitive because `import-url` is intended for downloading dependencies instead of tracking outputs.
An alternative is to change how external outputs work. I put a mini-proposal at the bottom of https://www.notion.so/iterative/Model-Management-in-DVC-af279e36b8be4e929b08df7a491e1a4c. It's still a work in progress, but if you have time, PTAL and see if the direction makes sense.
> If instead they are using dvc to track a model that their user code saves in the cloud, import-url seems awkward because they probably never need a local copy.

Right. This does not look like a core scenario.

Just to make it clear - "Storage to local" covers use cases when a model was created outside of a repository. Examples: the user imports an external model to use GitOps/Model-Registry functionality, or imports a pre-trained model or an existing dataset.
In your earlier comment, you seemed to indicate that the scope was broader:

> From cloud to workspace
>
> Users write models/data to cloud from user's code (or it is already updated by an external tool).

Are we now limiting it to cases where the model was updated by an external tool?

Edit: Or maybe writing models to the cloud from user's code is part of "Local to Storage." Either way, I think there's a core scenario for writing models directly to the cloud from user's code that isn't covered by `export` or `import-url`.
It was described as a broader scenario, but the major goal was to cover the Lightweight Model Management use case (see "user imports an external model to use GitOps/Model-Registry functionality"). It can be useful in some other scenarios (see "importing a pre-trained model or existing dataset").

However, importing a model trained in the same repo back into it does not make sense. We are introducing "From local to Cloud" for this.

In the context of model management we nailed down the scope - see #7918.
Any news on this? I really want to "materialize" a specific commit to a remote cloud bucket without directly using cloud-specific CLI tools.
@bhack No progress here, sorry.
@bhack Could you explain more about your scenario? One option might be to push to a cloud-versioned remote, which would show files in a more typical directory structure.
@dberenbaum In some cases I need to use gcsfuse or similar.
As we don't currently have a `pull --to-remote` option, we need to locally materialize the requested commit on the host filesystem with pull, and then sync with the cloud bucket using a native CLI or libraries.

Materializing multiple commits in parallel is also not data efficient.
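Roughly the workaround being described, for illustration (the bucket, prefix, and the use of `gsutil` are just examples of a "native CLI" sync):

```console
# Materialize the desired commit locally...
$ git checkout <commit>
$ dvc pull

# ...then mirror it into a bucket prefix keyed by that commit
$ gsutil -m rsync -r data/ gs://my-bucket/materialized/<commit>/data/
```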
Do you mind pulling up to a higher level? Once you sync with the cloud bucket, how are you using gcsfuse and why do you need this setup? When you materialize multiple commits, do you sync each to a different location?
> how are you using gcsfuse and why do you need this setup

As this is a quite common (emerging?) ML setup. See: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/tree/main/examples/pytorch https://cloud.google.com/blog/products/containers-kubernetes/announcing-cloud-storage-fuse-and-gke-csi-driver-for-aiml-workloads

> When you materialize multiple commits, do you sync each to a different location?

Yes, I need to use at least a different prefix in the destination bucket to replicate the whole data, e.g. keyed by the related dvc/git commit, and handle the garbage collection of the materialized commits when they are not needed anymore.

At least I don't find any off-the-shelf `dvc` solution to efficiently manage this.
Hi @bhack, sorry for the lack of response here. Would you have time to discuss in more depth on a call sometime?
@dberenbaum Yes, but I think it will be more useful to open a new discussion thread on GitHub so that it can also be useful for other users.
Summary

An upload equivalent of `dvc get-url`.

We currently use `get-url` as a cross-platform replacement for `wget`. However, together with `get-url`, `put-url` will turn DVC into a replacement for `rsync`/`rclone`.

Motivation

* [...] `get-url`, so adding `put-url` seems natural for the same reasons
* [...] `put-url` will be used by `rsync`/`rclone`. What's not to love?

Detailed Design

How We Teach This

`put-url` seems to be in line with the existing `get-url` (vis. HTTP `GET` & `PUT`)

Drawbacks

Alternatives

Unresolved Questions

* [...] (`put-url`)?
* [...] (`url targets [targets...]`)?
* Python API (`dvc.api.put_url()`)?

Please do assign me if happy with the proposal.