
cloud versioning: multiple remotes/metadata format #8356

Closed · dberenbaum closed this issue 1 year ago

dberenbaum commented 2 years ago

Part of #7995

Description

Cloud versioning does not support multiple remotes.

Reproduction

You can reproduce with the script below. Replace BUCKET, REMOTE_1, and REMOTE_2 with your own S3 bucket and prefixes, and configure your credentials. The bucket must have S3 versioning enabled for the version_aware remotes to work.

BUCKET=mybucket
REMOTE_1=remote1
REMOTE_2=remote2

export AWS_PROFILE=iterative-sandbox

echo "Get repo."
rm -rf repo
git init repo
cd repo
dvc init

echo "Add two cloud-versioned DVC remotes."
dvc remote add -d cloud-1 s3://$BUCKET/$REMOTE_1
dvc remote modify cloud-1 version_aware true
dvc remote modify cloud-1 worktree true
dvc remote add cloud-2 s3://$BUCKET/$REMOTE_2
dvc remote modify cloud-2 version_aware true
dvc remote modify cloud-2 worktree true
git add .
git commit -m "initialized repo"

echo "Add data"
mkdir data
echo image1 > data/image1.png
echo image2 > data/image2.png
echo model > model.h5
dvc add data
dvc add model.h5
git add .
git commit -m "add data"

echo "Push data to default remote"
dvc push
git --no-pager diff
git commit -am "push data to default remote"

echo "Push data to other remote"
dvc push -r cloud-2
git --no-pager diff # Problem 1: overwrites all version_ids
git commit -am "push data to other remote"

echo "Push to different remotes per output"
echo "  remote: cloud-2" >> model.h5.dvc
dvc push
git --no-pager diff # Problem 2: pushes model.h5 to the default remote

echo "See model.h5 versions on remote 1"
aws s3api list-object-versions --bucket $BUCKET --prefix $REMOTE_1/model.h5
echo "See model.h5 versions on remote 2"
aws s3api list-object-versions --bucket $BUCKET --prefix $REMOTE_2/model.h5

Expected

  1. DVC should keep track of the remote for each version_id, so that when pushing to a different remote, it appends a version_id for the other remote instead of overwriting it (see the sketch below).
  2. When using the remote-per-output remote: syntax in a .dvc file, DVC should push to that remote instead of the default.
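For example, after pushing to both remotes, model.h5.dvc could record something like the following. This is only a hypothetical sketch of the metadata, with IDs elided:

outs:
- md5: ...
  path: model.h5
  cloud:
    cloud-1:
      version_id: ...
    cloud-2:
      version_id: ...
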
pmrowla commented 2 years ago

Is the push/pull --remote use case actually something that we want to support? I can see the value in supporting multiple remotes per repo via the explicit remote: per-output syntax, and in that case we aren't actually supporting multiple remotes per output.

Supporting multiple remotes per-output leads to questionable behavior unless the user is always pushing everything to all of the remotes they intend to use, every time they make a modification to data in their DVC repo.

pmrowla commented 2 years ago

This is similar to the problem in https://github.com/iterative/dvc/issues/8298, where the existing behavior for push/pull --remote is pretty vague as to what the expected behavior (and intended use case) for using multiple remotes with DVC should be.

I think there are mainly two common use cases for having multiple remotes in DVC:

  1. Data segregation: different outputs get pushed to different remotes.
  2. Read-only vs. read/write access: consumers pull from a restricted remote while producers push through a writable one.

In practice, the data segregation use case only works in DVC if the user is already using remote: in dvc.yaml/.dvc files to specify one remote per output, or if the user is manually pushing/pulling specific files and dirs each time with a specific --remote

dvc push --remote some-remote path/to/explicit/path

In that case, each output is actually still only being pushed to a single remote, and the user should probably be using remote: in their dvc.yaml/.dvc file instead of --remote in the CLI, as sketched below.
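Something like this in the .dvc file (placeholder names):

outs:
- md5: ...
  path: path/to/explicit/path
  remote: some-remote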


For the read-only vs read/write use case, it works in non-cloud versioning scenarios since you can do things like set up an S3 remote with write credentials, and then use the public https://... URL for that bucket as a default read-only remote.

For DVC's purposes, this is not possible to do in a cloud versioning scenario, unless your default read-only remote is still the s3://... bucket URL, but with actual S3 credentials that allow public reads. You cannot just point DVC to the https:// bucket URL to get a read-only remote in this scenario. What you would need to do is dvc push everything to your https:// remote and get versioned URLs for that remote, which isn't possible (since you can't push to a public https:// S3 URL).
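For reference, the working non-cloud-versioning setup looks something like this (bucket name is a placeholder):

# writers push over s3 with credentials
dvc remote add s3-push s3://mybucket/dvcstore
# readers use the public https endpoint as the default (read-only) remote
dvc remote add -d https-read https://mybucket.s3.amazonaws.com/dvcstore

dvc push -r s3-push    # requires write credentials
dvc pull               # anonymous read over https
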

dberenbaum commented 2 years ago

In practice, the data segregation use case only works in DVC if the user is already using remote: in dvc.yaml/.dvc files to specify one remote per output, or if the user is manually pushing/pulling specific files and dirs each time with a specific --remote

True, so for this use case, it may not be needed to append multiple version_id values for a single output. However, right now the remote: field isn't respected at all for cloud versioning AFAICT.

For DVC's purposes, this is not possible to do in a cloud versioning scenario, unless your default read-only remote is still the s3://... bucket URL, but with actual S3 credentials that allow public reads.

I think users may want private but still read-only cloud remotes (the bucket policy grants the user read permissions but not write permissions).


A 3rd use case is migrating remotes (let's say a company switches cloud providers). Without keeping track of the remote that generated each version_id, this seems like it will make it nearly impossible to keep using your existing Git repo since it will be full of version_id values that will throw errors when used on the new remote. At least DVC would know to ignore these if it were aware of the remote associated with them.

This info should probably be separate from the existing remote: field in .dvc/dvc.yaml (it belongs in dvc.lock and not dvc.yaml). We need to differentiate between configuration/declaration (defining which remote to use for push/pull operations) and state (where the output has been saved and what is its version ID there).
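Roughly, the split could look like this (purely a sketch of the idea, not a proposed schema):

# dvc.yaml (declaration): which remote this output should be pushed to
outs:
  - model.h5:
      remote: cloud-2

# dvc.lock (state): where it has actually been saved, and as what version
outs:
  - path: model.h5
    md5: ...
    cloud:
      cloud-2:
        version_id: ...
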

dberenbaum commented 1 year ago

Moving this to p1 since it will be a breaking change and keeps coming up

pmrowla commented 1 year ago

see also: https://github.com/iterative/dvc/issues/8653#issuecomment-1336590003

dberenbaum commented 1 year ago

Moving this to p0 since it blocks any public release of cloud versioning

skshetry commented 1 year ago

Thinking more about this, I think it would be better to only support a single remote for cloud versioning (unless the remote: field is used). Maybe we should not even support dvc push -r <remote>, and always expect a default remote to be set for cloud versioning?

Since the versioning happens in the cloud and not in DVC, I think it is easier if it is tied to a single remote. Still working on this, but I am not very convinced by the above reasons for the need. :)

pmrowla commented 1 year ago

I agree that it's questionable whether or not we should support pushing a single DVC output to more than one remote, but I think we do need to at least support multiple (non-default) remotes in a DVC repo. This could be done via the existing per-output remote: field (see my earlier comment https://github.com/iterative/dvc/issues/8356#issuecomment-1260567812)

pmrowla commented 1 year ago

A 3rd use case is migrating remotes (let's say a company switches cloud providers).

@dberenbaum the thing to remember is that we can't support migrating cloud-versioned remotes the same way that we can with regular remotes. For cloud versioning, users will have to maintain the existing remote for as long as they want to access the data versioned by older git commits. We can technically push historical data to a new remote as object versions in the new remote, but we cannot actually update the old .dvc or dvc.lock files (unless we rewrite git history to replace/update every historical git commit with new .dvc/dvc.lock files). So even if we supported pushing old data to a new remote, there would be no way for the user to actually access it afterwards.

Supporting multiple remotes would really only work in a workflow where the user is explicitly pushing a file to multiple remotes all at once, and then git commits the .dvc/dvc.lock file after pushing to the multiple locations (and I'm not sure this is a useful/needed scenario at all?)

Without keeping track of the remote that generated each version_id, this seems like it will make it nearly impossible to keep using your existing Git repo since it will be full of version_id values that will throw errors when used on the new remote. At least DVC would know to ignore these if it were aware of the remote associated with them.

We could work around this by making DVC use the remotes defined in the DVC config for a particular git commit (instead of the current behavior where we use the latest commit's remote and try to get all historical DVC objects from that one remote).
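The remote definitions for any old commit are already tracked in git, so conceptually the lookup is just (with <old-commit> as a placeholder):

# the remote config as it existed at that commit, not the latest one
git show <old-commit>:.dvc/config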

I think users may want private but still read-only cloud remotes (the bucket policy grants the user read permissions but not write permissions).

For regular DVC remotes, you can do things like define a default read-only http remote alongside a separate s3 bucket remote for pushing, so both remotes actually point to the same data (via s3/azure/gcs http URLs). But for cloud versioning this scenario doesn't work: it would require doing a dvc push to the "read-only" remote to get the correct version IDs, which is impossible here since you can't push versioned data to an http remote.

For cloud versioning, the only practical way to handle this for versioned remotes would be for users to manage it on the cloud credentials side. So you would only define a single s3 remote, and then manage access through AWS permissions (where some users only have read-only access to a bucket, and some users have write/delete privileges).
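On AWS that would look something like this for the read-only users (the user/policy names are made up, and the policy is only a sketch):

aws iam put-user-policy --user-name dvc-reader --policy-name dvc-read-only \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:GetObjectVersion",
                 "s3:ListBucket", "s3:ListBucketVersions"],
      "Resource": ["arn:aws:s3:::mybucket", "arn:aws:s3:::mybucket/*"]
    }]
  }'
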

pmrowla commented 1 year ago

It seems like maybe we should just always explicitly set remote: on outputs whenever we push to a cloud versioned remote (if the user has not set remote: themselves already), and from then on the given output (and pushed version IDs) would stay tied to that remote.
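i.e. a push like dvc push -r cloud-1 would leave something like this in the .dvc file (hypothetical result):

outs:
- md5: ...
  path: model.h5
  remote: cloud-1      # set automatically on first push
  cloud:
    cloud-1:
      version_id: ...
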

skshetry commented 1 year ago

We could work around this by making DVC use the remotes defined in the DVC config for a particular git commit (instead of the current behavior where we use the latest commit's remote and try to get all historical DVC objects from that one remote).

Is that how it works right now for cloud versioning? Only the latest config is used?

pmrowla commented 1 year ago

Is that how it works right now for cloud versioning? Only the latest config is used?

For cloud versioning right now, we always use either an explicit --remote if it is set via the CLI, or the latest config's default remote (this is the bug @dberenbaum noted earlier where remote: isn't respected for versioning). https://github.com/iterative/dvc/blob/fc1a192c90d7abd174afd4f8329593609c2115d7/dvc/repo/worktree.py#L80-L84 https://github.com/iterative/dvc/blob/fc1a192c90d7abd174afd4f8329593609c2115d7/dvc/repo/worktree.py#L132-L133

We do actually fill the values with the correct remote per-commit when we generate the repo index for a given commit, so fixing this for fetch/pull would really just require fixing the bug so that we use the existing index-filled remote instead of replacing it.

For push we don't support pushing old commits since we can't rewrite git history, but we still have the same bug (with replacing the index-filled remote) to fix there as well. For push this is also a dvc-data index.checkout issue, since checkout only supports checking out to a specific single fs/path. To support respecting remote: we probably need to split push into separate checkout operations (for each remote we are pushing to) and generate the appropriate indexes for each operation.

dberenbaum commented 1 year ago

I agree that handling multiple remotes is not a critical use case. This is a p0 because it defines the metadata structure for cloud versioning. Maybe @pmrowla described the problem better in https://github.com/iterative/dvc-data/issues/241. If you look back through this issue and notice how many times it has already been referenced by other issues during the early development of cloud versioning, it should be clear that having no information about the remote associated with the etag/version_id has quickly become a mess, so we need to keep this info in the .dvc file. Even outside of cloud versioning, imports and other features that reference etags or other external data might benefit from keeping info about their remotes.

Whether to allow one or multiple remote entries is less critical. Allowing multiple seemed cleaner to me, since there is not much to prevent users from configuring multiple remotes and pushing to them, and I don't know how DVC would handle those situations, but it's not a blocker as a product requirement.

A 3rd use case is migrating remotes (let's say a company switches cloud providers). Without keeping track of the remote that generated each version_id, this seems like it will make it nearly impossible to keep using your existing Git repo since it will be full of version_id values that will throw errors when used on the new remote. At least DVC would know to ignore these if it were aware of the remote associated with them.

@pmrowla Migrating remotes is also not a blocker, but looking at the full comment here about migrating remotes, it still seems like there are benefits to saving the remote info:

  1. If you change the cloud-versioned remote, dvc will have a record that the version_id from those old commits is associated with the old remote.
  2. The md5 info could still be used to migrate old commits to a cache/non-cloud-versioned remote.
skshetry commented 1 year ago

Do we plan to support content-addressable storage and cloud-versioned storage at the same time? At the moment, we have the worktree/version_aware flags at the remote level.

skshetry commented 1 year ago

Also, what should happen if you do dvc push -r <content-addressable-storage>, but you have a cloud-versioned remote specified on one of the outputs, or vice versa?

pmrowla commented 1 year ago

Since we really only support pushing to a single remote now, should we be setting remote: on these outputs when we push to a cloud versioned remote? You can currently do

$ dvc push -r remote-A some-file
some-file.dvc now contains:
cloud:
  remote-A:
    version_id: ...

$ dvc push -r remote-B some-file
some-file.dvc now contains (remote-A's entry is replaced):
cloud:
  remote-B:
    version_id: ...

but I'm not sure it makes sense for us to allow this behavior.

cloud: ... only exists in the serialized .dvc file right now, and is not accessible from Output instances. For handling dvc update --no-download, and for removing support for update --remote ..., it would make more sense for us to set remote: on each output and then get the update from that remote; otherwise we have to re-parse the .dvc file to find the first remote listed in cloud:.

skshetry commented 1 year ago

I think we should just prevent the -r <remote> workflow for cloud versioning, and expect a default remote to be set.

What you are saying makes sense, but I think that's more of a product question about how dvc push -r <remote> should work: whether we should restrict pushes to only one remote or allow pushing to other remotes too (one of the scenarios above was being able to change remotes).

meta.remote does have the remote set, however.

dberenbaum commented 1 year ago

I think we should just prevent the -r <remote> workflow for cloud versioning, and expect a default remote to be set.

It's fine to drop -r <remote> support (I'm not even sure it's needed without cloud versioning). However, we should respect the remote: field rather than only working with a default remote: there might be a dataset that lives on a cloud-versioned remote whereas the rest of the project gets pushed to a traditional remote. My only concern here is that I think there are still some open issues around remote: support.

it would make more sense for us to be setting remote: on each output, and then getting the update from that remote

Sounds good. It probably makes sense to set the top-level remote: field and keep the cloud: field as is (including the remote-A: key). For example, in the case of migration, you could update to remote: remote-B, and we would at least know that the cloud info for remote-A is now invalid.
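So after a migration the file might look something like this (sketch):

outs:
- md5: ...
  path: some-file
  remote: remote-B     # declaration: where pushes/pulls go now
  cloud:
    remote-A:          # state: version info recorded on the old remote
      version_id: ...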