Closed dberenbaum closed 1 year ago
Is the push/pull `--remote` use case actually something that we want to support? I can see the value in supporting multiple remotes per repo via the explicit `remote:` per-output syntax, and in that case we aren't actually supporting multiple remotes per output. Supporting multiple remotes per output leads to questionable behavior unless the user pushes everything to all of the remotes they intend to use, every time they modify data in their DVC repo. This is similar to the problem in https://github.com/iterative/dvc/issues/8298, where the existing push/pull `--remote` behavior is vague about what the expected behavior (and intended use case) for multiple remotes in DVC should be.
I think there are mainly two common use cases for having multiple remotes in DVC:

- Read-only vs. read/write access
- Data segregation (one remote per output)

In practice, the data segregation use case only works in DVC if the user is already using `remote:` in dvc.yaml/.dvc files to specify one remote per output, or if the user is manually pushing/pulling specific files and dirs each time with a specific `--remote`:

```
dvc push --remote some-remote path/to/explicit/path
```

In this case, they are still only pushing each output to a single remote (per output), and should probably be using `remote:` in their dvc.yaml/.dvc file instead of `--remote` in the CLI.
For the read-only vs. read/write use case, it works in non-cloud-versioning scenarios, since you can do things like set up an S3 remote with write credentials and then use the public `https://...` URL for that bucket as a default read-only remote. For DVC's purposes, this is not possible in a cloud versioning scenario, unless your default read-only remote is still the `s3://...` bucket URL, but with actual S3 credentials that allow public reads. You cannot just point DVC at the `https://` bucket URL to get a read-only remote in this scenario. What you would need is to `dvc push` everything to your `https://` remote and get versioned URLs back from it, which isn't possible (since you can't push to a public `https://` S3 URL).
> In practice, the data segregation use case only works in DVC if the user is already using `remote:` in dvc.yaml/.dvc files to specify one remote per output, or if the user is manually pushing/pulling specific files and dirs each time with a specific `--remote`

True, so for this use case, it may not be needed to append multiple `version_id` values for a single output. However, right now the `remote:` field isn't respected at all for cloud versioning AFAICT.
> For DVC's purposes, this is not possible to do in a cloud versioning scenario, unless your default read-only remote is still the `s3://...` bucket URL, but with actual S3 credentials that allow public reads.

I think users may want private but still read-only cloud remotes (the bucket policy grants the user read permissions but not write permissions).
A third use case is migrating remotes (let's say a company switches cloud providers). Without keeping track of the remote that generated each `version_id`, this seems like it will make it nearly impossible to keep using your existing Git repo, since it will be full of `version_id` values that will throw errors when used on the new remote. At least DVC would know to ignore these if it were aware of the remote associated with them.

This info should probably be separate from the existing `remote:` field in `.dvc`/`dvc.yaml` (it belongs in `dvc.lock` and not `dvc.yaml`). We need to differentiate between configuration/declaration (defining which remote to use for push/pull operations) and state (where the output has been saved and what its version ID is there).
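As a rough illustration of that declaration/state split (the dvc.lock layout here is a hypothetical sketch, not an implemented schema; stage and file names are illustrative):

```yaml
# dvc.yaml -- declaration: which remote this output should be pushed to
stages:
  train:
    outs:
      - model.pt:
          remote: versioned-remote

# dvc.lock -- state (hypothetical sketch): where the output was actually
# saved and which version ID it has on that remote
stages:
  train:
    outs:
      - path: model.pt
        md5: ...
        cloud:
          versioned-remote:
            version_id: ...
```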
Moving this to p1 since it will be a breaking change and keeps coming up
Moving this to p0 since it blocks any public release of cloud versioning
Thinking more about this, I think it would be better to only support a single remote for cloud versioning (unless the `remote:` field is used). Maybe we should even not support `dvc push -r <remote>`, and always expect a default remote to be set for cloud versioning? Since versioning happens in the cloud and not in DVC, I think it is easier if it is tied to a single remote. Still working on this, but I am not very convinced by the above reasons for the need. :)
I agree that it's questionable whether we should support pushing a single DVC output to more than one remote, but I think we do need to at least support multiple (non-default) remotes in a DVC repo. This could be done via the existing `remote:` field per output (see my earlier comment: https://github.com/iterative/dvc/issues/8356#issuecomment-1260567812).
> A 3rd use case is migrating remotes (let's say a company switches cloud providers).

@dberenbaum the thing to remember is that we can't support migrating cloud-versioned remotes the same way that we can with regular remotes. For cloud versioning, users will have to maintain the existing remote for as long as they want access to the data versioned by older Git commits. We can technically push historical data to a new remote as object versions in the new remote, but we cannot actually update the old .dvc or dvc.lock files (unless we rewrite Git history to replace/update every historical Git commit with new .dvc/dvc.lock files). So even if we supported pushing old data to a new remote, there would be no way for the user to actually access it afterwards.

Supporting multiple remotes would really only work in a workflow where the user explicitly pushes a file to multiple remotes all at once, and then `git commit`s the .dvc/dvc.lock file after pushing to the multiple locations (and I'm not sure this is a useful/needed scenario at all?).
> Without keeping track of the remote that generated each version_id, this seems like it will make it nearly impossible to keep using your existing Git repo since it will be full of version_id values that will throw errors when used on the new remote. At least DVC would know to ignore these if it were aware of the remote associated with them.

We could work around this by making DVC use the remotes defined in the DVC config for a particular Git commit (instead of the current behavior, where we use the latest commit's remote and try to get all historical DVC objects from that one remote).
> I think users may want private but still read-only cloud remotes (the bucket policy grants the user read permissions but not write permissions).

For regular DVC remotes, you can do things like define a default read-only `http` remote alongside a separate `s3` bucket remote for pushing, so both remotes actually point to the same data (via the s3/azure/gcs HTTP URLs). But for cloud versioning this scenario doesn't work: it would require doing a `dvc push` to the "read-only" remote to get the correct version IDs, which is impossible here since you can't push versioned data to an HTTP remote.

For cloud versioning, the only practical way to handle this for versioned remotes would be for users to manage it on the cloud credentials side. So you would only define a single s3 remote, and then manage access through AWS permissions (where some users only have read-only access to a bucket, and some users have write/delete privileges).
It seems like maybe we should just always explicitly set `remote:` on outputs whenever we push to a cloud-versioned remote (if the user has not set `remote:` themselves already), and from then on the given output (and pushed version IDs) would stay tied to that remote.
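A sketch of what that could look like in a .dvc file after the first push to a cloud-versioned remote (the remote name is illustrative, and the auto-set `remote:` field is the proposal here, not current behavior):

```yaml
outs:
- path: data.csv
  md5: ...
  # proposed: set automatically on first push if the user didn't set it
  remote: versioned-remote
  cloud:
    versioned-remote:
      version_id: ...
```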
> We could work around this by making DVC use the remotes defined in the DVC config for a particular git commit (instead of the current behavior where we use the latest commit's remote and try to get all historical DVC objects from that one remote).

Is that how it works right now for cloud versioning? Only the latest config is used?
> Is that how it works right now for cloud versioning? Only the latest config is used?

For cloud versioning right now, we always use either an explicit `--remote` if it is set via the CLI, or the latest config's default remote. (This is the bug @dberenbaum noted earlier, where `remote:` isn't respected for versioning.)

https://github.com/iterative/dvc/blob/fc1a192c90d7abd174afd4f8329593609c2115d7/dvc/repo/worktree.py#L80-L84
https://github.com/iterative/dvc/blob/fc1a192c90d7abd174afd4f8329593609c2115d7/dvc/repo/worktree.py#L132-L133

We do actually fill in the correct remote per commit when we generate the repo index for a given commit, so fixing this for `fetch`/`pull` would really just require fixing the bug so that we use the existing index-filled `remote` instead of replacing it.

For `push`, we don't support pushing old commits since we can't rewrite Git history, but we still have the same bug (replacing the index-filled remote) to fix there as well. For `push` this is also a dvc-data `index.checkout` issue, since checkout only supports checking out to a single specific fs/path. To support respecting `remote:`, we probably need to split `push` into separate checkout operations (one for each remote we are pushing to) and generate the appropriate indexes for each operation.
I agree that handling multiple remotes is not a critical use case. This is a p0 because it defines the metadata structure for cloud versioning. Maybe @pmrowla described the problem better in https://github.com/iterative/dvc-data/issues/241. If you look up through this issue and notice how many times it has already been referenced by other issues in the early development of cloud versioning, it might help clarify that having no information about the remote associated with the etag/version_id has quickly become a mess, so we need to keep this info in the .dvc file. Even outside of cloud versioning, imports and other features that reference etags or other external data might benefit from keeping info about their remotes.

Whether to allow one or multiple remote entries is less critical. It seemed cleaner to me, since there is not much to prevent users from configuring multiple remotes and pushing to them, and I don't know how DVC would handle those situations, but it's not a blocker as a product requirement.
> A 3rd use case is migrating remotes (let's say a company switches cloud providers). Without keeping track of the remote that generated each version_id, this seems like it will make it nearly impossible to keep using your existing Git repo since it will be full of version_id values that will throw errors when used on the new remote. At least DVC would know to ignore these if it were aware of the remote associated with them.

@pmrowla Migrating remotes is also not a blocker, but looking at the full comment here about migrating remotes, it still seems like there are benefits to having the remote info saved:

- DVC would know that the `version_id` from those old commits is associated with the old remote.
- `md5` info could still be used to migrate old commits to a cache/non-cloud-versioned remote.

Do we plan to support content-addressable storage and cloud-versioned storage at the same time? At the moment, we have the `worktree`/`version_aware` flag at the remote level.

Also, what should happen if you do `dvc push -r <content-addressable-storage>`, but you have a remote `cloud-versioned-storage` specified in one of the outputs, or vice versa?
Since we really only support pushing to a single remote now, should we be setting `remote:` on these outputs when we push to a cloud-versioned remote? You can currently do

```
$ dvc push -r remote-A some-file
```

and the some-file.dvc file now contains

```yaml
cloud:
  remote-A:
    version_id: ...
```

Then after

```
$ dvc push -r remote-B some-file
```

some-file.dvc now contains

```yaml
cloud:
  remote-B:
    version_id: ...
```

but I'm not sure it makes sense for us to allow this behavior.
`cloud: ...` only exists in the serialized .dvc file right now, and is not accessible from `Output` instances. For handling `dvc update --no-download`, and for removing support for `update --remote ...`, it would make more sense for us to set `remote:` on each output and then get the update from that remote; otherwise we have to re-parse the .dvc file to find the first remote listed in `cloud:`.
I think we should just prevent the `-r <remote>` workflow for cloud versioning, and expect a default remote to be set.
What you are saying makes sense, but I think that's more of a product question about how `dvc push -r <remote>` should work: whether we should restrict pushes to only one remote or allow pushing to other remotes too (one of the scenarios above was being able to change remotes).

`meta.remote` does have remotes set, however.
> I think we should just prevent the `-r <remote>` workflow for cloud versioning, and expect a default remote to be set.

It's fine to drop `-r <remote>` support (I'm not even sure it's needed without cloud versioning). However, we should respect the `remote:` field rather than only working with a default remote. There might be a specific remote for a dataset that lives on a cloud-versioned remote, whereas the rest of the project gets pushed to a traditional remote. My only concern here is that I think there are still some open issues around `remote:` support.
> it would make more sense for us to be setting remote: on each output, and then getting the update from that remote

Sounds good. It probably makes sense to set the top-level `remote:` field and keep the `cloud:` field as is (including the `remote-A:` key). For example, in the case of migration, you could update to `remote: remote-B`, and we would at least know that the `cloud` info for `remote-A` is now invalid.
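A sketch of what that migrated state might look like in a .dvc file (hypothetical layout; remote names are illustrative):

```yaml
outs:
- path: data.csv
  md5: ...
  # declaration: push/pull now targets the new remote
  remote: remote-B
  cloud:
    # state: this version_id is known to belong to remote-A,
    # so DVC can treat it as invalid for remote-B
    remote-A:
      version_id: ...
```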
Part of #7995

Description

Cloud versioning does not support multiple remotes.

Reproduction

You can reproduce with the script below. Replace `CLOUD_REMOTE_1` and `CLOUD_REMOTE_2` with your own S3 paths and configure your credentials.

Expected

- DVC should track the remote for each `version_id`, so that when pushing to a different remote, it appends a `version_id` for the other remote instead of overwriting it.
- When using the `remote:` syntax in a `.dvc` file, DVC should push to that remote instead of the default.
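For example, after pushing the same output to both remotes, the expected .dvc entry would keep a `version_id` per remote instead of overwriting (a sketch assuming the `cloud:` layout shown earlier in the thread):

```yaml
cloud:
  remote-A:
    version_id: ...
  remote-B:
    version_id: ...
```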