Open peper0 opened 1 year ago
Thanks for the issue @peper0. The current behavior of dvc import is to always download from source and never push imported data to the remote. So the dvc push in your example should not have any impact. The idea is that in most cases users would rather access the source git repo than have an entire extra copy of the dvc-tracked data in remote storage. There's some related discussion about being able to push imports in #4527.
@dberenbaum what about the push option of the outputs? Shouldn't it decide whether to push the file to the remote?
Yes, ideally import would set push: false and you could change it to push: true to get the behavior you want. Unfortunately, I don't think it's that simple today because import predates the introduction of the push option.
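For reference, here is a rough sketch of how the push field appears on an ordinary stage output in dvc.yaml. The stage name, command, and output path are hypothetical, and per the comment above this should not be read as something that works for imports today.

```yaml
stages:
  prepare:                      # hypothetical stage name
    cmd: python prepare.py      # hypothetical command
    outs:
      - data/prepared:          # hypothetical output path
          push: false           # dvc push skips this output; defaults to true
```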
cc @efiop
If you are open to a hacky workaround for now, you could make a dvc stage that does dvc get, which would track it as a normal output that gets pushed.
@dberenbaum Yes, that's the direction that I'm going to migrate. But it has considerable drawbacks, like no support for dvc update.
@peper0 If your stage cmd looks like dvc get --rev some_rev repo_url path, then you can update the --rev field to get update-like functionality, which AFAIK is more or less what import does. I don't plan to close this issue, since it's a legitimate request to have all this included in import. Hopefully that at least makes it usable for now, because I don't think it's something we can fix quickly or prioritize right now.
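A minimal sketch of that workaround as a dvc.yaml stage, assuming the placeholders some_rev, repo_url, and path from the comment above. The stage name and the -o data target are hypothetical and only there to make the output path explicit.

```yaml
stages:
  fetch_data:                   # hypothetical stage name
    # some_rev, repo_url and path are placeholders; edit --rev to move
    # to a newer revision of the source data
    cmd: dvc get --rev some_rev repo_url path -o data
    outs:
      - data                    # tracked as a normal output, so dvc push uploads it
```

With a stage like this, changing --rev and running dvc repro should re-run the download because the cmd changed, and dvc push should then upload the result like any other output.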
Description
dvc pull clones the repositories from which files were imported, even though the files are cached (have cache: true implicitly or explicitly).
Reproduce
At step 5 the repository is being cloned.
Expected
I expect data to be pushed to the remote by dvc push and pulled from the remote by dvc pull, since the data is cached by default, without accessing the git repository it was imported from (unless dvc update is called).
This is a big problem, since the git repo may not be accessible when dvc pull is called (e.g. when it is called by a CI server). Moreover, it takes a lot of time if data is imported from several repositories and some of them are large.
In my understanding, outputs are synced with the source repository only in dvc update and dvc import, not in dvc pull or dvc repro. Therefore I don't see why the repo would need to be accessible when calling dvc pull.
Environment information
Output of dvc doctor: