Closed Lhenzo closed 1 year ago
The push reports "Everything is up to date.", so it is not actually pushing any data to the remote. It looks like some problem is causing the check for which files exist in the remote to take longer than expected. I'll take a look at the profiling info when I get a chance.
The reason that dvc push does not work the same as dvc diff is that those are two different comparisons. dvc diff is not aware of what data has or has not been pushed into the remote. dvc diff just compares the contents of the repo's .dvc files and reports what changed.
It is possible that only a small number of changes occurred between the two commits, but dvc push still has to verify whether or not all of the data from the older commit was actually pushed into the remote.
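To make the difference between the two comparisons concrete, here is a minimal sketch that models each commit and the remote as plain sets of object hashes. The hash values and function names are made up for illustration, and this is not how DVC is implemented internally.

```python
# Illustrative model only: the hashes and helper names are hypothetical,
# and this does not reflect DVC's internal code.

def diff(old_commit: set, new_commit: set) -> set:
    """What changed between two commits, judged from the .dvc files alone."""
    return old_commit ^ new_commit  # symmetric difference of tracked hashes


def to_push(commit: set, remote: set) -> set:
    """What a push has to upload: tracked objects the remote does not have."""
    return commit - remote


old_commit = {"aaa", "bbb", "ccc"}
new_commit = {"aaa", "bbb", "ccc", "ddd"}
remote = {"aaa"}  # "bbb" and "ccc" were never uploaded

changed = diff(old_commit, new_commit)   # only looks at the two commits
missing = to_push(new_commit, remote)    # has to consult the remote
print(sorted(changed))
print(sorted(missing))
```

In this model the diff contains a single hash, yet the push still has to compare every tracked hash against the remote, so its cost depends on the total number of tracked objects and not on the size of the diff.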
@Lhenzo in your DVC repo, are you tracking directories with dvc add (with one .dvc file per directory), or are you tracking a large number of files individually (with .dvc files for each individual tracked file)?
@pmrowla I work with @Lhenzo. It might help to know that with dvc==2.10.2 we don't see the slowness. We noticed it when we tried dvc==2.28.0 (it may already be present in earlier versions).
Do you know why it is slower?
I don't understand the 2 different steps in the logs shown in the description:
2022-09-30 15:23:10,970 DEBUG: Querying 3 oids via object_exists
...
2022-09-30 15:23:11,219 DEBUG: Querying '11' oids via traverse
I don't understand the 2 different steps in the logs shown in the description:
This is an internal optimization. DVC does not check all at once whether every object it may need to push exists in the remote. For tracked directories, we first check if the directory objects exist, and if so we skip checking the individual files contained in that directory in the follow-up step.
The two steps here would indicate that DVC is first checking if 3 directory objects exist, and then checks for an additional 11 single files.
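The two-step check described above could be sketched roughly as follows. This is a simplified model under stated assumptions: the function name, the dictionary layout, and the hashes are hypothetical, and it is not DVC's actual implementation.

```python
# Hypothetical sketch: names and data layout are assumptions made for
# illustration, not DVC's internal API.

def files_still_to_check(dirs: dict, single_files: set, remote: set) -> set:
    """Return the file hashes that need an existence check in step 2.

    dirs maps each directory object hash to the hashes of the files inside it.
    """
    # Step 1: query only the directory objects.
    missing_dirs = {d for d in dirs if d not in remote}

    # Step 2: files inside directories that already exist on the remote are
    # skipped. Files of missing directories and individually tracked files
    # still have to be queried.
    pending = set(single_files)
    for d in missing_dirs:
        pending.update(dirs[d])
    return pending


dirs = {"dir1": {"f1", "f2"}, "dir2": {"f3"}, "dir3": {"f4", "f5"}}
single_files = {"s1"}
remote = {"dir1", "dir2", "f1", "f2", "f3"}

pending = files_still_to_check(dirs, single_files, remote)
print(sorted(pending))
```

Under this model, data tracked as a few directories needs very few queries once the directory objects are on the remote, while files tracked individually each need their own check on every push.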
Do you know why is it slower?
It sounds like a regression was probably introduced between those two versions in the Google Cloud library used by DVC. It will require some more investigation to find the exact cause.
Ok very clear
Thanks @pmrowla! I think tracking directories will solve this problem. Although tracking each file individually was not slow before the regression in the Google Cloud library, it is not good practice.
Therefore you may close the issue
@pmrowla Can we still investigate the regression or add a benchmark for it?
Looks like we can't investigate further or add benchmarks for this. Closing as stale.
Bug Report
dvc push: pushing all the cache, even without modified files
Description
Context
When using DVC, tracking is powerful when data has been added, removed, or changed. The command dvc diff returns what has changed and is a useful feature. Unfortunately, it seems that it is not used when pushing to the remote bucket: instead of pushing only the files in the diff, dvc push tries to check/push all the cache at once. This can be a pain when large data is tracked, taking several minutes instead of seconds.
It's problematic if you use hooks between git and dvc, because every time you git push it will dvc push all the files again.
Reproduce
1. dvc add 2 recently created files
2. dvc commit
3. time dvc push
2 files were pushed in 0m9.024s, and I saw Querying cache in '..' | and all my files being pushed.
4. time dvc push again
Same output: all the cache seems to be pushed again. For some of our repos this can take a very long time.
Expected
dvc push only pushes what dvc diff origin/main returns, or the changes since the last commit as reported by dvc diff HEAD~1
Environment information
Output of dvc doctor:
Additional Information (if any):
Pushing with verbosity
Profiling file
profiling