iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

gs: potential cloud status regression between 2.10.2 and 2.28.0 #8384

Closed Lhenzo closed 1 year ago

Lhenzo commented 2 years ago

Bug Report

dvc push: pushing all the cache, even without modified files

Description

Context

When using DVC, tracking is powerful whenever data has been added, removed, or changed. The command dvc diff returns what has changed, which is a useful feature. Unfortunately, dvc push does not seem to use it when pushing to the remote bucket: instead of pushing only the files in the diff, it tries to check/push the whole cache at once. This is painful when large datasets are tracked, because a push takes several minutes instead of seconds.

This is problematic if you use hooks between git and dvc, because every time you git push, it will dvc push all the files again.

Reproduce

  1. dvc add 2 recently created files
  2. dvc commit
  3. time dvc push

2 files were pushed. It took 0m9.024s, and I saw Querying cache in '..' | while all my files appeared to be pushed.

  4. time dvc push again

Same output: the whole cache seems to be pushed again. For some of our repos this could take a very long time.

Expected

dvc push only pushes what dvc diff origin/main returns, or what changed since the last commit according to dvc diff HEAD~1.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.28.0 (pip)
---------------------------------
Platform: Python 3.10.6 on Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 0.13.0
        dvc_objects = 0.5.0
        dvc_render = 0.0.11
        dvc_task = 0.1.2
        dvclive = 0.11.0
        scmrepo = 0.1.1
Supports:
        gs (gcsfs = 2022.7.1),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: gs
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git

Additional Information (if any):

Pushing with verbosity

dvc push -v
2022-09-30 15:23:10,969 DEBUG: Preparing to transfer data from '/home/lorenzofurlan/repos/swirl-data/.dvc/cache' to 'descartes-swirl-dvc'
2022-09-30 15:23:10,969 DEBUG: Preparing to collect status from 'descartes-swirl-dvc'
2022-09-30 15:23:10,969 DEBUG: Collecting status from 'descartes-swirl-dvc'
2022-09-30 15:23:10,970 DEBUG: Querying 3 oids via object_exists
2022-09-30 15:23:11,175 DEBUG: Querying 0 oids via object_exists
2022-09-30 15:23:11,219 DEBUG: Estimated remote size: 4096 files
2022-09-30 15:23:11,219 DEBUG: Querying '11' oids via traverse
Everything is up to date.
2022-09-30 15:23:19,932 DEBUG: Analytics is enabled.
2022-09-30 15:23:19,957 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpv8_66g8l']'
2022-09-30 15:23:19,958 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpv8_66g8l']'

Profiling file

profiling

pmrowla commented 2 years ago

The push reports Everything is up to date., so it is not actually pushing any data to the remote. It looks like some problem is causing the check for which files exist in the remote to take longer than expected. I'll take a look at the profiling info when I get a chance.

dvc push does not work the same way as dvc diff because they are two different comparisons. dvc diff is not aware of what data has or has not been pushed to the remote. dvc diff just compares the contents of the repo's .dvc files and reports what changed.

It is possible that only a small number of changes occurred between the two commits, but dvc push still has to verify whether or not all of the data from the older commit was actually pushed into the remote.
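The difference between the two comparisons can be sketched with plain sets. This is only a toy model, not DVC's implementation: the function names `diff_between_commits` and `objects_to_push` and the sample hashes are hypothetical and chosen for illustration.

```python
def diff_between_commits(old_commit: set[str], new_commit: set[str]) -> set[str]:
    """Toy model of `dvc diff`: compare the hashes recorded in two commits."""
    return new_commit - old_commit


def objects_to_push(needed: set[str], remote: set[str]) -> set[str]:
    """Toy model of `dvc push`: compare everything a commit needs with the remote."""
    return needed - remote


# Hypothetical hashes, for illustration only.
old_commit = {"aaa", "bbb"}
new_commit = {"aaa", "bbb", "ccc"}
remote = {"aaa"}  # "bbb" was committed earlier but never pushed

added = diff_between_commits(old_commit, new_commit)
missing = objects_to_push(new_commit, remote)
print(sorted(added))    # only the object that changed between the commits
print(sorted(missing))  # also includes the older object that never reached the remote
```

In this model the diff contains one object, yet the push still has to look at every object the commit needs, because the older object may never have reached the remote.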

pmrowla commented 2 years ago

@Lhenzo in your DVC repo, are you tracking directories with dvc add (with one .dvc file per directory), or are you tracking a large number of files individually (with .dvc files for each individual tracked file)?

mdeboc commented 2 years ago

@pmrowla I work with @Lhenzo. It might help to know that with dvc==2.10.2 we don't see the slowness. We noticed it when we tried dvc==2.28.0 (it may also be present in earlier versions).

Do you know why it is slower?

mdeboc commented 2 years ago

I don't understand the 2 different steps in the logs shown in the description:

2022-09-30 15:23:10,970 DEBUG: Querying 3 oids via object_exists
...
2022-09-30 15:23:11,219 DEBUG: Querying '11' oids via traverse

pmrowla commented 2 years ago

> I don't understand the 2 different steps in the logs shown in the description:

This is an internal optimization. DVC does not check all at once whether every object it may need to push exists in the remote. For tracked directories, we first check whether the directory objects exist, and if so we skip checking the individual files contained in those directories in the follow-up step.

The two steps here would indicate that DVC is first checking if 3 directory objects exist, and then checks for an additional 11 single files.
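A minimal sketch of that two-step check, assuming the remote can be modeled as a set of object IDs. The function `collect_missing`, its arguments, and the sample object IDs are hypothetical and are not DVC's internal API.

```python
def collect_missing(
    dirs: dict[str, list[str]], files: list[str], remote: set[str]
) -> tuple[set[str], int]:
    """Return (missing object IDs, number of existence queries made)."""
    queries = 0
    missing: set[str] = set()
    to_check = list(files)  # individually tracked files are always checked

    # Step 1: query only the directory objects.
    for dir_oid, members in dirs.items():
        queries += 1
        if dir_oid in remote:
            continue  # directory object is present: skip its member files
        missing.add(dir_oid)
        to_check.extend(members)

    # Step 2: query the remaining individual files.
    for oid in to_check:
        queries += 1
        if oid not in remote:
            missing.add(oid)
    return missing, queries


# Hypothetical object IDs, for illustration only.
dirs = {"d1.dir": ["f1", "f2"], "d2.dir": ["f3"]}
files = ["f4"]
remote = {"d1.dir", "f1", "f2", "f3"}

missing, queries = collect_missing(dirs, files, remote)
print(sorted(missing), queries)
```

In this model, every individually tracked file costs one query in the second step, while a directory whose object already exists in the remote costs a single query no matter how many files it contains.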

> Do you know why it is slower?

It sounds like a regression was probably introduced between those two versions in the Google Cloud library used by DVC. Finding the exact cause will require some more investigation.

mdeboc commented 2 years ago

Ok very clear

Thanks @pmrowla! I think tracking directories will solve this problem. Tracking each file individually was not slow before the regression in the Google Cloud library, but it is not good practice anyway.

Therefore you may close the issue

dberenbaum commented 2 years ago

@pmrowla Can we still investigate the regression or add a benchmark for it?

efiop commented 1 year ago

Looks like we can't investigate further or add benchmarks for this. Closing as stale.