iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

DVC push hangs for a long time due to run-cache = True #10449

Open michaelromagne opened 3 weeks ago

michaelromagne commented 3 weeks ago

Bug Report

Issue name

push: hangs forever since 3.51.1: https://github.com/iterative/dvc/pull/10433

Description

The issue happens on Ubuntu. It goes away when I run dvc push with the run_cache option set to False, or when I use dvc 3.50.2. The remote is S3. We often run heavy DVC pipelines, which leads to many small files in the run cache. However, even after cleaning the cache, the issue persists.

Do you have any idea why? Sorry for not providing more context. Thanks a lot.

dberenbaum commented 3 weeks ago

I'm not sure why it's slow to push the run-cache. Does it eventually push the run-cache successfully? Do you have a need for it or is it fine for you to push without the run-cache?

michaelromagne commented 3 weeks ago

Our run-cache is pretty small. The run-cache never finishes pushing; it hangs at the namespace step. It's fine for me not to push the run-cache, so I will set it to False :smile:

Screenshot from 2024-06-07 08-44-40

Screenshot from 2024-06-07 08-45-01

Wirg commented 1 week ago

Hi @dberenbaum,

I dug a little bit into this.

We are using S3 with S3 versioning (https://dvc.org/doc/user-guide/data-management/cloud-versioning). We have approximately 1000 run keys that are looked up in StageCache.transfer (https://github.com/iterative/dvc/blob/main/dvc/stage/cache.py#L264).

It turns out that what takes most of the time is to_fs.fs.find. Apparently, we are stuck in S3FileSystem.find (https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L896), which runs into a lot of locks due to the sync.

Here is the detailed profile graph: profile_dvc_push
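
For readers who want to collect a similar profile, here is a minimal sketch using Python's standard-library cProfile and pstats. It does not call DVC or s3fs: `fake_find`, `transfer`, and `profile_transfer` are hypothetical stand-ins, and the per-key sleep is only an assumption meant to mimic one blocking round trip per run key.

```python
import cProfile
import io
import pstats
import time


def fake_find(key: str) -> list[str]:
    """Hypothetical stand-in for a remote listing call such as to_fs.fs.find."""
    time.sleep(0.001)  # assumption: one blocking round trip per key
    return [f"{key}/output"]


def transfer(keys: list[str]) -> int:
    """Look up every run key one at a time and count the entries found."""
    return sum(len(fake_find(key)) for key in keys)


def profile_transfer(keys: list[str]) -> str:
    """Run transfer() under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    transfer(keys)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


# 100 made-up run keys; the thread mentions roughly 1000 in the real repository.
report = profile_transfer([f"run-{i:04d}" for i in range(100)])
print(report)
```

In the report, the function that dominates cumulative time is the one to look at first. With a real repository the same wrapper could be placed around the actual push call instead of the stand-in.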

It's unclear to me whether this is an issue with the way S3FileSystem is used in dvc, a bug in s3fs, or expected behavior. Do you have any clue, @dberenbaum?

dberenbaum commented 1 week ago

Unfortunately, we are dropping support for cloud-versioned remotes due to a lot of small inconsistencies like this that have made it beyond what we have the capability to maintain.

@michaelromagne Are you also using a cloud-versioned remote?

michaelromagne commented 1 week ago

Yes, S3. I actually work with @Wirg.

shcheklein commented 1 week ago

@dberenbaum q - is it actually related to remotes being versioned, or does it also affect regular ones? (it's not obvious, just to clarify)

dberenbaum commented 15 hours ago

@michaelromagne @Wirg Can you try installing from https://github.com/iterative/dvc/pull/10472 (pip install git+ssh://git@github.com:iterative/dvc.git@run-cache-push-force)? Is it any faster?