Open michaelromagne opened 3 weeks ago
I'm not sure why it's slow to push the run-cache. Does it eventually push the run-cache successfully? Do you have a need for it or is it fine for you to push without the run-cache?
Our run-cache is pretty small. The run-cache never finishes pushing; it hangs at the namespace step. It's fine for me not to push the run-cache, so I will set it to False :smile:
Hi @dberenbaum ,
I dug a little bit into this.
We are using S3 with S3 versioning: https://dvc.org/doc/user-guide/data-management/cloud-versioning
We have approximately 1000 run keys that are looked up in StageCache.transfer: https://github.com/iterative/dvc/blob/main/dvc/stage/cache.py#L264
It turns out that most of the time is spent in to_fs.fs.find.
Apparently, we are stuck in S3FileSystem.find (https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L896), which runs into a lot of locks due to the sync.
Here is the detailed profile graph.
It's unclear to me if it's an issue with the way the S3FileSystem is used in dvc, if it's a bug in s3fs or if it's to be expected. Do you have any clue @dberenbaum ?
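For anyone who wants to collect a comparable profile, here is a minimal sketch using only Python's standard cProfile and pstats modules. The functions slow_find and transfer are hypothetical stand-ins, not DVC or s3fs code, and the one-listing-call-per-run-key loop is an assumption made for illustration.

```python
import cProfile
import io
import pstats
import time


def slow_find(prefix):
    """Hypothetical stand-in for a slow remote listing call."""
    time.sleep(0.01)
    return [f"{prefix}/{i}" for i in range(3)]


def transfer(keys):
    """Hypothetical stand-in: one listing call per run key (assumed)."""
    return [path for key in keys for path in slow_find(key)]


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call chain is listed first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_call(transfer, ["runs/aa", "runs/bb"])
print(report)
```

In a real investigation, the profiled call would be the actual push, and the cumulative-time column should show whether the listing call dominates.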
Unfortunately, we are dropping support for cloud-versioned remotes: the many small inconsistencies like this one have made them more than we have the capacity to maintain.
@michaelromagne Are you also using a cloud-versioned remote?
Yes S3, I actually work with @Wirg
@dberenbaum quick question: is it actually related to the remotes being versioned, or does it also affect regular ones? (It's not obvious, so just to clarify.)
@michaelromagne @Wirg Can you try installing from https://github.com/iterative/dvc/pull/10472 (pip install git+ssh://git@github.com:iterative/dvc.git@run-cache-push-force)? Is it any faster?
Bug Report
Issue name
push: hangs forever since 3.51.1 : https://github.com/iterative/dvc/pull/10433
Description
The issue happens on Ubuntu. It goes away when I run dvc push with the run_cache option set to False, or when I use dvc 3.50.2. The remote is S3. We often run heavy DVC pipelines, which leads to many small files in the run cache. However, even after cleaning the cache, the issue is still there.
Do you have any idea why? Sorry for not providing more context. Thanks a lot.