iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

`dvc push` / `import-url --to-remote` to remote ssh works very, very slow. #8456

Closed beiroot closed 1 year ago

beiroot commented 1 year ago

Bug Report

I think I did enough research to report this bug. I first tried to find info online and in support threads, with no luck. I talked with Gao on Discord and he decided I should post this as a bug here.

Description

dvc push and dvc add / import-url --to-remote work really slowly: a few kilobytes in 5 minutes. SCP and SFTP to that same server work fast.

Reproduce

touch test.txt
echo "AAAA" > test.txt
dvc add test.txt
dvc push

or

dvc import-url --to-remote https://example.com/some.jpg

The transfer takes a very long time. However, every once in a while (at no predictable interval) it works fast.

Expected

Doing the same operation while the remote is on a local drive works blazingly fast.

Environment information

Topology of the network: MacOS --- VPN --- (over ssh) --- ML server (local folder, shared directory)

However, this process was reproduced in various environments:

And it was always very, very slow.

Output of config files:

.dvc/config     remote.remote-ssh.url=ssh://ip/path/to/remote
.dvc/config     core.remote=remote-ssh
.dvc/config.local       remote.remote-ssh.user=login
.dvc/config.local       remote.remote-ssh.password=password
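
For anyone trying to reproduce this setup, a configuration like the one listed above can be created with the dvc remote commands. This is only a sketch: ip, /path/to/remote, login and password are the redacted placeholders from the report, not real values, and the commands assume they are run inside an already initialized DVC repository.

```shell
# Placeholders (ip, path, login, password) are taken from the redacted report.
# Add the SSH remote and make it the default (written to .dvc/config).
dvc remote add -d remote-ssh ssh://ip/path/to/remote

# Keep credentials out of Git by storing them in .dvc/config.local.
dvc remote modify --local remote-ssh user login
dvc remote modify --local remote-ssh password password

# Show the effective settings together with the file each one comes from.
dvc config --list --show-origin
```

The --local flag is what puts the user and password into .dvc/config.local rather than the shared .dvc/config.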

Output of dvc doctor:

$ dvc doctor

DVC version: 2.30.0 (pip)
---------------------------------
Platform: Python 3.9.6 on macOS-12.6-arm64-arm-64bit
Subprojects:
        dvc_data = 0.17.1
        dvc_objects = 0.7.0
        dvc_render = 0.0.12
        dvc_task = 0.1.3
        dvclive = 0.11.0
        scmrepo = 0.1.1
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2022.6.0)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: ssh
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Additional Information (if any):

Here are the cProfile dumps from dvc push: slow dvc push, fast dvc push

If you want, I can send the files directly to the devs.

As you can see, there's clearly a performance issue. We can rule out file locking by cloud storage like OneDrive or Dropbox, IDEs like PyCharm, or backup apps like Time Machine: the bug was reproduced outside of such folders.

We can exclude the threading problem - dvc push -j 1 works slow too.

beiroot commented 1 year ago

We can exclude the threading problem - dvc push -j 1 works slow too.

AFAIK from the docs, the dvc push -j 1 is ignored for remotes, isn't it? So maybe this is the key to solving this mystery?

beiroot commented 1 year ago

@SanderNugteren, @gcoter, I've seen you also had problems with SSH / SFTP remotes. Have you guys managed to get past them?

beiroot commented 1 year ago

Two additional comments that might help:

  1. Even if the remote is empty, so it's a first push (using SSH), it works very, very slowly.
  2. I've tested on the same infrastructure, but using an HTTP endpoint:

MacOS --- VPN --- (over HTTP) --- ML server (local folder, shared directory)

and it worked like a charm, but only for small files. Bigger files (like 20MB) get a Timeout on reading data from socket. I can see there was an issue about this on GitHub and it should be fixed in dvc-objects==0.1.7; however, I have dvc_objects = 0.7.0 and the problem still occurs.

efiop commented 1 year ago

@michuhu One more thing, could you upgrade to the latest dvc version pip install -U "dvc[ssh]", check that dvc push is still very slow and then run

dvc push --yappi --yappi-separate-threads

, which will produce a bunch of callgrind.dvc* files in the current directory. Each file represents a separate thread. This will help us look deeper into what's going on in the transfer/status threads, which the cProfile results you've provided show are taking a while. You can analyze them yourself with kcachegrind/qcachegrind, but please also share them with us (feel free to send them privately, but they are pretty harmless).

beiroot commented 1 year ago

Ok, so now it goes flawlessly. I think. I need more tests, but it generally works. And I'm pretty sure the problem is with lock on the files caused discrepancy between cache and remote. If I deleted the folder on the remote and pushed the same files again, I reproduced the problem. Could this be the only cause?! What is the philosophy behind cache and remote? So if I push the same files, to two different remotes just by changing the config file, am I being a bad boy? This is odd, since I'm pretty sure the problem with ssh occurred even with a clean dvc install. But after all this messing around, I might be wrong... I'll check that too. Anyway, I'm sending the correct and wrong callgrind files.

callgrind-correct.zip callgrind-wrong.zip

efiop commented 1 year ago

@michuhu Thanks for the research!

Ok, so now it goes flawlessly. I think. I need more tests, but it generally works. And I'm pretty sure the problem is with lock on the files caused discrepancy between cache and remote. If I deleted the folder on the remote and pushed the same files again, I reproduced the problem. Could this be the only cause?!

So it stopped being dead slow? By deleting the folder, you mean the whole remote location or just subdirs or something?

What is the philosophy behind cache and remote? So if I push the same files, to two different remotes just by changing the config file, am I being a bad boy?

That's perfectly fine to do. Both cache and remote are object storages, with the only difference that cache is assumed to be local to your workspace and is getting actively used during most operations (e.g. add/checkout/etc), while remote is (roughly) assumed to only be used during push/pull/fetch/etc but with multiple users.
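
To make the "both are object storages" point concrete, here is a minimal shell sketch of a content-addressed store. It is a simplified illustration, not DVC's actual code: the function names (content_hash, store_add, store_push) and the directory names are made up for this example, and the layout only mimics the two-character prefix directories that, as far as I know, the DVC 2.x cache uses for md5 hashes.

```shell
#!/bin/sh
# Simplified content-addressed store. Illustration only, NOT DVC's code.
set -eu

# Hash a file's content (md5sum on Linux, md5 -q on macOS).
content_hash() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | cut -d ' ' -f 1
    else
        md5 -q "$1"
    fi
}

# Copy a file into a store as <root>/<first 2 hex chars>/<remaining chars>.
store_add() {
    root=$1 file=$2
    hash=$(content_hash "$file")
    dir=$root/$(printf '%s' "$hash" | cut -c 1-2)
    mkdir -p "$dir"
    cp "$file" "$dir/$(printf '%s' "$hash" | cut -c 3-)"
}

# Copy every object missing from the destination; print how many were sent.
store_push() {
    src=$1 dst=$2 sent=0
    for obj in "$src"/*/*; do
        [ -f "$obj" ] || continue
        rel=${obj#"$src"/}
        if [ ! -e "$dst/$rel" ]; then
            mkdir -p "$dst/${rel%/*}"
            cp "$obj" "$dst/$rel"
            sent=$((sent + 1))
        fi
    done
    echo "$sent"
}

work=$(mktemp -d)
echo "AAAA" > "$work/test.txt"
store_add "$work/cache" "$work/test.txt"

first=$(store_push "$work/cache" "$work/remote-a")    # new object is copied
second=$(store_push "$work/cache" "$work/remote-b")   # a second remote is fine
again=$(store_push "$work/cache" "$work/remote-a")    # nothing left to send
echo "sent: $first, $second, $again"                  # prints "sent: 1, 1, 0"
```

Because objects are named by their content hash, pushing the same cache to two different stores is harmless, and pushing again to a store that already has the object sends nothing.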

beiroot commented 1 year ago

Ok, great news! This issue fixed the problem. Now, only the initial push via dvc[ssh] (the one actually making the repo folders) is slow. Everything else works fast.

Many thanks to @efiop, @pared and all DVC team!

efiop commented 1 year ago

@michuhu We made initial dir creation lazy too, so new dvc versions (already released) should be faster for you.

Closing since this seems resolved.
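
As a rough picture of what "lazy" directory creation means here, the sketch below contrasts creating every prefix directory up front with creating one only when an object lands in it. The thread does not show what DVC did before the change, so the eager variant is only a guess at what non-lazy creation could look like; init_eager, put_lazy, count_dirs and the hash value are hypothetical names and data for this example.

```shell
#!/bin/sh
# Illustration only: eager vs. lazy creation of prefix directories.
set -eu

# Eager (assumed): create all 256 two-hex-character prefix directories.
init_eager() {
    i=0
    while [ "$i" -lt 256 ]; do
        mkdir -p "$1/$(printf '%02x' "$i")"
        i=$((i + 1))
    done
}

# Lazy: create a prefix directory only when an object is stored in it.
put_lazy() {
    root=$1 hash=$2 file=$3
    dir=$root/$(printf '%s' "$hash" | cut -c 1-2)
    mkdir -p "$dir"
    cp "$file" "$dir/$(printf '%s' "$hash" | cut -c 3-)"
}

# Count the entries directly under a store root.
count_dirs() {
    ls -1 "$1" | wc -l | tr -d ' '
}

work=$(mktemp -d)
echo "AAAA" > "$work/test.txt"

init_eager "$work/eager"
# The hash below is a made-up placeholder, not a real digest.
put_lazy "$work/lazy" "0123456789abcdef0123456789abcdef" "$work/test.txt"

eager_dirs=$(count_dirs "$work/eager")
lazy_dirs=$(count_dirs "$work/lazy")
echo "eager: $eager_dirs directories, lazy: $lazy_dirs directory"
# prints "eager: 256 directories, lazy: 1 directory"
```

On a local disk the difference is negligible, but over SSH each directory creation is a separate remote operation, which would fit the report that only the very first push stayed slow.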