iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

ssh: support different backends (CLI ssh, paramiko, libssh, etc) #5395

Closed jrollins closed 3 years ago

jrollins commented 3 years ago

I've been trying to test dvc with ssh remotes, but have yet to get it to work. I keep getting paramiko exceptions that seem to indicate that it (or dvc's usage of it) is unable to handle a lot of common ssh use cases (proxy commands, certificate authorities, etc.). I think paramiko's ssh model is fundamentally flawed, since it seems to require the user to re-implement a lot of the connection/authentication logic that the ssh CLI handles. That puts a huge burden on its users, given how many use cases they'll have to figure out and support, and it will lead to a support nightmare for products that use it, since you'll never be able to rely on the ssh CLI as a means of debugging.

I strongly suggest you try to abstract the remote backend so that paramiko can be swapped out for direct calls to the ssh CLI. I think any backend abstraction will be time well spent, since it will probably also let you support other remotes more easily.

I suggest looking at what git-annex is doing, since they support SSH remotes out-of-the-box with no issues whatsoever (even with my peculiar ssh config). They appear to just be exec'ing the ssh client directly (you can see the ssh commands in their verbose logging output). They also support a wide variety of special remotes.
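To illustrate the "just exec the ssh client" approach described above, here is a minimal Python sketch. It is not taken from git-annex or dvc; the function names `build_ssh_command` and `run_over_ssh` and the host alias in the usage note are hypothetical. The idea is that the system `ssh` binary reads `~/.ssh/config` itself, so proxy commands and certificates are handled by ssh rather than re-implemented by the caller.

```python
import shlex
import subprocess
from typing import List, Sequence


def build_ssh_command(host: str, remote_argv: Sequence[str]) -> List[str]:
    """Build the argv for running a command on `host` via the ssh CLI.

    ssh hands the remote command to the remote shell as a single string,
    so the remote arguments are quoted with shlex.join (Python 3.8+).
    """
    # BatchMode=yes makes ssh fail instead of prompting interactively.
    return ["ssh", "-o", "BatchMode=yes", host, shlex.join(remote_argv)]


def run_over_ssh(host: str, remote_argv: Sequence[str]) -> str:
    """Run the command remotely and return its stdout (raises on failure)."""
    result = subprocess.run(
        build_ssh_command(host, remote_argv),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

With a host alias such as `my-remote` defined in `~/.ssh/config` (an assumed name), `run_over_ssh("my-remote", ["ls", "-l", "/data"])` would behave the same as typing the equivalent ssh command in a terminal, which is what makes the CLI usable for debugging.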

Good luck!

Discord context: https://discord.com/channels/485586884165107732/485596304961962003/806310490589757450

pmrowla commented 3 years ago

For reference, this is the underlying paramiko issue: https://github.com/paramiko/paramiko/issues/771

efiop commented 3 years ago

For the record: we've discussed before that we might support different backends, as we do with git (https://github.com/iterative/dvc/tree/master/dvc/scm/git/backend). Likely it will be a similar situation here too: CLI ssh, pure-Python paramiko, and libssh. Or maybe we could use CLI ssh for auth and then reuse the channel with another library, since auth is where the majority of the problems usually occur.
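A rough sketch of what such a pluggable-backend layer could look like, in Python. This is only an illustration under assumptions, not dvc's actual design: the class names `SSHBackend`, `CLIBackend`, `ParamikoBackend` and the helper `pick_backend` are hypothetical, and a real backend would also need methods for opening connections and transferring files.

```python
import importlib.util
import shutil
from abc import ABC, abstractmethod
from typing import Iterable


class SSHBackend(ABC):
    """Hypothetical common interface for interchangeable SSH backends."""

    name = "base"

    @abstractmethod
    def is_available(self) -> bool:
        """Return True if this backend can be used on this machine."""


class CLIBackend(SSHBackend):
    name = "cli"

    def is_available(self) -> bool:
        # Usable when an `ssh` executable is on PATH.
        return shutil.which("ssh") is not None


class ParamikoBackend(SSHBackend):
    name = "paramiko"

    def is_available(self) -> bool:
        # Usable when the paramiko package can be imported.
        return importlib.util.find_spec("paramiko") is not None


def pick_backend(candidates: Iterable[SSHBackend]) -> SSHBackend:
    """Return the first available backend, in the caller's preference order."""
    for backend in candidates:
        if backend.is_available():
            return backend
    raise RuntimeError("no usable SSH backend found")
```

Calling `pick_backend([CLIBackend(), ParamikoBackend()])` would prefer the ssh CLI and fall back to paramiko when no `ssh` binary is found; the ordering is a design choice the caller controls.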

A very long time ago we used the ssh CLI in dvc, but switched to paramiko because it was much easier to use programmatically and it works just fine in the majority of (simple) cases.

efiop commented 3 years ago

Maybe we could use https://github.com/ParallelSSH/parallel-ssh instead of raw libssh

isidentical commented 3 years ago

We've now migrated to asyncssh, so please test again on master and let us know if you see any missing features!