gcoter closed this issue 1 year ago.
@gcoter Maybe you could try to reproduce it with a docker image with the same config options? So far we are not able to reproduce ourselves.
@efiop Yes it is a good idea, actually I deployed the SFTP server as a docker container on my Raspberry Pi. However, the image I used has been built for ARM. I will try to reproduce the error locally on my computer with the original docker image.
I created a quick SSH server in a Docker container, mounted my DVC cache into it, and set it as the remote for my local project. Running `dvc pull` didn't result in an error, but it did open over 1000 connections (according to `netstat -tn`). Is that expected?
Also potentially relevant: when I get this error, it is always during the step of querying the remote cache. If I retry enough times and get past that step, then the actual up/downloading always succeeds.
> I created a quick SSH server in a Docker container, mounted my DVC cache into it, and set it as the remote for my local project. Running `dvc pull` didn't result in an error, but it did open over 1000 connections (according to `netstat -tn`). Is that expected?
Did you set any `--jobs`-related config, and how many cores does your computer have?
> Did you set any `--jobs`-related config, and how many cores does your computer have?
No, I didn't use the `--jobs` flag. 8 cores, 16 threads.
Looks like we need to add some limitations on the `status` query.
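A limitation like that could look roughly like the following sketch: capping the number of concurrent existence checks with a small worker pool instead of firing one connection per hash. The function name and signature here are illustrative, not DVC's actual API:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def list_hashes_exists(hashes, exists, jobs=4):
    """Check which hashes exist on the remote, but never run more than
    `jobs` checks at once (a sketch, not DVC's real implementation).

    `exists` is any callable taking a hash and returning a bool; in DVC
    it would be a per-hash remote filesystem lookup.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() preserves input order, so compress() pairs each hash
        # with its own existence result.
        in_remote = pool.map(exists, hashes)
        return list(itertools.compress(hashes, in_remote))

# Example with a fake remote that only "has" two of the three hashes:
remote = {"aaa", "ccc"}
print(list_hashes_exists(["aaa", "bbb", "ccc"], remote.__contains__, jobs=2))
# -> ['aaa', 'ccc']
```

With `jobs` bounded this way, the remote never sees more than a handful of simultaneous SFTP channels during the status query.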
Hi all, sorry for my late response! As I was about to try to reproduce this issue locally (as proposed by @efiop), I upgraded DVC to the latest version (2.9.2) and now it works :slightly_smiling_face:
@sjawhar Maybe you can try it and confirm whether it solves the issue for you as well?
Unfortunately still an issue on 2.9.3
$ dvc pull --verbose --recursive pipelines/finger_tapping/
2021-12-29 21:15:54,657 DEBUG: Adding '/home/user/app/.dvc/config.local' to gitignore file.
2021-12-29 21:15:54,679 DEBUG: Adding '/home/user/app/.dvc/tmp' to gitignore file.
2021-12-29 21:15:54,679 DEBUG: Adding '/home/user/app/.dvc/cache' to gitignore file.
2021-12-29 21:15:54,687 DEBUG: Checking if stage 'pipelines/finger_tapping/' is in 'dvc.yaml'
2021-12-29 21:15:55,608 DEBUG: Preparing to transfer data from '/usr/data/project/dvc' to '/home/user/app/.dvc/cache'
2021-12-29 21:15:55,608 DEBUG: Preparing to collect status from '/home/user/app/.dvc/cache'
2021-12-29 21:15:55,619 DEBUG: Collecting status from '/home/user/app/.dvc/cache'
2021-12-29 21:15:55,735 DEBUG: Preparing to collect status from '/usr/data/project/dvc'
2021-12-29 21:15:55,740 DEBUG: Collecting status from '/usr/data/project/dvc'
2021-12-29 21:15:55,856 DEBUG: Querying 126 hashes via object_exists
2021-12-29 21:16:14,510 ERROR: unexpected error - Can't create any SFTP connections!
------------------------------------------------------------
Traceback (most recent call last):
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/main.py", line 55, in main
ret = cmd.do_run()
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/command/base.py", line 45, in do_run
return self.run()
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/command/data_sync.py", line 30, in run
stats = self.repo.pull(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/pull.py", line 29, in pull
processed_files_count = self.fetch(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/fetch.py", line 67, in fetch
d, f = _fetch(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/fetch.py", line 87, in _fetch
downloaded += repo.cloud.pull(obj_ids, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/data_cloud.py", line 114, in pull
return transfer(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/transfer.py", line 153, in transfer
status = compare_status(src, dest, obj_ids, check_deleted=False, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/status.py", line 164, in compare_status
src_exists, src_missing = status(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/status.py", line 122, in status
exists = hashes.intersection(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/status.py", line 36, in _indexed_dir_hashes
indexed_dir_exists.update(odb.list_hashes_exists(indexed_dirs))
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/db/base.py", line 421, in list_hashes_exists
ret = list(itertools.compress(hashes, in_remote))
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
yield fs.pop().result()
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/db/base.py", line 412, in exists_with_progress
ret = self.fs.exists(fs_path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/fsspec_wrapper.py", line 91, in exists
return self.fs.exists(path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 91, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
raise return_result
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 549, in _exists
await self._info(path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/sshfs/utils.py", line 27, in wrapper
return await func(*args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/sshfs/spec.py", line 135, in _info
async with self._pool.get() as channel:
File "/usr/local/lib/python3.8/contextlib.py", line 171, in __aenter__
return await self.gen.__anext__()
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/sshfs/pools/soft.py", line 38, in get
raise ValueError("Can't create any SFTP connections!")
ValueError: Can't create any SFTP connections!
------------------------------------------------------------
2021-12-29 21:16:14,984 DEBUG: Adding '/home/user/app/.dvc/config.local' to gitignore file.
2021-12-29 21:16:14,990 DEBUG: Adding '/home/user/app/.dvc/tmp' to gitignore file.
2021-12-29 21:16:14,990 DEBUG: Adding '/home/user/app/.dvc/cache' to gitignore file.
2021-12-29 21:16:14,991 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 18] Invalid cross-device link
------------------------------------------------------------
Traceback (most recent call last):
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/main.py", line 55, in main
ret = cmd.do_run()
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/command/base.py", line 45, in do_run
return self.run()
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/command/data_sync.py", line 30, in run
stats = self.repo.pull(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/pull.py", line 29, in pull
processed_files_count = self.fetch(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/fetch.py", line 67, in fetch
d, f = _fetch(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/fetch.py", line 87, in _fetch
downloaded += repo.cloud.pull(obj_ids, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/data_cloud.py", line 114, in pull
return transfer(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/transfer.py", line 153, in transfer
status = compare_status(src, dest, obj_ids, check_deleted=False, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/status.py", line 164, in compare_status
src_exists, src_missing = status(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/status.py", line 122, in status
exists = hashes.intersection(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/status.py", line 36, in _indexed_dir_hashes
indexed_dir_exists.update(odb.list_hashes_exists(indexed_dirs))
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/db/base.py", line 421, in list_hashes_exists
ret = list(itertools.compress(hashes, in_remote))
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
yield fs.pop().result()
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/db/base.py", line 412, in exists_with_progress
ret = self.fs.exists(fs_path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/fsspec_wrapper.py", line 91, in exists
return self.fs.exists(path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 91, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
raise return_result
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 549, in _exists
await self._info(path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/sshfs/utils.py", line 27, in wrapper
return await func(*args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/sshfs/spec.py", line 135, in _info
async with self._pool.get() as channel:
File "/usr/local/lib/python3.8/contextlib.py", line 171, in __aenter__
return await self.gen.__anext__()
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/sshfs/pools/soft.py", line 38, in get
raise ValueError("Can't create any SFTP connections!")
ValueError: Can't create any SFTP connections!
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
func(from_path, to_path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/local.py", line 148, in reflink
System.reflink(from_info, to_info)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/system.py", line 112, in reflink
System._reflink_linux(source, link_name)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/system.py", line 96, in _reflink_linux
fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
OSError: [Errno 18] Invalid cross-device link
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
return _link(link, from_fs, from_path, to_fs, to_path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
raise OSError(
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
_try_links([link], from_fs, from_file, to_fs, to_file)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
raise OSError(
OSError: [Errno 95] no more link types left to try out
------------------------------------------------------------
2021-12-29 21:16:14,992 DEBUG: Removing '/home/user/.VxePWiE728u2v5gnSQT3vY.tmp'
2021-12-29 21:16:14,992 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'hardlink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 18] Invalid cross-device link: '/home/user/app/.dvc/cache/.RhJXijpmS46m4MKMHUFkXk.tmp' -> '/home/user/.VxePWiE728u2v5gnSQT3vY.tmp'
------------------------------------------------------------
Traceback (most recent call last):
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/main.py", line 55, in main
ret = cmd.do_run()
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/command/base.py", line 45, in do_run
return self.run()
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/command/data_sync.py", line 30, in run
stats = self.repo.pull(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/pull.py", line 29, in pull
processed_files_count = self.fetch(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/__init__.py", line 49, in wrapper
return f(repo, *args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/fetch.py", line 67, in fetch
d, f = _fetch(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/repo/fetch.py", line 87, in _fetch
downloaded += repo.cloud.pull(obj_ids, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/data_cloud.py", line 114, in pull
return transfer(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/transfer.py", line 153, in transfer
status = compare_status(src, dest, obj_ids, check_deleted=False, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/status.py", line 164, in compare_status
src_exists, src_missing = status(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/status.py", line 122, in status
exists = hashes.intersection(
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/status.py", line 36, in _indexed_dir_hashes
indexed_dir_exists.update(odb.list_hashes_exists(indexed_dirs))
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/db/base.py", line 421, in list_hashes_exists
ret = list(itertools.compress(hashes, in_remote))
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
yield fs.pop().result()
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/objects/db/base.py", line 412, in exists_with_progress
ret = self.fs.exists(fs_path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/fsspec_wrapper.py", line 91, in exists
return self.fs.exists(path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 91, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
raise return_result
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 549, in _exists
await self._info(path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/sshfs/utils.py", line 27, in wrapper
return await func(*args, **kwargs)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/sshfs/spec.py", line 135, in _info
async with self._pool.get() as channel:
File "/usr/local/lib/python3.8/contextlib.py", line 171, in __aenter__
return await self.gen.__anext__()
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/sshfs/pools/soft.py", line 38, in get
raise ValueError("Can't create any SFTP connections!")
ValueError: Can't create any SFTP connections!
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
func(from_path, to_path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/local.py", line 141, in hardlink
System.hardlink(from_info, to_info)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/system.py", line 39, in hardlink
os.link(src, link_name)
OSError: [Errno 18] Invalid cross-device link: '/home/user/app/.dvc/cache/.RhJXijpmS46m4MKMHUFkXk.tmp' -> '/home/user/.VxePWiE728u2v5gnSQT3vY.tmp'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
return _link(link, from_fs, from_path, to_fs, to_path)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
raise OSError(
OSError: [Errno 95] 'hardlink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
_try_links([link], from_fs, from_file, to_fs, to_file)
File "/home/user/.cache/pypoetry/virtualenvs/app-gcEbEm5J-py3.8/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
raise OSError(
OSError: [Errno 95] no more link types left to try out
------------------------------------------------------------
2021-12-29 21:16:14,993 DEBUG: Removing '/home/user/.VxePWiE728u2v5gnSQT3vY.tmp'
2021-12-29 21:16:14,993 DEBUG: Removing '/home/user/.VxePWiE728u2v5gnSQT3vY.tmp'
2021-12-29 21:16:14,993 DEBUG: Removing '/home/user/app/.dvc/cache/.RhJXijpmS46m4MKMHUFkXk.tmp'
2021-12-29 21:16:15,000 DEBUG: Version info for developers:
DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.8 on Linux-5.15.8-76051508-generic-x86_64-with-glibc2.2.5
Supports:
hdfs (fsspec = 2021.11.1, pyarrow = 4.0.1),
webhdfs (fsspec = 2021.11.1),
http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
ssh (sshfs = 2021.11.2)
Cache types: symlink
Cache directory: ext4 on /dev/mapper/data-root
Caches: local
Remotes: ssh, ssh
Workspace directory: ext4 on /dev/mapper/data-root
Repo: dvc, git
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2021-12-29 21:16:15,002 DEBUG: Analytics is enabled.
2021-12-29 21:16:15,085 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpiyk0nprz']'
2021-12-29 21:16:15,088 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpiyk0nprz']'
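For context on what the `ValueError` at the bottom of the trace means: sshfs keeps a "soft" pool of SFTP channels, and the error fires when the pool holds no idle channel and every attempt to open a new one fails. A rough illustrative sketch of that behaviour (not sshfs's actual code):

```python
import asyncio
from contextlib import asynccontextmanager

class SoftChannelPool:
    """Illustrative 'soft' pool: channels are created lazily, and if the
    server refuses every channel-open attempt while none are idle, the
    pool gives up with the error seen in the traceback above."""

    def __init__(self, create_channel, max_channels=10):
        self._create = create_channel   # coroutine that may raise OSError
        self._free = []                 # idle channels
        self._total = 0
        self._max = max_channels

    @asynccontextmanager
    async def get(self):
        if not self._free and self._total < self._max:
            try:
                self._free.append(await self._create())
                self._total += 1
            except OSError:
                pass                    # server refused this channel open
        if not self._free:
            # Nothing idle and nothing creatable -> the DVC error
            raise ValueError("Can't create any SFTP connections!")
        channel = self._free.pop()
        try:
            yield channel
        finally:
            self._free.append(channel)

async def demo():
    async def refuse():                 # a server that refuses all opens
        raise OSError("channel open failed")
    pool = SoftChannelPool(refuse)
    try:
        async with pool.get():
            pass
    except ValueError as exc:
        return str(exc)

print(asyncio.run(demo()))  # -> Can't create any SFTP connections!
```

This is why the failure is intermittent: whether the server refuses channel opens depends on how many connections are already held open at that moment.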
@sjawhar Could you try `--jobs 1`? Also, how many .dvc files do you have in `pipelines/finger_tapping/`? Could you run `find pipelines/finger_tapping -type f -name '*.dvc' | wc -l`?
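If limiting jobs helps, the limit can also be set persistently on the remote rather than per-command, assuming your DVC version supports the per-remote `jobs` option. A sketch of what that could look like in `.dvc/config` (the remote name and URL are placeholders):

```
['remote "myremote"']
    url = ssh://example-server/path/to/remote
    jobs = 1
```

The same could be set from the CLI with `dvc remote modify myremote jobs 1`.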
I had the same error. Unfortunately, it is hard to replicate:

- `dvc push` in repository root - no error
- `dvc push` in repository root on - error
- `dvc push --recursive .` in each subdirectory (500 files/500 GB total; 30 files/100 GB total; 1-5 files/50 GB). Files had sizes up to 90 GB - no error
- `dvc push` in repository root after doing it in each subdirectory (see above) - no error anymore

I also have directories with tens of thousands of files in the repository. But they were pushed in the past and there was no error.
@gcoter @sjawhar Maybe you can try it with a much older version, e.g. 1.2.x. In the past, I did exactly the same steps that resulted in this error with the newest version. With 1.2.x it worked.
Hi @mistermult, thanks for your feedback! Indeed, using an older version worked for me as well, and I encountered this issue when using a more recent version.
But since I have upgraded DVC (https://github.com/iterative/dvc-ssh/issues/16), I don't have this issue anymore. Which version of DVC are you using? In my case, I think the issue disappeared after version 2.9.2.
But it is weird because upgrading did not work for @sjawhar.
> @sjawhar Could you try `--jobs 1`? Also, how many .dvc files do you have in `pipelines/finger_tapping/`? Could you run `find pipelines/finger_tapping -type f -name '*.dvc' | wc -l`?
I've tried with `--jobs 1`, still fails intermittently. Upgrading to 2.8.x helped a bit, it fails less often, but still intermittently. I can't yet upgrade to 2.9.x because that completely breaks on our infrastructure and I haven't had time to figure out why.
There are quite a few outputs in this repo, which might be why I have the issue. There are 18 or so .dvc files, each of which tracks a directory that contains several files. Then each of those 18 directories gets processed through 10 or so stages (`foreach`), each of which also outputs a directory. So, lots of files, lots of directories.
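For reference, a pipeline shaped like that (many tracked input directories, each flowing through `foreach` stages that emit directory outputs) might look roughly like this in `dvc.yaml`. Every name below is hypothetical:

```yaml
stages:
  preprocess:
    foreach:                 # one iteration per tracked input directory
      - finger_tapping
      - gait
    do:
      cmd: python preprocess.py data/${item} -o processed/${item}
      deps:
        - data/${item}       # each input tracked by its own .dvc file
      outs:
        - processed/${item}  # a directory output per task and stage
```

With ~18 inputs and ~10 such stages, the status query has to check hundreds of directory objects, which is consistent with the connection exhaustion described above.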
I'm still getting this error with 2.9.5.
Hi, I am having the same issue with "unexpected error - Can't create any SFTP connections!" when running dvc push/pull.
Would appreciate any help!
@ilankor reached out to us on Discord:
stack trace:
2022-03-07 16:39:45,515 DEBUG: Preparing to transfer data from 'ssh://server/' to '.dvc/cache'
2022-03-07 16:39:45,516 DEBUG: Preparing to collect status from '.dvc/cache'
2022-03-07 16:39:46,512 DEBUG: Collecting status from '.dvc/cache'
2022-03-07 16:40:01,213 DEBUG: Preparing to collect status from 'ssh://server/'
2022-03-07 16:40:02,125 DEBUG: Collecting status from 'ssh://server'
2022-03-07 16:40:02,126 DEBUG: Querying 128 hashes via object_exists
2022-03-07 16:40:05,930 ERROR: unexpected error - Can't create any SFTP connections!
------------------------------------------------------------
Traceback (most recent call last):
File "/home/eden/.local/lib/python3.6/site-packages/dvc/main.py", line 55, in main
ret = cmd.do_run()
File "/home/eden/.local/lib/python3.6/site-packages/dvc/command/base.py", line 45, in do_run
return self.run()
File "/home/eden/.local/lib/python3.6/site-packages/dvc/command/data_sync.py", line 41, in run
glob=self.args.glob,
File "/home/eden/.local/lib/python3.6/site-packages/dvc/repo/__init__.py", line 50, in wrapper
return f(repo, *args, **kwargs)
File "/home/eden/.local/lib/python3.6/site-packages/dvc/repo/pull.py", line 38, in pull
run_cache=run_cache,
File "/home/eden/.local/lib/python3.6/site-packages/dvc/repo/__init__.py", line 50, in wrapper
return f(repo, *args, **kwargs)
File "/home/eden/.local/lib/python3.6/site-packages/dvc/repo/fetch.py", line 72, in fetch
odb=odb,
File "/home/eden/.local/lib/python3.6/site-packages/dvc/repo/fetch.py", line 87, in _fetch
downloaded += repo.cloud.pull(obj_ids, **kwargs)
File "/home/eden/.local/lib/python3.6/site-packages/dvc/data_cloud.py", line 121, in pull
verify=odb.verify,
File "/home/eden/.local/lib/python3.6/site-packages/dvc/objects/transfer.py", line 153, in transfer
status = compare_status(src, dest, obj_ids, check_deleted=False, **kwargs)
File "/home/eden/.local/lib/python3.6/site-packages/dvc/objects/status.py", line 167, in compare_status
src, obj_ids, index=src_index, **kwargs
File "/home/eden/.local/lib/python3.6/site-packages/dvc/objects/status.py", line 123, in status
_indexed_dir_hashes(odb, index, dir_objs, name, cache_odb)
File "/home/eden/.local/lib/python3.6/site-packages/dvc/objects/status.py", line 48, in _indexed_dir_hashes
dir_exists.update(odb.list_hashes_exists(dir_hashes - dir_exists))
File "/home/eden/.local/lib/python3.6/site-packages/dvc/objects/db/base.py", line 415, in list_hashes_exists
ret = list(itertools.compress(hashes, in_remote))
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/eden/.local/lib/python3.6/site-packages/dvc/objects/db/base.py", line 406, in exists_with_progress
ret = self.fs.exists(path_info)
File "/home/eden/.local/lib/python3.6/site-packages/dvc/fs/fsspec_wrapper.py", line 136, in exists
return self.fs.exists(self._with_bucket(path_info))
File "/home/eden/.local/lib/python3.6/site-packages/fsspec/asyn.py", line 91, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/eden/.local/lib/python3.6/site-packages/fsspec/asyn.py", line 71, in sync
raise return_result
File "/home/eden/.local/lib/python3.6/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/eden/.local/lib/python3.6/site-packages/fsspec/asyn.py", line 555, in _exists
await self._info(path)
File "/home/eden/.local/lib/python3.6/site-packages/sshfs/utils.py", line 27, in wrapper
return await func(*args, **kwargs)
File "/home/eden/.local/lib/python3.6/site-packages/sshfs/spec.py", line 135, in _info
async with self._pool.get() as channel:
File "/home/eden/.local/lib/python3.6/site-packages/sshfs/compat.py", line 23, in __aenter__
return await self.gen.__anext__()
File "/home/eden/.local/lib/python3.6/site-packages/sshfs/pools/soft.py", line 38, in get
raise ValueError("Can't create any SFTP connections!")
ValueError: Can't create any SFTP connections!
------------------------------------------------------------
2022-03-07 16:40:06,057 DEBUG: Version info for developers:
DVC version: 2.8.1 (pip)
---------------------------------
Platform: Python 3.6.9 on Linux-4.15.0-169-generic-x86_64-with-Ubuntu-18.04-bionic
Supports:
webhdfs (fsspec = 2022.1.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
ssh (sshfs = 2021.11.2)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda1
Caches: local
Remotes: ssh
Workspace directory: ext4 on /dev/sda1
Repo: dvc, git
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-03-07 16:40:06,058 DEBUG: Analytics is enabled.
2022-03-07 16:40:06,083 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpjgmju351']'
2022-03-07 16:40:06,094 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpjgmju351']'
discussion: https://discord.com/channels/485586884165107732/485596304961962003/950780018999562250
@dtrifiro You might want to keep an eye on this one.
Thanks! The strange thing is, other users can run `dvc push`/`dvc pull` from their profiles on the same server.
I am having the same issue with DVC 2.41.1 (dvc-ssh 2.20.0) when I try to push two CSV files (each about 50 MB) to the SSH server. The server has OpenSSH_7.9p1, and I can transfer files with `scp` to the folder which I specified in `dvc remote add` ('/opt/textminer/text_miner_dvc' in my case). Here is the stack trace:
2023-01-11 10:33:38,297 DEBUG: Preparing to transfer data from '/home/usr/git/textminer_api/.dvc/cache' to '/opt/textminer/text_miner_dvc'
2023-01-11 10:33:38,298 DEBUG: Preparing to collect status from '/opt/textminer/text_miner_dvc'
2023-01-11 10:33:38,298 DEBUG: Collecting status from '/opt/textminer/text_miner_dvc'
2023-01-11 10:33:38,794 ERROR: unexpected error - Can't create any SFTP connections!
------------------------------------------------------------
Traceback (most recent call last):
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc/cli/__init__.py", line 185, in main
ret = cmd.do_run()
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc/cli/command.py", line 22, in do_run
return self.run()
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc/commands/data_sync.py", line 59, in run
processed_files_count = self.repo.push(
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc/repo/__init__.py", line 48, in wrapper
return f(repo, *args, **kwargs)
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc/repo/push.py", line 92, in push
result = self.cloud.push(
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc/data_cloud.py", line 143, in push
return self.transfer(
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc/data_cloud.py", line 124, in transfer
return transfer(src_odb, dest_odb, objs, **kwargs)
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc_data/hashfile/transfer.py", line 190, in transfer
status = compare_status(
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc_data/hashfile/status.py", line 179, in compare_status
dest_exists, dest_missing = status(
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc_data/hashfile/status.py", line 151, in status
odb.oids_exist(hashes, jobs=jobs, progress=pbar.callback)
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc_objects/db.py", line 367, in oids_exist
remote_size, remote_oids = self._estimate_remote_size(
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc_objects/db.py", line 244, in _estimate_remote_size
remote_oids = set(iter_with_pbar(oids))
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc_objects/db.py", line 234, in iter_with_pbar
for oid in oids:
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc_objects/db.py", line 200, in _oids_with_limit
for oid in self._list_oids(prefix):
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc_objects/db.py", line 190, in _list_oids
for path in self._list_paths(prefix):
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc_objects/db.py", line 174, in _list_paths
yield from self.fs.find(self.fs.path.join(*parts), prefix=bool(prefix))
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/dvc_objects/fs/base.py", line 366, in find
yield from self.fs.find(path)
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 113, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 98, in sync
raise return_result
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 53, in _runner
result[0] = await coro
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 735, in _find
async for _, dirs, files in self._walk(path, maxdepth, detail=True, **kwargs):
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 607, in _walk
listing = await self._ls(path, detail=True, **kwargs)
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/sshfs/utils.py", line 27, in wrapper
return await func(*args, **kwargs)
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/sshfs/spec.py", line 197, in _ls
async with self._pool.get() as channel:
File "/usr/lib/python3.8/contextlib.py", line 171, in __aenter__
return await self.gen.__anext__()
File "/home/usr/git/textminer_api/analysis_of_existing_model/.venv/lib/python3.8/site-packages/sshfs/pools/soft.py", line 38, in get
raise ValueError("Can't create any SFTP connections!")
ValueError: Can't create any SFTP connections!
------------------------------------------------------------
2023-01-11 10:33:38,985 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-01-11 10:33:38,985 DEBUG: Removing '/home/usr/git/.EXtcfySXBt3evPxN663izM.tmp'
2023-01-11 10:33:38,985 DEBUG: Removing '/home/usr/git/.EXtcfySXBt3evPxN663izM.tmp'
2023-01-11 10:33:38,985 DEBUG: Removing '/home/usr/git/.EXtcfySXBt3evPxN663izM.tmp'
2023-01-11 10:33:38,986 DEBUG: Removing '/home/usr/git/textminer_api/.dvc/cache/.GLXQSdg6vWyNowTkYfQpk2.tmp'
2023-01-11 10:33:38,989 DEBUG: Version info for developers:
DVC version: 2.41.1 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.15.0-56-generic-x86_64-with-glibc2.29
Subprojects:
dvc_data = 0.29.0
dvc_objects = 0.14.1
dvc_render = 0.0.17
dvc_task = 0.1.9
dvclive = 1.3.2
scmrepo = 0.1.5
Supports:
http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
ssh (sshfs = 2022.6.0)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: ssh
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Same here on dvc 2.43.1. Setting the log level to DEBUG2 for the ssh server gives lines like these:
debug1: channel 8: new [server-session]
debug2: session_new: allocate (allocated 8 max 10)
debug1: session_new: session 8
debug1: session_open: channel 8
debug1: session_open: session 8: link with channel 8
debug1: server_input_channel_open: confirm session
debug1: server_input_channel_open: ctype session rchan 9 win 2097152 max 32768
debug1: input_session_request
debug1: channel 9: new [server-session]
debug2: session_new: allocate (allocated 9 max 10)
debug1: session_new: session 9
debug1: session_open: channel 9
debug1: session_open: session 9: link with channel 9
debug1: server_input_channel_open: confirm session
debug1: server_input_channel_open: ctype session rchan 10 win 2097152 max 32768
debug1: input_session_request
debug2: channel: expanding 20
debug1: channel 10: new [server-session]
debug1: session_open: channel 10
error: no more sessions
debug1: session open failed, free channel 10
debug1: channel 10: free: server-session, nchannels 11
debug1: server_input_channel_open: failure session
A workaround is to increase the value of MaxSessions in sshd_config (default is 10).
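As a sketch of that workaround (assuming a stock OpenSSH server; the file path and the default value of 10 are the usual ones, but check your distribution):

```
# /etc/ssh/sshd_config (server side)
# Raise the per-connection session limit from the default of 10.
MaxSessions 30
```

Restart sshd (e.g. `systemctl restart sshd`) for the change to take effect.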
@Cnly Thanks for the research and detailed report! Looks like we might need to tweak the max_sessions option for sshfs, or maybe lower the default (currently 10) in https://github.com/fsspec/sshfs/blob/b912e88d4a81d15cc660f3cb2f3a52480306d277/sshfs/spec.py#L28, or switch to SFTPHardChannelPool by default. The latter seems to be the best option. Maybe you could try adjusting it (just pass pool_type=SFTPHardChannelPool in https://github.com/iterative/dvc-ssh/blob/a0233830e777ccd5a8d3e2e66edc5d828dace067/dvc_ssh/__init__.py#L116) and contribute a patch if it works for you?
@efiop Thanks for the quick response! Unfortunately pool_type=SFTPHardChannelPool doesn't seem to fix the issue.
...
File "xxx/venv/lib/python3.8/site-packages/dvc_objects/executors.py", line 134, in batch_coros
result = fut.result()
File "xxx/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 568, in _exists
await self._info(path)
File "xxx/venv/lib/python3.8/site-packages/sshfs/utils.py", line 27, in wrapper
return await func(*args, **kwargs)
File "xxx/venv/lib/python3.8/site-packages/sshfs/spec.py", line 125, in _info
async with self._pool.get() as channel:
File "xxx/.pyenv/versions/3.8.12/lib/python3.8/contextlib.py", line 171, in __aenter__
return await self.gen.__anext__()
File "xxx/venv/lib/python3.8/site-packages/sshfs/pools/hard.py", line 28, in get
raise ValueError("Can't create any SFTP connections!")
ValueError: Can't create any SFTP connections!
I also tried modifying _DEFAULT_MAX_SESSIONS in sshfs/spec.py, but it seems it still tried to create 10 sessions according to the ssh server logs.
Thanks, @Cnly! Looks like we'll need a bit more research here.
We were getting the same issues and rm -rf .dvc/tmp/* solved them. Disclaimer: sorry if it has some other side effects :)
To add to my workaround above: originally I increased MaxSessions to 20, which solved the problem, but now it's happening again and I have to increase it to 30.
For the record: with DVC version 2.43.2 (pip) I also encountered the "SSH: ValueError: Can't create any SFTP connections!" error message, but with the --jobs 1 flag the dvc push command worked again.
I'm running dvc v2.47.0 and having this same issue. I used --jobs 1 as suggested above, and it worked. Running a plain dvc push afterwards, I get the error.
> dvc doctor
DVC version: 2.47.0 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-5.10.0-21-amd64-x86_64-with-glibc2.31
Subprojects:
Supports:
azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
gdrive (pydrive2 = 1.15.1),
gs (gcsfs = 2023.3.0),
hdfs (fsspec = 2023.3.0, pyarrow = 11.0.0),
http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
oss (ossfs = 2021.8.0),
s3 (s3fs = 2023.3.0, boto3 = 1.24.59),
ssh (sshfs = 2023.1.0),
webdav (webdav4 = 0.9.8),
webdavs (webdav4 = 0.9.8),
webhdfs (fsspec = 2023.3.0)
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/nvme1n1p1
Caches: local
Remotes: ssh
Workspace directory: btrfs on /dev/nvme1n1p1
Repo: dvc, git
neofetch:
john@john-OptiPlex-7040
-----------------------
OS: Debian GNU/Linux 11 (bullseye) x86_64
Host: OptiPlex 7040
Kernel: 5.10.0-21-amd64
Uptime: 3 days, 10 hours, 59 mins
Packages: 2876 (dpkg)
Shell: zsh /usr/bin/zsh: /home/john/anaconda3/envs/neurogram/lib/libtinfo.so.6: no versi
Resolution: 3840x2160
Terminal: /dev/pts/5
CPU: Intel i7-6700 (8) @ 4.000GHz
GPU: Intel HD Graphics 530
GPU: NVIDIA GeForce GT 1030
Memory: 14225MiB / 48074MiB
I've traced the error to dvc version 2.42.0.
In DVC version 2.41.1, dvc push still works.
In DVC version 2.42.0, the push no longer works, and an error is produced:
ERROR: unexpected error - Can't create any SFTP connections!
The error persists in the latest (as of today) version, dvc == 2.50.1.
@drozzy Do you get ERROR: unexpected error - Can't create any SFTP connections! with dvc push --jobs 1?
I've only taken a quick look based on the version numbers you provided. I'm wondering if it could be related to https://github.com/iterative/dvc-objects/pull/187, which was released in dvc-objects 0.18.1 -> dvc-data 0.34.0 -> dvc 2.42.0.
It's probably https://github.com/iterative/dvc-objects/pull/180. We are much more aggressive than before about trying to keep the # of active coroutines saturated in the new(er) batching behavior. I think we probably only need to cap the default fs._JOBS value in dvc-ssh so that it aligns with the underlying sshfs max_sessions/pool size.
Same problem with v2.51.
Adding --jobs=1 allowed the push to go through.
But pretty soon I am going to push about 15 TB of data to a backup remote and I'd like to be able to do this with maximum efficiency. Even if there is not a fix yet, I'd like to know how to calculate what value to give --jobs in order to maximise throughput but still work.
@johnyaku Are you depending on the changes from 2.41.1 to 2.51? If not, I would suggest you simply install version 2.41.1 and then push those 15 TB.
The 15 TB is split across two caches. (We had to move the cache when the old volume filled up, but the new volume has had gc run on it, so some of the earlier data is not there any more.) It is all on a GCS remote, so we could pull it all and then push it to the SSH remote, but the egress costs would be a killer.
Instead I want to exploit --allow-missing to push to the new remote from two different instances of the registry -- one pointing at the old cache, and the other pointing at the new cache. Between them we should be able to get all of the data onto the new remote, without any egress costs. We'll keep the GCS as insurance, and for when we compute on GCP, but for on-prem compute we will use the new remote.
It's probably iterative/dvc-objects#180. We are much more aggressive than before about trying to keep the # of active coroutines saturated in the new(er) batching behavior. I think we probably only need to cap the default fs._JOBS value in dvc-ssh so that it aligns with the underlying sshfs max_sessions/pool size.
@pmrowla What's the level of effort to do this?
Capping the default jobs value is a one-line change, but the real issue here is that with SSH a lot of it depends on the user's network as well as the actual SSH server they are connecting to. There is not really a one-size-fits-all default value that will work for everyone, which is why the --jobs option exists.
To clarify for anyone experiencing this issue: the default --jobs value is 4 times the # of CPU cores available on a user's machine. In older DVC releases, the default value for SSH remotes used to be limited to 4 (independent of the # of CPU cores). In current releases, the SSH-specific limit is lifted, since 4 is too low and artificially limits performance for a lot of users.
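As a hedged sketch of that arithmetic (the 4× multiplier is the default described above; the value of 10 is the common sshd MaxSessions default, not something DVC reports):

```python
import os

# Default --jobs heuristic described above: 4 x the number of CPU cores.
cores = os.cpu_count() or 1
default_jobs = 4 * cores

# Common server-side sshd MaxSessions default (per connection).
SSHD_MAX_SESSIONS = 10

# On most multi-core machines the client default easily exceeds the
# server limit, e.g. 8 cores -> 32 jobs vs. a 10-session cap.
print(default_jobs, default_jobs > SSHD_MAX_SESSIONS)
```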
If you experience the "cannot create SFTP connections" error, the suggested fix is to try running with smaller --jobs values. The suggestion in this thread to use --jobs=1 will generally fix the issue, but with significantly worse performance than higher --jobs values. Ideally, you want to use the largest --jobs value that does not cause issues with your particular client + SSH server setup (and it will likely require some trial and error to find the ideal value).
Once you've identified a value that works for you, you can also set it as part of your remote config so that you don't need to specify --jobs on the command line each time.
$ dvc remote modify [--local] my-ssh-remote jobs 4
(The use of --local is optional, but this config option is likely specific to your particular machine and probably does not need to be git-committed in the default repo-wide .dvc/config.)
I believe the number of sessions is set by the MaxSessions parameter in /etc/ssh/sshd_config on the server, and I believe the default is 10. Setting the --jobs parameter to 4x the number of cores is grossly over the limit of 10 for all but one- or two-core CPUs. For example, my CPU has 24 cores, 32 threads, which would be 96 or 128 simultaneous connections.
A safe default for --jobs would seem to be 8. This would allow for two other ssh/sftp connections from other applications. Those running high capacity servers could increase --jobs as they see fit, and as their MaxSessions allows.
Capping the default jobs value is a one-line change, but the real issue here is that with SSH a lot of it depends on the user's network as well as the actual SSH server they are connecting to. There is not really a one size fits all default value that will work for everyone, which is why the --jobs option exists.
Putting https://github.com/iterative/dvc-ssh/issues/16#issuecomment-1491774178 in the docs would be nice but, given the number of reports here and in other channels, I think we should also go back to a more conservative default number and prevent errors for the average user / default case.
A safe default for --jobs would seem to be 8. This would allow for two other ssh/sftp connections from other applications. Those running high capacity servers could increase --jobs as they see fit, and as their MaxSessions allows.
8, or a more conservative 4, sounds good to me.
@pmrowla did we get many complaints about performance when it was capped at 4?
I believe the number of sessions is set by the MaxSessions parameter in /etc/ssh/sshd_config on the server, and I believe the default is 10.
This is correct, and we do have a client-level maximum sessions value, which is set to a limit of 10 regardless of your --jobs setting.
Setting the --jobs parameter to 4x the number of cores is grossly over the limit of 10 for all but one- or two-core cpus. For example, my cpu has 24 cores, 32 threads, which would be 96 or 128 simultaneous connections.
A safe default for --jobs would seem to be 8. This would allow for two other ssh/sftp connections from other applications. Those running high capacity servers could increase --jobs as they see fit, and as their MaxSessions allows.
@pmrowla did we get many complaints about performance when it was capped at 4?
These questions are related, and no, it was not changed due to performance complaints. The issue is that the way the --jobs setting in DVC works has changed vs the old behavior now that we use fsspec under the hood, and it seemed unnecessary to maintain separate defaults per DVC remote type.
Previously, --jobs was a hard limit on the number of parallel threads (with a single SFTP session per thread) used for network transfers.
The way it works now is that --jobs is a limit on the number of asyncio coroutines fsspec will allow to be active at a time. This batch of active coroutines is then also throttled by the underlying filesystem, which should be using a connection pool of whatever size the filesystem implementation decides. In sshfs, the way this is supposed to work is that we divide up network requests between the sessions in our pool, which is currently SFTPSoftChannelPool(max_sessions=10).
So in theory, it should still be safe to have a relatively high number of jobs. --jobs=128 doesn't mean that we try to open 128 simultaneous SFTP sessions; it means that we keep 128 queued requests at a time, which are actually only handled up to 10 at a time (via our session pool).
In practice, it may be that there is a problem with the session pool implementations in sshfs, where they don't properly handle cases where the # of active coroutines is larger than the pool size, which is why I initially suggested that we just use fs.config.max_sessions as the allowed maximum for --jobs.
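A minimal sketch of the intended behavior described above: many queued coroutines ("jobs") share a small fixed number of channels. The names here are illustrative (an asyncio semaphore stands in for the SFTPSoftChannelPool); this is not the sshfs implementation.

```python
import asyncio

MAX_SESSIONS = 10  # stand-in for the channel pool size / sshd MaxSessions
JOBS = 128         # queued coroutines, analogous to --jobs=128

async def main():
    pool = asyncio.Semaphore(MAX_SESSIONS)  # stand-in for the channel pool
    active = 0
    peak = 0

    async def request(i):
        nonlocal active, peak
        async with pool:            # wait for a free "channel"
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # simulate one SFTP round trip
            active -= 1

    # All 128 requests are queued at once, but the semaphore ensures
    # that at most MAX_SESSIONS of them are in flight concurrently.
    await asyncio.gather(*(request(i) for i in range(JOBS)))
    return peak

peak = asyncio.run(main())
print(f"peak concurrent channels: {peak}")  # never exceeds MAX_SESSIONS
```

The point of the sketch: a high jobs count only grows the queue, not the number of simultaneous sessions, as long as the pool throttling works correctly.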
The way it works now is that --jobs is a limit for the number of asyncio coroutines fsspec will allow to be active at a time. This batch of active coroutines is then also throttled by the underlying filesystem, which should be using a connection pool of whatever size the filesystem implementation decides. In sshfs, the way this is supposed to work is that we divide up any network requests between sessions in our pool, which is currently SFTPSoftChannelPool(max_sessions=10).
So in theory, it should still be safe to have a relatively high number of jobs. --jobs=128 doesn't mean that we try to open 128 simultaneous SFTP sessions, it means that we keep 128 queued requests at a time, that are actually only handled up to 10 at a time (via our session pool).
Thanks for the explanation!
In practice, it may be that there is a problem with session pool implementations in sshfs where it doesn't properly handle cases where the # of active coroutines is larger than the pool size, which is why I initially suggested that we just use fs.config.max_sessions as the allowed maximum for --jobs.
The code appears to have been untouched for 2 years, and we have heavily changed its usage upstream. It may be worth dedicating some time to reviewing the implementation on top of that change.
I think we should also consider exposing max_sessions as an SSH remote config option (assuming that we look into fixing the pool behavior), given that there is now a distinct difference between --jobs and the session count, and that the server-side limit is based on the session count.
As per my earlier comment, the only semi-viable solution is to use the old version of dvc, 2.41.1.
This makes the git hooks that automatically invoke dvc push work (those hooks are installed by dvc).
However, the old version 2.41.1 breaks the VS Code DVC plugin.
@drozzy Does the suggestion above to use dvc remote modify [--local] my-ssh-remote jobs 4 solve your issue? This should essentially match the behavior in 2.41.1.
@dberenbaum Yes, dvc remote modify my-ssh-remote jobs 4 fixed the issue. Thank you.
There was a bug in the sshfs soft channel pool handling that caused this issue in cases where jobs exceeded the server's MaxSessions count. This will be fixed in the next sshfs/dvc-ssh release.
After the fix, it should no longer be necessary for most users to set --jobs for SSH remotes (and it will be safe to use the default number of jobs even with a high CPU core count). The soft channel pool will open as many channels as allowed by the server (up to the sshfs default of 10) and then divide up to --jobs # of coroutines between the available pool channels as expected.
I think it is still worth exposing max_sessions to control the pool behavior. In some situations users may want to explicitly set this to a value lower than the server's MaxSessions in order to ensure that some number of SSH sessions are not used by DVC (i.e. to leave some dedicated number of sessions available for user ssh shell connections).
@pmrowla Thank you for looking into it!
This fix will be available in the next DVC release; in the meantime, users with pip installations can also get the fix with
$ pip install dvc-ssh==2.22.1
Great work, thank you! Can confirm this fixes my case.
Bug Report
I open this issue as a follow-up to https://github.com/iterative/dvc/issues/6138
Description
dvc push raises an error when trying to push to an SFTP remote. It used to work with older versions. The SFTP remote I use is a personal Raspberry Pi server. I did not change anything on the server.
Reproduce
Unfortunately, since I use a private server, I don't know whether it would be easy to reproduce.
After updating dvc, I tried to run dvc push and I got these logs:
Expected
Since I did not change the configuration of my server and it used to work, I would expect dvc push to work.
Environment information