Open tanguy-s opened 7 months ago
Does it freeze if you use less no. of jobs, say 8?
Btw, when using --show-stack
, you can use CTRL + \ to see the stack trace without exiting with Ctrl-C.
Hi @skshetry,
No, seems to be working only with -j 1
. As a note, default option for the number of jobs is 4 * cpu_count()
. I have 32 CPUs so using 32 is already quite lower than the default 128.
Using 1 job seems to disable the usage of async. But that makes the transfers very slow (~1.5file/s) vs the parallel mode.
My best guess is that this issue is related to s3fs/aiobotocore and most likely specifically to endpoint urls remotes.
@tanguy-s, so -j2
/-j4
does not work either?
I use minio server for testing without any issues, so I doubt it's an endpoint
url thing.
@skshetry testing with -j4
also leads to a hanging process for more than an hour, with no traffic :
root 718833 24.9 0.6 1351708 882948 ? Sl 14:45 16:29 /usr/bin/python3 /usr/local/bin/dvc push -j 4 --verbose -R pipeline/module/
Running sudo lsof -p 718833
:
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
Output information may be incomplete.
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dvc 718833 root cwd DIR 0,77 4096 28455751 /root/workdir
dvc 718833 root rtd DIR 0,77 4096 28455747 /
dvc 718833 root txt REG 0,77 5904904 25066105 /usr/bin/python3.10
dvc 718833 root mem REG 259,2 25066105 /usr/bin/python3.10 (path inode=12213520)
dvc 718833 root mem REG 259,2 25196978 /usr/local/lib/python3.10/dist-packages/_ruamel_yaml.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25067085 /usr/lib/python3.10/lib-dynload/_posixshmem.cpython-310-x86_64-linux-gnu.so (path inode=12213947)
dvc 718833 root mem REG 259,2 26369519 /usr/local/lib/python3.10/dist-packages/wrapt/_wrappers.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25593675 /usr/local/lib/python3.10/dist-packages/frozenlist/_frozenlist.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25197144 /usr/local/lib/python3.10/dist-packages/aiohttp/_websocket.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25197140 /usr/local/lib/python3.10/dist-packages/aiohttp/_http_parser.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 26361892 /usr/local/lib/python3.10/dist-packages/psutil/_psutil_posix.abi3.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 22419094 /usr/lib/x86_64-linux-gnu/libsqlite3.so.0.8.6 (path inode=12191632)
dvc 718833 root mem REG 259,2 25067087 /usr/lib/python3.10/lib-dynload/_sqlite3.cpython-310-x86_64-linux-gnu.so (path inode=12213949)
dvc 718833 root mem REG 259,2 25067098 /usr/lib/python3.10/lib-dynload/_zoneinfo.cpython-310-x86_64-linux-gnu.so (path inode=12213969)
dvc 718833 root mem REG 259,2 22416292 /usr/lib/x86_64-linux-gnu/libuuid.so.1.3.0 (path inode=12214405)
dvc 718833 root mem REG 259,2 26361891 /usr/local/lib/python3.10/dist-packages/psutil/_psutil_linux.abi3.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25852843 /usr/local/lib/python3.10/dist-packages/orjson/orjson.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25067062 /usr/lib/python3.10/lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so (path inode=12213915)
dvc 718833 root mem REG 259,2 22416272 /usr/lib/x86_64-linux-gnu/libssl.so.3 (path inode=12197296)
dvc 718833 root mem REG 259,2 25067088 /usr/lib/python3.10/lib-dynload/_ssl.cpython-310-x86_64-linux-gnu.so (path inode=12213943)
dvc 718833 root mem REG 259,2 22550314 /usr/lib/x86_64-linux-gnu/libmpdec.so.2.5.1 (path inode=12217455)
dvc 718833 root mem REG 259,2 25067077 /usr/lib/python3.10/lib-dynload/_decimal.cpython-310-x86_64-linux-gnu.so (path inode=12213935)
dvc 718833 root mem REG 259,2 22416216 /usr/lib/x86_64-linux-gnu/liblzma.so.5.2.5 (path inode=12217416)
dvc 718833 root mem REG 259,2 25067083 /usr/lib/python3.10/lib-dynload/_multiprocessing.cpython-310-x86_64-linux-gnu.so (path inode=12213946)
dvc 718833 root mem REG 259,2 25197142 /usr/local/lib/python3.10/dist-packages/aiohttp/_http_writer.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25197137 /usr/local/lib/python3.10/dist-packages/aiohttp/_helpers.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 26369595 /usr/local/lib/python3.10/dist-packages/yarl/_quoting_c.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25725057 /usr/local/lib/python3.10/dist-packages/multidict/_multidict.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 28468945 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/a50a811671ae05106862988d9a234623086b17ce38174f6ba50dc7d2a8686509/cache.db (stat: No such file or directory)
dvc 718833 root mem REG 259,2 28487270 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/a50a811671ae05106862988d9a234623086b17ce38174f6ba50dc7d2a8686509/cache.db-shm (stat: No such file or directory)
dvc 718833 root mem REG 259,2 28487268 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/data/db.db-shm (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25299466 /usr/local/lib/python3.10/dist-packages/dulwich/_pack.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25299464 /usr/local/lib/python3.10/dist-packages/dulwich/_objects.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc 718833 root mem REG 259,2 25067100 /usr/lib/python3.10/lib-dynload/mmap.cpython-310-x86_64-linux-gnu.so (path inode=12213971)
dvc 718833 root mem REG 259,2 25067079 /usr/lib/python3.10/lib-dynload/_json.cpython-310-x86_64-linux-gnu.so (path inode=12213938)
dvc 718833 root mem REG 259,2 22416173 /usr/lib/x86_64-linux-gnu/libcrypto.so.3 (path inode=12197292)
dvc 718833 root mem REG 259,2 25067078 /usr/lib/python3.10/lib-dynload/_hashlib.cpython-310-x86_64-linux-gnu.so (path inode=12213941)
dvc 718833 root mem REG 259,2 25067072 /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so (path inode=12213925)
dvc 718833 root mem REG 259,2 25061653 /usr/lib/x86_64-linux-gnu/libpthread.so.0 (path inode=12191860)
dvc 718833 root mem REG 259,2 25067095 /usr/lib/python3.10/lib-dynload/_uuid.cpython-310-x86_64-linux-gnu.so (path inode=12213966)
dvc 718833 root mem REG 259,2 25067105 /usr/lib/python3.10/lib-dynload/termios.cpython-310-x86_64-linux-gnu.so (path inode=12213980)
dvc 718833 root mem REG 259,2 25067081 /usr/lib/python3.10/lib-dynload/_lzma.cpython-310-x86_64-linux-gnu.so (path inode=12213944)
dvc 718833 root mem REG 259,2 22416162 /usr/lib/x86_64-linux-gnu/libbz2.so.1.0.4 (path inode=12216769)
dvc 718833 root mem REG 259,2 25067104 /usr/lib/python3.10/lib-dynload/resource.cpython-310-x86_64-linux-gnu.so (path inode=12213979)
dvc 718833 root mem REG 259,2 25067063 /usr/lib/python3.10/lib-dynload/_bz2.cpython-310-x86_64-linux-gnu.so (path inode=12213916)
dvc 718833 root mem REG 259,2 25067070 /usr/lib/python3.10/lib-dynload/_contextvars.cpython-310-x86_64-linux-gnu.so (path inode=12213923)
dvc 718833 root mem REG 259,2 22416183 /usr/lib/x86_64-linux-gnu/libffi.so.8.1.0 (path inode=12217006)
dvc 718833 root mem REG 259,2 25067084 /usr/lib/python3.10/lib-dynload/_opcode.cpython-310-x86_64-linux-gnu.so (path inode=12213942)
dvc 718833 root mem REG 259,2 25067086 /usr/lib/python3.10/lib-dynload/_queue.cpython-310-x86_64-linux-gnu.so (path inode=12213948)
dvc 718833 root mem REG 259,2 22544753 /usr/lib/locale/locale-archive (path inode=12222227)
dvc 718833 root mem REG 259,2 25061492 /usr/lib/x86_64-linux-gnu/libc.so.6 (path inode=12191829)
dvc 718833 root mem REG 259,2 22416296 /usr/lib/x86_64-linux-gnu/libz.so.1.2.11 (path inode=12218111)
dvc 718833 root mem REG 259,2 25068648 /usr/lib/x86_64-linux-gnu/libexpat.so.1.8.7 (path inode=12197320)
dvc 718833 root mem REG 259,2 25061594 /usr/lib/x86_64-linux-gnu/libm.so.6 (path inode=12191840)
dvc 718833 root mem REG 259,2 25061278 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache (path inode=12192208)
dvc 718833 root mem REG 259,2 25061292 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 (path inode=12191821)
dvc 718833 root 0r FIFO 0,14 0t0 1656898 pipe
dvc 718833 root 1w FIFO 0,14 0t0 1656899 pipe
dvc 718833 root 2w FIFO 0,14 0t0 1656900 pipe
dvc 718833 root 3ur REG 0,77 33943552 28456162 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/data/db.db
dvc 718833 root 4u REG 0,77 12825592 28487263 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/data/db.db-wal
dvc 718833 root 5ur REG 0,77 32768 28487268 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/data/db.db-shm
dvc 718833 root 6ur REG 0,77 28672 28468945 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/a50a811671ae05106862988d9a234623086b17ce38174f6ba50dc7d2a8686509/cache.db
dvc 718833 root 7u REG 0,77 0 28487269 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/a50a811671ae05106862988d9a234623086b17ce38174f6ba50dc7d2a8686509/cache.db-wal
dvc 718833 root 8ur REG 0,77 32768 28487270 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/a50a811671ae05106862988d9a234623086b17ce38174f6ba50dc7d2a8686509/cache.db-shm
dvc 718833 root 9uW REG 0,77 7 28456164 /root/workdir/.dvc/tmp/lock
dvc 718833 root 10u a_inode 0,15 0 64 [eventpoll]
dvc 718833 root 11u sock 0,8 0t0 1660099 protocol: UNIX-STREAM
dvc 718833 root 12u sock 0,8 0t0 1660100 protocol: UNIX-STREAM
dvc 718833 root 13r REG 0,77 16512 28575616 /root/workdir/.dvc/cache/files/md5/c2/d0f57cee3984a828f52e9a418ee13f
dvc 718833 root 14r REG 0,77 16826942 28581712 /root/workdir/.dvc/cache/files/md5/42/4e5f600b527e959108b7b93ec2e8be
dvc 718833 root 15r REG 0,77 16512 28576756 /root/workdir/.dvc/cache/files/md5/3b/ed58c73ff7ea87f6ff481911b502a8
dvc 718833 root 16r REG 0,77 131152 28471540 /root/workdir/.dvc/cache/files/md5/a9/7b70eabe4ed843b5fa09849b3ab74d
dvc 718833 root 17u sock 0,8 0t0 1655300 protocol: TCP
dvc 718833 root 18u sock 0,8 0t0 1659235 protocol: TCP
dvc 718833 root 19u sock 0,8 0t0 1650439 protocol: TCP
dvc 718833 root 21u sock 0,8 0t0 1662291 protocol: TCP
Shows the 4 dvc cache files that are stuck and the 4 TCP sockets still opened.
I have tried reproducing on a new remote with :
for i in {1..2000}; do dd if=/dev/urandom of=datad-${i}.png bs=1M count=18; done
But, no luck it works fine.
On my original remote, it fails almost systematically.
Do you have any hints on how I could further debug ? how could I enable boto3, botocore, s3transfer logs ? Thank you
This issue is specific to a s3 compatible cloud remote (Scaleway) which is hanging forever on some transfers. I have switched back to aws which works fine.
However, there is potentially a remaining issue in the handling of these blocked transfers.
Bug Report
push: hangs after data transfer to s3 compatible remote
Description
When pushing files to s3 compatible storage with endpointurl with :
dvc push -vv --show-stack
ordvc push -j 32 -vv --show-stack
DVC intermittently hangs after pushing data to storage :
No verbose output after this point however, Ctrl+C gives systematically the following traceback :
Seems like some futures of the underlying S3FileSystem in dvc-objects are never returning and do not have a timeout.
Running
dvc push -j 1 -vv
works, however seems quite slower on the status collection (approx. 30min for 18k files).Environment information
x86 Ubuntu 22.04 Docker
Output of
dvc doctor
:Same isssue using latest 3.48 DVC.
Output of
dvc config -l
: