iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0
13.89k stars 1.19k forks source link

push: hangs after data transfer to s3 compatible remote #10374

Open tanguy-s opened 7 months ago

tanguy-s commented 7 months ago

Bug Report

push: hangs after data transfer to s3 compatible remote

Description

When pushing files to s3 compatible storage with endpointurl with : dvc push -vv --show-stack or dvc push -j 32 -vv --show-stack

DVC intermittently hangs after pushing data to storage :

2024-03-26 10:56:55,858 DEBUG: Preparing to transfer data from '/root/workdir/.dvc/cache' to 's3://my-s3-bucket/'
2024-03-26 10:56:55,858 DEBUG: Preparing to collect status from 'my-s3-bucket/'
2024-03-26 10:56:55,858 DEBUG: Collecting status from 'my-s3-bucket/'   
2024-03-26 10:56:55,860 DEBUG: Querying 3 oids via object_exists                 
2024-03-26 10:56:56,544 DEBUG: Querying 0 oids via object_exists                 
2024-03-26 10:56:57,582 DEBUG: Preparing to transfer data from '/root/workdir/.dvc/cache/files/md5' to 's3://my-s3-bucket/files/md5'                   
2024-03-26 10:56:57,583 DEBUG: Preparing to collect status from 'my-s3-bucket/files/md5'
2024-03-26 10:56:57,591 DEBUG: Collecting status from 'my-s3-bucket/files/md5'
2024-03-26 10:56:57,592 DEBUG: Querying 2 oids via object_exists                 
2024-03-26 10:56:58,639 DEBUG: Estimated remote size: 749568 files               
2024-03-26 10:56:58,639 DEBUG: Querying 28882 oids via traverse                  
2024-03-26 10:58:02,859 DEBUG: Preparing to collect status from '/root/workdir/.dvc/cache/files/md5'                                                            
2024-03-26 10:58:02,866 DEBUG: Collecting status from '/root/workdir/.dvc/cache/files/md5'
2024-03-26 10:58:03,559 DEBUG: transfer dir: md5: e73f784c9d7a9d79aa8ddbdef314e12d.dir with 18343 files                                                           
Pushing                                                |0.00 [01:07,     ?file/s]
100%|█████████▉|Pushing to s3              18.3k/18.3k [19:42<00:00,  19.8file/s]

No verbose output after this point however, Ctrl+C gives systematically the following traceback :

2024-03-26 12:00:14,267 ERROR: interrupted by the user                                             
Traceback (most recent call last):                                                                 
  File "/usr/local/lib/python3.10/dist-packages/dvc/repo/push.py", line 144, in push
    push_transferred, push_failed = ipush(
  File "/usr/local/lib/python3.10/dist-packages/dvc_data/index/push.py", line 84, in push
    result = transfer(
  File "/usr/local/lib/python3.10/dist-packages/dvc_data/hashfile/transfer.py", line 224, in transfer
    failed = _do_transfer(
  File "/usr/local/lib/python3.10/dist-packages/dvc_data/hashfile/transfer.py", line 93, in _do_transfer
    dir_fails = _add(src, dest, bound_file_ids, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dvc_data/hashfile/transfer.py", line 165, in _add
    dest.add(
  File "/usr/local/lib/python3.10/dist-packages/dvc_data/hashfile/db/__init__.py", line 111, in add
    transferred = super().add(
  File "/usr/local/lib/python3.10/dist-packages/dvc_objects/db.py", line 188, in add
    generic.transfer(
  File "/usr/local/lib/python3.10/dist-packages/dvc_objects/fs/generic.py", line 319, in transfer
    copy(
  File "/usr/local/lib/python3.10/dist-packages/dvc_objects/fs/generic.py", line 87, in copy
    return _put(
  File "/usr/local/lib/python3.10/dist-packages/dvc_objects/fs/generic.py", line 172, in _put
    for i, result in enumerate(fut.result()):
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 453, in result
    self._condition.wait(timeout)
  File "/usr/lib/python3.10/threading.py", line 320, in wait
    waiter.acquire()
KeyboardInterrupt

Seems like some futures of the underlying S3FileSystem in dvc-objects are never returning and do not have a timeout.

Running dvc push -j 1 -vv works, however seems quite slower on the status collection (approx. 30min for 18k files).

Environment information

x86 Ubuntu 22.04 Docker

aiobotocore==2.12.1
awscli==1.32.51
awscli-plugin-endpoint==0.4
boto3==1.34.51
botocore==1.34.51
dvc==3.36.0
dvc-data==3.2.0
dvc-http==2.32.0
dvc-objects==3.0.6
dvc-render==1.0.1
dvc-s3==3.1.0
dvc-studio-client==0.20.0
dvc-task==0.4.0
fsspec==2024.3.1
s3fs==2024.3.1
s3transfer==0.10.1

Output of dvc doctor:

DVC version: 3.36.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.5.0-26-generic-x86_64-with-glibc2.35
Subprojects:
    dvc_data = 3.2.0
    dvc_objects = 3.0.6
    dvc_render = 1.0.1
    dvc_task = 0.4.0
    scmrepo = 2.1.1
Supports:
    http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2024.3.1, boto3 = 1.34.51)
Config:
    Global: /root/.config/dvc
    System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: s3, s3
Workspace directory: overlay on overlay
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/da3f4f6485fee7c550b5a6ccb3e96e47

Same isssue using latest 3.48 DVC.

Output of dvc config -l:

remote.dvc-cache.url=s3://my-s3-bucket/
remote.dvc-cache.endpointurl=https://endpoint.url
remote.dvc-cache.profile=scw
remote.scans.url=s3://my-s3-bucket-2/
remote.scans.endpointurl=https://endpoint.url
remote.scans.profile=scw
cache.type=hardlink
core.remote=dvc-cache
core.hardlink_lock=false
skshetry commented 7 months ago

Does it freeze if you use less no. of jobs, say 8?

skshetry commented 7 months ago

Btw, when using --show-stack, you can use CTRL + \ to see the stack trace without exiting with Ctrl-C.

tanguy-s commented 7 months ago

Hi @skshetry,

No, seems to be working only with -j 1. As a note, default option for the number of jobs is 4 * cpu_count(). I have 32 CPUs so using 32 is already quite lower than the default 128.

Using 1 job seems to disable the usage of async. But that makes the transfers very slow (~1.5file/s) vs the parallel mode.

My best guess is that this issue is related to s3fs/aiobotocore and most likely specifically to endpoint urls remotes.

skshetry commented 7 months ago

@tanguy-s, so -j2/-j4 does not work either?

skshetry commented 7 months ago

I use minio server for testing without any issues, so I doubt it's an endpoint url thing.

tanguy-s commented 7 months ago

@skshetry testing with -j4 also leads to a hanging process for more than an hour, with no traffic : root 718833 24.9 0.6 1351708 882948 ? Sl 14:45 16:29 /usr/bin/python3 /usr/local/bin/dvc push -j 4 --verbose -R pipeline/module/

Running sudo lsof -p 718833:

lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
      Output information may be incomplete.
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
COMMAND    PID USER   FD      TYPE DEVICE SIZE/OFF     NODE NAME
dvc     718833 root  cwd       DIR   0,77     4096 28455751 /root/workdir
dvc     718833 root  rtd       DIR   0,77     4096 28455747 /
dvc     718833 root  txt       REG   0,77  5904904 25066105 /usr/bin/python3.10
dvc     718833 root  mem       REG  259,2          25066105 /usr/bin/python3.10 (path inode=12213520)
dvc     718833 root  mem       REG  259,2          25196978 /usr/local/lib/python3.10/dist-packages/_ruamel_yaml.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25067085 /usr/lib/python3.10/lib-dynload/_posixshmem.cpython-310-x86_64-linux-gnu.so (path inode=12213947)
dvc     718833 root  mem       REG  259,2          26369519 /usr/local/lib/python3.10/dist-packages/wrapt/_wrappers.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25593675 /usr/local/lib/python3.10/dist-packages/frozenlist/_frozenlist.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25197144 /usr/local/lib/python3.10/dist-packages/aiohttp/_websocket.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25197140 /usr/local/lib/python3.10/dist-packages/aiohttp/_http_parser.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          26361892 /usr/local/lib/python3.10/dist-packages/psutil/_psutil_posix.abi3.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          22419094 /usr/lib/x86_64-linux-gnu/libsqlite3.so.0.8.6 (path inode=12191632)
dvc     718833 root  mem       REG  259,2          25067087 /usr/lib/python3.10/lib-dynload/_sqlite3.cpython-310-x86_64-linux-gnu.so (path inode=12213949)
dvc     718833 root  mem       REG  259,2          25067098 /usr/lib/python3.10/lib-dynload/_zoneinfo.cpython-310-x86_64-linux-gnu.so (path inode=12213969)
dvc     718833 root  mem       REG  259,2          22416292 /usr/lib/x86_64-linux-gnu/libuuid.so.1.3.0 (path inode=12214405)
dvc     718833 root  mem       REG  259,2          26361891 /usr/local/lib/python3.10/dist-packages/psutil/_psutil_linux.abi3.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25852843 /usr/local/lib/python3.10/dist-packages/orjson/orjson.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25067062 /usr/lib/python3.10/lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so (path inode=12213915)
dvc     718833 root  mem       REG  259,2          22416272 /usr/lib/x86_64-linux-gnu/libssl.so.3 (path inode=12197296)
dvc     718833 root  mem       REG  259,2          25067088 /usr/lib/python3.10/lib-dynload/_ssl.cpython-310-x86_64-linux-gnu.so (path inode=12213943)
dvc     718833 root  mem       REG  259,2          22550314 /usr/lib/x86_64-linux-gnu/libmpdec.so.2.5.1 (path inode=12217455)
dvc     718833 root  mem       REG  259,2          25067077 /usr/lib/python3.10/lib-dynload/_decimal.cpython-310-x86_64-linux-gnu.so (path inode=12213935)
dvc     718833 root  mem       REG  259,2          22416216 /usr/lib/x86_64-linux-gnu/liblzma.so.5.2.5 (path inode=12217416)
dvc     718833 root  mem       REG  259,2          25067083 /usr/lib/python3.10/lib-dynload/_multiprocessing.cpython-310-x86_64-linux-gnu.so (path inode=12213946)
dvc     718833 root  mem       REG  259,2          25197142 /usr/local/lib/python3.10/dist-packages/aiohttp/_http_writer.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25197137 /usr/local/lib/python3.10/dist-packages/aiohttp/_helpers.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          26369595 /usr/local/lib/python3.10/dist-packages/yarl/_quoting_c.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25725057 /usr/local/lib/python3.10/dist-packages/multidict/_multidict.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          28468945 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/a50a811671ae05106862988d9a234623086b17ce38174f6ba50dc7d2a8686509/cache.db (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          28487270 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/a50a811671ae05106862988d9a234623086b17ce38174f6ba50dc7d2a8686509/cache.db-shm (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          28487268 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/data/db.db-shm (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25299466 /usr/local/lib/python3.10/dist-packages/dulwich/_pack.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25299464 /usr/local/lib/python3.10/dist-packages/dulwich/_objects.cpython-310-x86_64-linux-gnu.so (stat: No such file or directory)
dvc     718833 root  mem       REG  259,2          25067100 /usr/lib/python3.10/lib-dynload/mmap.cpython-310-x86_64-linux-gnu.so (path inode=12213971)
dvc     718833 root  mem       REG  259,2          25067079 /usr/lib/python3.10/lib-dynload/_json.cpython-310-x86_64-linux-gnu.so (path inode=12213938)
dvc     718833 root  mem       REG  259,2          22416173 /usr/lib/x86_64-linux-gnu/libcrypto.so.3 (path inode=12197292)
dvc     718833 root  mem       REG  259,2          25067078 /usr/lib/python3.10/lib-dynload/_hashlib.cpython-310-x86_64-linux-gnu.so (path inode=12213941)
dvc     718833 root  mem       REG  259,2          25067072 /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so (path inode=12213925)
dvc     718833 root  mem       REG  259,2          25061653 /usr/lib/x86_64-linux-gnu/libpthread.so.0 (path inode=12191860)
dvc     718833 root  mem       REG  259,2          25067095 /usr/lib/python3.10/lib-dynload/_uuid.cpython-310-x86_64-linux-gnu.so (path inode=12213966)
dvc     718833 root  mem       REG  259,2          25067105 /usr/lib/python3.10/lib-dynload/termios.cpython-310-x86_64-linux-gnu.so (path inode=12213980)
dvc     718833 root  mem       REG  259,2          25067081 /usr/lib/python3.10/lib-dynload/_lzma.cpython-310-x86_64-linux-gnu.so (path inode=12213944)
dvc     718833 root  mem       REG  259,2          22416162 /usr/lib/x86_64-linux-gnu/libbz2.so.1.0.4 (path inode=12216769)
dvc     718833 root  mem       REG  259,2          25067104 /usr/lib/python3.10/lib-dynload/resource.cpython-310-x86_64-linux-gnu.so (path inode=12213979)
dvc     718833 root  mem       REG  259,2          25067063 /usr/lib/python3.10/lib-dynload/_bz2.cpython-310-x86_64-linux-gnu.so (path inode=12213916)
dvc     718833 root  mem       REG  259,2          25067070 /usr/lib/python3.10/lib-dynload/_contextvars.cpython-310-x86_64-linux-gnu.so (path inode=12213923)
dvc     718833 root  mem       REG  259,2          22416183 /usr/lib/x86_64-linux-gnu/libffi.so.8.1.0 (path inode=12217006)
dvc     718833 root  mem       REG  259,2          25067084 /usr/lib/python3.10/lib-dynload/_opcode.cpython-310-x86_64-linux-gnu.so (path inode=12213942)
dvc     718833 root  mem       REG  259,2          25067086 /usr/lib/python3.10/lib-dynload/_queue.cpython-310-x86_64-linux-gnu.so (path inode=12213948)
dvc     718833 root  mem       REG  259,2          22544753 /usr/lib/locale/locale-archive (path inode=12222227)
dvc     718833 root  mem       REG  259,2          25061492 /usr/lib/x86_64-linux-gnu/libc.so.6 (path inode=12191829)
dvc     718833 root  mem       REG  259,2          22416296 /usr/lib/x86_64-linux-gnu/libz.so.1.2.11 (path inode=12218111)
dvc     718833 root  mem       REG  259,2          25068648 /usr/lib/x86_64-linux-gnu/libexpat.so.1.8.7 (path inode=12197320)
dvc     718833 root  mem       REG  259,2          25061594 /usr/lib/x86_64-linux-gnu/libm.so.6 (path inode=12191840)
dvc     718833 root  mem       REG  259,2          25061278 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache (path inode=12192208)
dvc     718833 root  mem       REG  259,2          25061292 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 (path inode=12191821)
dvc     718833 root    0r     FIFO   0,14      0t0  1656898 pipe
dvc     718833 root    1w     FIFO   0,14      0t0  1656899 pipe
dvc     718833 root    2w     FIFO   0,14      0t0  1656900 pipe
dvc     718833 root    3ur     REG   0,77 33943552 28456162 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/data/db.db
dvc     718833 root    4u      REG   0,77 12825592 28487263 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/data/db.db-wal
dvc     718833 root    5ur     REG   0,77    32768 28487268 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/data/db.db-shm
dvc     718833 root    6ur     REG   0,77    28672 28468945 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/a50a811671ae05106862988d9a234623086b17ce38174f6ba50dc7d2a8686509/cache.db
dvc     718833 root    7u      REG   0,77        0 28487269 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/a50a811671ae05106862988d9a234623086b17ce38174f6ba50dc7d2a8686509/cache.db-wal
dvc     718833 root    8ur     REG   0,77    32768 28487270 /var/tmp/dvc/repo/90499dc9249256d3fea64586df2e42b3/index/a50a811671ae05106862988d9a234623086b17ce38174f6ba50dc7d2a8686509/cache.db-shm
dvc     718833 root    9uW     REG   0,77        7 28456164 /root/workdir/.dvc/tmp/lock
dvc     718833 root   10u  a_inode   0,15        0       64 [eventpoll]
dvc     718833 root   11u     sock    0,8      0t0  1660099 protocol: UNIX-STREAM
dvc     718833 root   12u     sock    0,8      0t0  1660100 protocol: UNIX-STREAM
dvc     718833 root   13r      REG   0,77    16512 28575616 /root/workdir/.dvc/cache/files/md5/c2/d0f57cee3984a828f52e9a418ee13f
dvc     718833 root   14r      REG   0,77 16826942 28581712 /root/workdir/.dvc/cache/files/md5/42/4e5f600b527e959108b7b93ec2e8be
dvc     718833 root   15r      REG   0,77    16512 28576756 /root/workdir/.dvc/cache/files/md5/3b/ed58c73ff7ea87f6ff481911b502a8
dvc     718833 root   16r      REG   0,77   131152 28471540 /root/workdir/.dvc/cache/files/md5/a9/7b70eabe4ed843b5fa09849b3ab74d
dvc     718833 root   17u     sock    0,8      0t0  1655300 protocol: TCP
dvc     718833 root   18u     sock    0,8      0t0  1659235 protocol: TCP
dvc     718833 root   19u     sock    0,8      0t0  1650439 protocol: TCP
dvc     718833 root   21u     sock    0,8      0t0  1662291 protocol: TCP

Shows the 4 dvc cache files that are stuck and the 4 TCP sockets still opened.

I have tried reproducing on a new remote with : for i in {1..2000}; do dd if=/dev/urandom of=datad-${i}.png bs=1M count=18; done But, no luck it works fine.

On my original remote, it fails almost systematically.

Do you have any hints on how I could further debug ? how could I enable boto3, botocore, s3transfer logs ? Thank you

tanguy-s commented 6 months ago

This issue is specific to a s3 compatible cloud remote (Scaleway) which is hanging forever on some transfers. I have switched back to aws which works fine.

However, there is potentially a remaining issue in the handling of these blocked transfers.