iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

missing large file in remote storage after pushing #10448

Open xiaoFine opened 5 months ago

xiaoFine commented 5 months ago

Bug Report

push: large files are missing in remote storage

Description

After dvc push, large files (single file > 20 GB) are missing in the remote storage (Aliyun OSS), while small files are pushed successfully and their md5 objects can be found in the OSS path.

Reproduce

dvc init -f
dvc remote add myoss oss://mybucket/path -d
dvc remote modify myoss oss_endpoint somepublicendpoint
dvc remote modify myoss oss_key_id xxxx
dvc remote modify myoss oss_key_secret xxxxxxxx

dvc add large-chkpoint.pt

dvc push

Expected

I can find the md5 object for large-chkpoint.pt via the OSS dashboard.
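
(A hedged way to check what should end up on the remote, assuming the default DVC 3.x cache layout; the hash and size below are placeholders.)

cat large-chkpoint.pt.dvc
# prints something like:
# outs:
# - md5: d41d8cd98f00b204e9800998ecf8427e
#   size: 21474836480
#   path: large-chkpoint.pt

The pushed object should then sit at <remote>/files/md5/<first 2 hash chars>/<remaining chars>, e.g. oss://mybucket/path/files/md5/d4/1d8cd98f00b204e9800998ecf8427e, and a push that silently failed should show the file as missing on the remote in:

dvc status -c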

Environment information

Output of dvc doctor:

DVC version: 3.51.2 (pip)
-------------------------
Platform: Python 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.5
Supports:
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0)
Config:
        Global: /home/admins/.config/dvc
        System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: oss
Workspace directory: ext4 on /dev/nvme1n1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/1dec9b5bdab7926326d2cb372ee9b553

Additional Information (if any):

output of pushing log

> dvc push -vvv 
2024-06-04 16:56:23,537 DEBUG: v3.51.2 (pip), CPython 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
2024-06-04 16:56:23,538 DEBUG: command: /home/admins/miniconda3/envs/dvcenv/bin/dvc push -vvv
2024-06-04 16:56:23,538 TRACE: Namespace(quiet=0, verbose=3, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='push', jobs=9, targets=['triton/tensorrt_llm/1/rank0.engine'], remote='oss-qwen', all_branches=False, all_tags=False, all_commits=False, with_deps=False, recursive=False, run_cache=True, glob=False, func=<class 'dvc.commands.data_sync.CmdDataPush'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'dvc.cli.formatter.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2024-06-04 16:56:23,758 TRACE:     1.31 ms in collecting stages from /ws
2024-06-04 16:56:23,758 TRACE:   253.99 mks in collecting stages from /ws
...

2024-06-04 16:56:23,773 DEBUG: Checking if stage 'large-chckpoint.pt' is in 'dvc.yaml'
Collecting                                                                                                                                                 |1.00 [00:00,  135entry/s]
2024-06-04 16:56:23,889 DEBUG: Preparing to transfer data from '/ws/.dvc/cache' to 'oss://mybucket/path'
2024-06-04 16:56:23,889 DEBUG: Preparing to collect status from 'mybucket/path'
2024-06-04 16:56:23,889 DEBUG: Collecting status from 'mybucket/path'
2024-06-04 16:56:23,891 DEBUG: Querying 1 oids via object_exists
2024-06-04 16:56:24,228 DEBUG: Preparing to collect status from '/ws/.dvc/cache'                                                                       
2024-06-04 16:56:24,229 DEBUG: Collecting status from '/ws/.dvc/cache'                                                                                 
Pushing                                                          0/1 [00:00<?,     ?file/s]
/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/ossfs/async_oss.py:389: RuntimeWarning: coroutine 'resumable_upload' was never awaited
  await self._call_oss(
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Pushing
1 file pushed                                                                                                                                                                        
2024-06-04 16:56:24,292 DEBUG: Analytics is enabled.
2024-06-04 16:56:24,292 TRACE: Saving analytics report to /tmp/tmptx47o8pe
2024-06-04 16:56:24,354 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmptx47o8pe', '-vv']
2024-06-04 16:56:24,361 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmptx47o8pe', '-vv'] with pid 27977
2024-06-04 16:56:24,361 TRACE: Process 27869 exiting with 0

Is dvc-oss no longer maintained?

shcheklein commented 5 months ago

Is dvc-oss no longer maintained?

It was primarily maintained by @karajan1001. I would appreciate his input here.

As a workaround, could you try an S3-compatible interface - https://www.alibabacloud.com/help/en/oss/developer-reference/use-amazon-s3-sdks-to-access-oss ?

https://dvc.org/doc/user-guide/data-management/remote-storage/amazon-s3#s3-compatible-servers-non-amazon
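
(A rough sketch of such an S3-compatible remote setup, with placeholder remote name, bucket, endpoint, and credentials:)

dvc remote add -d oss-s3 s3://mybucket/path
dvc remote modify oss-s3 endpointurl https://oss-cn-hangzhou.aliyuncs.com
dvc remote modify --local oss-s3 access_key_id xxxx
dvc remote modify --local oss-s3 secret_access_key xxxxxxxx
dvc push -r oss-s3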

output of pushing log

Hmm, I don't see any details in the logs. Do you see any md5s / hashes for the files that are missing remotely? Is this the full log?

Could you try deleting /var/tmp/dvc/repo/1dec9b5bdab7926326d2cb372ee9b553 and running the command again in verbose mode?
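
(The suggestion above as commands; the path is the Repo.site_cache_dir reported by dvc doctor:)

rm -rf /var/tmp/dvc/repo/1dec9b5bdab7926326d2cb372ee9b553
dvc push -vvv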

xiaoFine commented 5 months ago

I created an empty workspace with a large.bin and a small.txt, and deleted all caches under `/var/tmp/dvc/repo`. Here is the push log:

[screenshot: push log]

Only the small file can be found in the remote.

P.S. The RuntimeWarning does not show up when only small files are pushed.

Still working on the S3-compatible route, but hitting a compatibility problem: ListObjectsV2 is called no matter whether listobjects is true or false.
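
(Presumably the toggle in question is the S3 remote's listobjects option; a sketch with a placeholder remote name - the ListObjectsV2 error in the log below shows up either way:)

dvc remote modify oss-s3 listobjects true
dvc push -r oss-s3 -vvv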

(dvcenv) admins@test-Ai-largemodel:/mnt/datadisk1/laien/ws-dvc$ dvc push -r oss-s3 -vvv
2024-06-06 10:16:37,283 DEBUG: v3.51.2 (pip), CPython 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
2024-06-06 10:16:37,283 DEBUG: command: /home/admins/miniconda3/envs/dvcenv/bin/dvc push -r oss-s3 -vvv
2024-06-06 10:16:37,283 TRACE: Namespace(quiet=0, verbose=3, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='push', jobs=None, targets=[], remote='oss-s3', all_branches=False, all_tags=False, all_commits=False, with_deps=False, recursive=False, run_cache=True, glob=False, func=<class 'dvc.commands.data_sync.CmdDataPush'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'dvc.cli.formatter.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2024-06-06 10:16:37,519 TRACE:    12.48 ms in collecting stages from /mnt/datadisk1/laien/ws-dvc
Collecting                                                                                                                                                                  |0.00 [00:00,    ?entry/s]
2024-06-06 10:16:37,542 DEBUG: Preparing to transfer data from '/mnt/datadisk1/laien/ws-dvc/.dvc/cache/files/md5' to 's3://[remote-path]/dvc/files/md5'
2024-06-06 10:16:37,542 DEBUG: Preparing to collect status from '[remote-path]/dvc/files/md5'
2024-06-06 10:16:37,542 DEBUG: Collecting status from '[remote-path]/dvc/files/md5'
Pushing          '[remote-path]/dvc/files/md5'|                                                                                                                   |0/? [00:00<?,    ?files/s]
Pushing
2024-06-06 10:16:37,896 ERROR: unexpected error - The specified key does not exist.: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist.        
Traceback (most recent call last):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 723, in _lsdir
    async for c in self._iterdir(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 773, in _iterdir
    async for i in it:
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/aiobotocore/paginate.py", line 30, in __anext__
    response = await self._make_request(current_kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/aiobotocore/client.py", line 412, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run
    return self.run()
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 64, in run
    processed_files_count = self.repo.push(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/repo/push.py", line 147, in push
    push_transferred, push_failed = ipush(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/index/push.py", line 76, in push
    result = transfer(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/hashfile/transfer.py", line 203, in transfer
    status = compare_status(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 179, in compare_status
    dest_exists, dest_missing = status(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 151, in status
    exists.update(odb.oids_exist(hashes, jobs=jobs, progress=pbar.callback))
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 423, in oids_exist
    remote_size, remote_oids = self._estimate_remote_size(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 305, in _estimate_remote_size
    remote_oids = set(iter_with_pbar(oids))
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 295, in iter_with_pbar
    for oid in oids:
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 262, in _oids_with_limit
    for i, oid in enumerate(self._list_oids(prefixes=prefixes, jobs=jobs), start=1):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 250, in _list_oids
    for path in self._list_prefixes(prefixes=prefixes, jobs=jobs):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 225, in _list_prefixes
    yield from self.fs.find(paths, batch_size=jobs, prefix=prefix)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 816, in find
    yield from self.fs.find(path, prefix=prefix_str)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 848, in _find
    out = await self._lsdir(path, delimiter="", prefix=prefix, **kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 736, in _lsdir
    raise translate_boto_error(e)
FileNotFoundError: The specified key does not exist.

2024-06-06 10:16:37,943 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/.nJecAj_lCq6vSRQavUKoxw.tmp'
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/.nJecAj_lCq6vSRQavUKoxw.tmp'
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/.nJecAj_lCq6vSRQavUKoxw.tmp'
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/ws-dvc/.dvc/cache/files/md5/.YY-I1sW7eTcDot5sfhN07Q.tmp'
2024-06-06 10:16:37,949 DEBUG: Version info for developers:
DVC version: 3.51.2 (pip)
-------------------------
Platform: Python 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.5
Supports:
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.5.0, boto3 = 1.34.106)
Config:
        Global: /home/admins/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme1n1
Caches: local
Remotes: oss, s3
Workspace directory: ext4 on /dev/nvme1n1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/4f9f0c30c341088cc84e9b8b312f7113

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2024-06-06 10:16:37,952 DEBUG: Analytics is enabled.
2024-06-06 10:16:37,952 TRACE: Saving analytics report to /tmp/tmphllltv1h
2024-06-06 10:16:37,993 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmphllltv1h', '-vv']
2024-06-06 10:16:38,001 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmphllltv1h', '-vv'] with pid 115727
2024-06-06 10:16:38,002 TRACE: Process 115714 exiting with 255
Wangsongming commented 2 months ago

I also have this problem. It looks like large files are uploaded to OSS with sharded (multipart) upload, but the push ends without waiting for the upload to return, so the large files never make it to the remote. I hope this can be fixed as soon as possible.
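
(A minimal sketch, not the actual ossfs code, of the failure mode behind that RuntimeWarning: the upload coroutine object is created but never awaited, so Python only warns and the upload work never runs.)

python3 - <<'EOF'
import asyncio

async def resumable_upload():   # stand-in for the real upload coroutine
    print("uploading...")       # never printed

async def push():
    resumable_upload()          # missing "await" -> RuntimeWarning, nothing is uploaded

asyncio.run(push())
EOF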

skshetry commented 2 months ago

Could be related to https://github.com/fsspec/ossfs/pull/129. Please file a bug upstream.

Wangsongming commented 2 months ago

I also hit this problem: large files are uploaded to OSS with sharded (multipart) upload, but the process ends without waiting for the upload to return, so large files never get uploaded. I hope this gets fixed soon.

Collecting |2.00 [00:00, 250entry/s]
Pushing
D:\python\lib\site-packages\ossfs\async_oss.py:388: RuntimeWarning: coroutine 'resumable_upload' was never awaited
  await self._call_oss(
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Pushing
2 files pushed

Wangsongming commented 2 months ago

I think this is a bug in the dvc-oss plugin

ws1336 commented 3 weeks ago

Has this problem been resolved?