iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

missing large file in remote storage after pushing #10448

Open xiaoFine opened 3 weeks ago

xiaoFine commented 3 weeks ago

Bug Report

push: large files are missing in remote storage

Description

After dvc push, large files (single file > 20GB) are missing from the remote storage (Aliyun OSS), while the md5 objects of small files are pushed successfully and can be found under the OSS path.

Reproduce

dvc init -f
dvc remote add myoss oss://mybucket/path -d
dvc remote modify myoss oss_endpoint somepublicendpoint
dvc remote modify myoss oss_key_id xxxx
dvc remote modify myoss oss_key_secret xxxxxxxx

dvc add large-chkpoint.pt

dvc push

Expected

I can find the md5 object for large-chkpoint.pt via the OSS dashboard.
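To know which object to look for in the OSS dashboard, it can help to compute the key by hand. The sketch below is only an illustration, not DVC's own code: it assumes the DVC 3.x remote layout of `files/md5/<first two hex chars>/<remaining chars>` under the remote path (consistent with the `files/md5` paths that DVC prints in verbose push logs) and a plain md5 over the file bytes. The helper names `file_md5` and `expected_remote_key` are hypothetical.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex md5 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_remote_key(remote_prefix, md5_hex):
    """Build the object key under the ASSUMED DVC 3.x layout: files/md5/<2 chars>/<rest>."""
    return f"{remote_prefix.rstrip('/')}/files/md5/{md5_hex[:2]}/{md5_hex[2:]}"


# Small demo with a temporary file standing in for large-chkpoint.pt.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_md5 = file_md5(tmp.name)
os.unlink(tmp.name)

demo_key = expected_remote_key("path", demo_md5)
print(demo_key)  # path/files/md5/5d/41402abc4b2a76b9719d911017c592
```

The same md5 is also recorded in the generated large-chkpoint.pt.dvc file, so reading it from there avoids rehashing a 20GB file.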

Environment information

Output of dvc doctor:

DVC version: 3.51.2 (pip)
-------------------------
Platform: Python 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.5
Supports:
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0)
Config:
        Global: /home/admins/.config/dvc
        System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: oss
Workspace directory: ext4 on /dev/nvme1n1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/1dec9b5bdab7926326d2cb372ee9b553

Additional Information (if any):

output of pushing log

> dvc push -vvv 
2024-06-04 16:56:23,537 DEBUG: v3.51.2 (pip), CPython 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
2024-06-04 16:56:23,538 DEBUG: command: /home/admins/miniconda3/envs/dvcenv/bin/dvc push -vvv
2024-06-04 16:56:23,538 TRACE: Namespace(quiet=0, verbose=3, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='push', jobs=9, targets=['triton/tensorrt_llm/1/rank0.engine'], remote='oss-qwen', all_branches=False, all_tags=False, all_commits=False, with_deps=False, recursive=False, run_cache=True, glob=False, func=<class 'dvc.commands.data_sync.CmdDataPush'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'dvc.cli.formatter.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2024-06-04 16:56:23,758 TRACE:     1.31 ms in collecting stages from /ws
2024-06-04 16:56:23,758 TRACE:   253.99 mks in collecting stages from /ws
...

2024-06-04 16:56:23,773 DEBUG: Checking if stage 'large-chckpoint.pt' is in 'dvc.yaml'
Collecting                                                                                                                                                 |1.00 [00:00,  135entry/s]
2024-06-04 16:56:23,889 DEBUG: Preparing to transfer data from '/ws/.dvc/cache' to 'oss://mybucket/path'
2024-06-04 16:56:23,889 DEBUG: Preparing to collect status from 'mybucket/path'
2024-06-04 16:56:23,889 DEBUG: Collecting status from 'mybucket/path'
2024-06-04 16:56:23,891 DEBUG: Querying 1 oids via object_exists
2024-06-04 16:56:24,228 DEBUG: Preparing to collect status from '/ws/.dvc/cache'                                                                       
2024-06-04 16:56:24,229 DEBUG: Collecting status from '/ws/.dvc/cache'                                                                                 
Pushing                                                                                                                                                                             /home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/ossfs/async_oss.py:389: RuntimeWarning: coroutine 'resumable_upload' was never awaited     0/1 [00:00<?,     ?file/s]
  await self._call_oss(
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Pushing
1 file pushed                                                                                                                                                                        
2024-06-04 16:56:24,292 DEBUG: Analytics is enabled.
2024-06-04 16:56:24,292 TRACE: Saving analytics report to /tmp/tmptx47o8pe
2024-06-04 16:56:24,354 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmptx47o8pe', '-vv']
2024-06-04 16:56:24,361 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmptx47o8pe', '-vv'] with pid 27977
2024-06-04 16:56:24,361 TRACE: Process 27869 exiting with 0

Is dvc-oss no longer maintained?

shcheklein commented 3 weeks ago

Is dvc-oss no longer maintained?

It was primarily maintained by @karajan1001 . I would appreciate his input here.

As a workaround, could you try an S3-compatible interface - https://www.alibabacloud.com/help/en/oss/developer-reference/use-amazon-s3-sdks-to-access-oss ?

https://dvc.org/doc/user-guide/data-management/remote-storage/amazon-s3#s3-compatible-servers-non-amazon

output of pushing log

Hmm, I don't see any details in the logs. Do you see any md5s / hashes for the files that are missing remotely? Is this the full log?

Could you try deleting /var/tmp/dvc/repo/1dec9b5bdab7926326d2cb372ee9b553 and running the command again in verbose mode?

xiaoFine commented 3 weeks ago

I created an empty workspace with a large.bin and a small.txt, and deleted all cache in `/var/tmp/dvc/repo`. The push log is attached as an image.

Only the small file can be found in the remote.

P.S. The RuntimeWarning does not appear when pushing only small files.
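For context on why that warning matters: a "coroutine ... was never awaited" warning means the coroutine's body never ran. If the large-file path creates an upload coroutine without awaiting it, the push can report success while nothing is transferred. The sketch below only illustrates that mechanism. It is not the ossfs code, and `fake_resumable_upload`, `buggy_put` and `fixed_put` are hypothetical stand-ins.

```python
import asyncio
import gc
import warnings

uploaded = {}  # stands in for the remote bucket: key -> size in bytes


async def fake_resumable_upload(key, data):
    """Hypothetical stand-in for a multipart upload coroutine."""
    await asyncio.sleep(0)
    uploaded[key] = len(data)


async def buggy_put(key, data):
    # Bug: the coroutine object is created but never awaited, so its body never runs.
    fake_resumable_upload(key, data)


async def fixed_put(key, data):
    # Awaiting the coroutine actually performs the upload.
    await fake_resumable_upload(key, data)


def run_and_collect_warnings(put, key, data):
    """Run one put and return the text of any warnings raised while it ran."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        asyncio.run(put(key, data))
        gc.collect()  # make sure the discarded coroutine has been finalized
    return [str(w.message) for w in caught]


buggy_warnings = run_and_collect_warnings(buggy_put, "large.bin", b"x" * 1024)
fixed_warnings = run_and_collect_warnings(fixed_put, "small.txt", b"y" * 16)

print(sorted(uploaded))  # only the awaited upload stored anything
print(buggy_warnings)
```

The buggy path finishes without raising an exception, which would match a push that prints "1 file pushed" while the object never reaches the bucket.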

-- I am still working on the S3 approach, which has a compatibility problem: ListObjectsV2 is called regardless of whether listobjects is set to true or false.

(dvcenv) admins@test-Ai-largemodel:/mnt/datadisk1/laien/ws-dvc$ dvc push -r oss-s3 -vvv
2024-06-06 10:16:37,283 DEBUG: v3.51.2 (pip), CPython 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
2024-06-06 10:16:37,283 DEBUG: command: /home/admins/miniconda3/envs/dvcenv/bin/dvc push -r oss-s3 -vvv
2024-06-06 10:16:37,283 TRACE: Namespace(quiet=0, verbose=3, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='push', jobs=None, targets=[], remote='oss-s3', all_branches=False, all_tags=False, all_commits=False, with_deps=False, recursive=False, run_cache=True, glob=False, func=<class 'dvc.commands.data_sync.CmdDataPush'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'dvc.cli.formatter.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2024-06-06 10:16:37,519 TRACE:    12.48 ms in collecting stages from /mnt/datadisk1/laien/ws-dvc
Collecting                                                                                                                                                                  |0.00 [00:00,    ?entry/s]
2024-06-06 10:16:37,542 DEBUG: Preparing to transfer data from '/mnt/datadisk1/laien/ws-dvc/.dvc/cache/files/md5' to 's3://[remote-path]/dvc/files/md5'
2024-06-06 10:16:37,542 DEBUG: Preparing to collect status from '[remote-path]/dvc/files/md5'
2024-06-06 10:16:37,542 DEBUG: Collecting status from '[remote-path]/dvc/files/md5'
Pushing          '[remote-path]/dvc/files/md5'|                                                                                                                   |0/? [00:00<?,    ?files/s]
Pushing
2024-06-06 10:16:37,896 ERROR: unexpected error - The specified key does not exist.: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist.        
Traceback (most recent call last):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 723, in _lsdir
    async for c in self._iterdir(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 773, in _iterdir
    async for i in it:
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/aiobotocore/paginate.py", line 30, in __anext__
    response = await self._make_request(current_kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/aiobotocore/client.py", line 412, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the ListObjectsV2 operation: The specified key does not exist.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run
    return self.run()
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 64, in run
    processed_files_count = self.repo.push(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc/repo/push.py", line 147, in push
    push_transferred, push_failed = ipush(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/index/push.py", line 76, in push
    result = transfer(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/hashfile/transfer.py", line 203, in transfer
    status = compare_status(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 179, in compare_status
    dest_exists, dest_missing = status(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 151, in status
    exists.update(odb.oids_exist(hashes, jobs=jobs, progress=pbar.callback))
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 423, in oids_exist
    remote_size, remote_oids = self._estimate_remote_size(
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 305, in _estimate_remote_size
    remote_oids = set(iter_with_pbar(oids))
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 295, in iter_with_pbar
    for oid in oids:
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 262, in _oids_with_limit
    for i, oid in enumerate(self._list_oids(prefixes=prefixes, jobs=jobs), start=1):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 250, in _list_oids
    for path in self._list_prefixes(prefixes=prefixes, jobs=jobs):
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/db.py", line 225, in _list_prefixes
    yield from self.fs.find(paths, batch_size=jobs, prefix=prefix)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 816, in find
    yield from self.fs.find(path, prefix=prefix_str)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 848, in _find
    out = await self._lsdir(path, delimiter="", prefix=prefix, **kwargs)
  File "/home/admins/miniconda3/envs/dvcenv/lib/python3.10/site-packages/s3fs/core.py", line 736, in _lsdir
    raise translate_boto_error(e)
FileNotFoundError: The specified key does not exist.

2024-06-06 10:16:37,943 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/.nJecAj_lCq6vSRQavUKoxw.tmp'
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/.nJecAj_lCq6vSRQavUKoxw.tmp'
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/.nJecAj_lCq6vSRQavUKoxw.tmp'
2024-06-06 10:16:37,943 DEBUG: Removing '/mnt/datadisk1/laien/ws-dvc/.dvc/cache/files/md5/.YY-I1sW7eTcDot5sfhN07Q.tmp'
2024-06-06 10:16:37,949 DEBUG: Version info for developers:
DVC version: 3.51.2 (pip)
-------------------------
Platform: Python 3.10.14 on Linux-4.15.0-213-generic-x86_64-with-glibc2.27
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.5
Supports:
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.5.0, boto3 = 1.34.106)
Config:
        Global: /home/admins/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme1n1
Caches: local
Remotes: oss, s3
Workspace directory: ext4 on /dev/nvme1n1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/4f9f0c30c341088cc84e9b8b312f7113

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2024-06-06 10:16:37,952 DEBUG: Analytics is enabled.
2024-06-06 10:16:37,952 TRACE: Saving analytics report to /tmp/tmphllltv1h
2024-06-06 10:16:37,993 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmphllltv1h', '-vv']
2024-06-06 10:16:38,001 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmphllltv1h', '-vv'] with pid 115727
2024-06-06 10:16:38,002 TRACE: Process 115714 exiting with 255