Closed dberenbaum closed 2 months ago
For the record: I can reproduce this even with a small dataset from dvc-bench. Investigating further.
@efiop Does the example above work for you? I'm seeing it get a little further but still get stuck on fetching:
$ dvc pull -vv test2014.dvc
2023-11-17 13:15:41,213 DEBUG: v3.30.1, CPython 3.11.5 on macOS-14.1-arm64-arm-64bit
2023-11-17 13:15:41,213 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc pull -vv test2014.dvc
2023-11-17 13:15:41,213 TRACE: Namespace(quiet=0, verbose=2, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='pull', jobs=None, targets=['test2014.dvc'], remote=None, all_branches=False, all_tags=False, all_commits=False, force=False, with_deps=False, recursive=False, run_cache=False, glob=False, allow_missing=False, func=<class 'dvc.commands.data_sync.CmdDataPull'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-11-17 13:15:41,404 TRACE: params.yaml does not exist, it won't be used in parametrization
2023-11-17 13:15:41,406 TRACE: 16.60 ms in collecting stages from /private/tmp/download-dvc-dir
2023-11-17 13:15:41,414 DEBUG: Creating external repo git@github.com:dberenbaum/coco-sample.git@ad247281096a07d3c3ea417617bf68ba491d16cb
2023-11-17 13:15:41,414 DEBUG: erepo: git clone 'git@github.com:dberenbaum/coco-sample.git' to a temporary dir
2023-11-17 13:15:43,050 TRACE: 2.18 ms in collecting stages from /
2023-11-17 13:15:43,051 TRACE: 6.13 mks in collecting stages from /annotations
2023-11-17 13:15:43,062 DEBUG: failed to load ('test2014',) from storage local (/private/tmp/download-dvc-dir/.dvc/cache/files/md5) - [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/files/md5/5d/2fabe8cfc3f4246724d34bb9791f84.dir'
Traceback (most recent call last):
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 582, in _load_from_storage
_load_from_object_storage(trie, entry, storage)
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 518, in _load_from_object_storage
obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/tree.py", line 193, in load
with obj.fs.open(obj.path, "r") as fobj:
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 228, in open
return self.fs.open(path, mode=mode, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/local.py", line 136, in open
return open(path, mode=mode, encoding=encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/files/md5/5d/2fabe8cfc3f4246724d34bb9791f84.dir'
2023-11-17 13:15:43,068 DEBUG: failed to load ('test2014',) from storage local (/private/tmp/download-dvc-dir/.dvc/cache) - [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/5d/2fabe8cfc3f4246724d34bb9791f84.dir'
Traceback (most recent call last):
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 582, in _load_from_storage
_load_from_object_storage(trie, entry, storage)
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 518, in _load_from_object_storage
obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/tree.py", line 193, in load
with obj.fs.open(obj.path, "r") as fobj:
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 228, in open
return self.fs.open(path, mode=mode, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/local.py", line 136, in open
return open(path, mode=mode, encoding=encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/5d/2fabe8cfc3f4246724d34bb9791f84.dir'
2023-11-17 13:15:53,634 DEBUG: Creating external repo git@github.com:iterative/lstm_seq2seq@8aa13ed31971eae16e4148cc0cd2c62fa65c38d0
2023-11-17 13:15:53,635 DEBUG: erepo: git clone 'git@github.com:iterative/lstm_seq2seq' to a temporary dir
2023-11-17 13:15:55,713 TRACE: Context during resolution of stage download:
{'model': {'batch_size': 512, 'latent_dim': 8, 'duration': '00:00:30:00', 'max_epochs': 2, 'optim': {'lr': 0.01}}, 'data_path': 'fra.txt', 'num_samples': 1013, 'seed': 423}
2023-11-17 13:15:55,724 TRACE: Context during resolution of stage train:
{'model': {'batch_size': 512, 'latent_dim': 8, 'duration': '00:00:30:00', 'max_epochs': 2, 'optim': {'lr': 0.01}}, 'data_path': 'fra.txt', 'num_samples': 1013, 'seed': 423}
2023-11-17 13:15:55,725 TRACE: 16.33 ms in collecting stages from /
2023-11-17 13:15:55,725 TRACE: 1.87 mks in collecting stages from /.github
2023-11-17 13:15:55,726 TRACE: 1.67 mks in collecting stages from /.github/workflows
2023-11-17 13:15:55,726 TRACE: 2.25 mks in collecting stages from /conf
2023-11-17 13:15:55,726 TRACE: 1.87 mks in collecting stages from /conf/model
2023-11-17 13:15:55,726 TRACE: 2.62 mks in collecting stages from /results
Collecting |40.8k [00:14, 2.85kentry/s]
2023-11-17 13:15:56,344 DEBUG: Preparing to transfer data from 's3://dave-sandbox/cache' to '/private/tmp/download-dvc-dir/.dvc/cache'
2023-11-17 13:15:56,344 DEBUG: Preparing to collect status from '/private/tmp/download-dvc-dir/.dvc/cache'
2023-11-17 13:15:56,345 DEBUG: Collecting status from '/private/tmp/download-dvc-dir/.dvc/cache'
2023-11-17 13:15:56,823 DEBUG: Preparing to transfer data from 's3://dave-sandbox/cache/files/md5' to '/private/tmp/download-dvc-dir/.dvc/cache/files/md5'
2023-11-17 13:15:56,823 DEBUG: Preparing to collect status from '/private/tmp/download-dvc-dir/.dvc/cache/files/md5'
2023-11-17 13:15:56,824 DEBUG: Collecting status from '/private/tmp/download-dvc-dir/.dvc/cache/files/md5'
Fetching
I've modified it to work with dvc-bench to make it quicker for me, but it looks like I might've missed something. Let me try again.
So I was testing with a slightly different setup, in the sense that the dataset in the data registry (not dvc-bench but a derived local one) was a new one with the hash: md5 field, while your coco-sample is an old-school one, so Meta didn't know how to load md5-dos2unix properly. So this is kind of a 3.x migration problem that we ran into here, in addition to the one that got fixed. Working on a fix.
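For anyone following along, here is a rough sketch of the distinction, not DVC's actual code. It assumes that an output entry carrying hash: md5 is a 3.x-style entry whose objects live under files/md5 in the cache, and that an entry without that field is a legacy md5-dos2unix one stored directly under the cache root. That layout is inferred from the two paths tried in the log above. The helper names `hash_name_for` and `cache_path` are hypothetical.

```python
from pathlib import PurePosixPath

LEGACY_HASH = "md5-dos2unix"  # hash name assumed for entries without a `hash` field


def hash_name_for(out_entry: dict) -> str:
    """Guess the hash name implied by a .dvc output entry (hypothetical helper)."""
    return "md5" if out_entry.get("hash") == "md5" else LEGACY_HASH


def cache_path(cache_dir: str, out_entry: dict) -> str:
    """Build the cache path where the object for this entry would be looked up."""
    value = out_entry["md5"]
    root = PurePosixPath(cache_dir)
    if hash_name_for(out_entry) == "md5":
        # 3.x-style objects sit under <cache>/files/md5 (as seen in the log)
        root = root / "files" / "md5"
    # Objects are sharded by the first two characters of the hash value
    return str(root / value[:2] / value[2:])


if __name__ == "__main__":
    legacy = {"md5": "5d2fabe8cfc3f4246724d34bb9791f84.dir", "path": "test2014"}
    new_style = dict(legacy, hash="md5")
    cache = "/private/tmp/download-dvc-dir/.dvc/cache"
    print(hash_name_for(legacy), cache_path(cache, legacy))
    print(hash_name_for(new_style), cache_path(cache, new_style))
```

With the .dir hash from the log, the two entries resolve to the two paths that show up in the FileNotFoundError messages, which is why the same object is looked up in both locations.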
@efiop Any status update on this?
We've discussed this, but for the record: the only thing left here is cross-hash compatibility, which I'm in no rush to implement, as I'm still waiting for user feedback on whether this was enough to fix it for them or not (can't find a link yet, but will post if I do find it).
Closing as stale
Copied from slack
I’m able to reproduce it using the aws sandbox:
This pulls data imported from git@github.com:dberenbaum/coco-sample.git. When pulling directly from the source repo, it starts to pull fast, but pulling from download-dvc-dir gets stuck here for a long time: