iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

`import`: pull performance #10059

Closed (dberenbaum closed this 2 months ago)

dberenbaum commented 1 year ago

Copied from Slack:

I’m able to reproduce it using the AWS sandbox:

$ git clone git@github.com:dberenbaum/download-dvc-dir.git
$ cd download-dvc-dir
$ dvc pull test2014

This pulls data imported from git@github.com:dberenbaum/coco-sample.git. Pulling directly from the source repo starts transferring quickly, but pulling from download-dvc-dir gets stuck here for a long time:

$ dvc pull -vv test2014.dvc
2023-10-26 08:06:39,751 DEBUG: v3.27.1.dev6+g4a0d56a79.d20231020, CPython 3.11.5 on macOS-14.0-arm64-arm-64bit
2023-10-26 08:06:39,751 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc pull -vv test2014.dvc
2023-10-26 08:06:39,751 TRACE: Namespace(quiet=0, verbose=2, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='pull', jobs=None, targets=['test2014.dvc'], remote=None, all_branches=False, all_tags=False, all_commits=False, force=False, with_deps=False, recursive=False, run_cache=False, glob=False, allow_missing=False, func=<class 'dvc.commands.data_sync.CmdDataPull'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-10-26 08:06:39,969 TRACE: params.yaml does not exist, it won't be used in parametrization
2023-10-26 08:06:39,971 TRACE:    16.24 ms in collecting stages from /Users/dave/Code/download-dvc-dir
2023-10-26 08:06:39,979 DEBUG: Creating external repo git@github.com:dberenbaum/coco-sample.git@ad247281096a07d3c3ea417617bf68ba491d16cb
2023-10-26 08:06:39,979 DEBUG: erepo: git clone 'git@github.com:dberenbaum/coco-sample.git' to a temporary dir
2023-10-26 08:06:42,722 TRACE:     1.91 ms in collecting stages from /
2023-10-26 08:06:42,723 TRACE:     6.08 mks in collecting stages from /annotations
2023-10-26 08:06:42,738 DEBUG: Creating external repo git@github.com:iterative/lstm_seq2seq@8aa13ed31971eae16e4148cc0cd2c62fa65c38d0
2023-10-26 08:06:42,738 DEBUG: erepo: git clone 'git@github.com:iterative/lstm_seq2seq' to a temporary dir
2023-10-26 08:06:46,391 TRACE: Context during resolution of stage download:
{'model': {'batch_size': 512, 'latent_dim': 8, 'duration': '00:00:30:00', 'max_epochs': 2, 'optim': {'lr': 0.01}}, 'data_path': 'fra.txt', 'num_samples': 1013, 'seed': 423}
2023-10-26 08:06:46,481 TRACE: Context during resolution of stage train:
{'model': {'batch_size': 512, 'latent_dim': 8, 'duration': '00:00:30:00', 'max_epochs': 2, 'optim': {'lr': 0.01}}, 'data_path': 'fra.txt', 'num_samples': 1013, 'seed': 423}
2023-10-26 08:06:46,482 TRACE:    95.37 ms in collecting stages from /
2023-10-26 08:06:46,482 TRACE:     1.63 mks in collecting stages from /.github
2023-10-26 08:06:46,482 TRACE:     1.63 mks in collecting stages from /.github/workflows
2023-10-26 08:06:46,482 TRACE:     2.25 mks in collecting stages from /conf
2023-10-26 08:06:46,482 TRACE:     1.83 mks in collecting stages from /conf/model
2023-10-26 08:06:46,482 TRACE:     2.67 mks in collecting stages from /results
Collecting                                                     |0.00 [00:06,    ?entry/s]
2023-10-26 08:06:47,627 DEBUG: Preparing to transfer data from 's3://dave-sandbox/cache' to '/Users/dave/Code/download-dvc-dir/.dvc/cache'
2023-10-26 08:06:47,627 DEBUG: Preparing to collect status from '/Users/dave/Code/download-dvc-dir/.dvc/cache'
2023-10-26 08:06:47,627 DEBUG: Collecting status from '/Users/dave/Code/download-dvc-dir/.dvc/cache'
2023-10-26 08:06:48,586 DEBUG: Preparing to transfer data from 's3://dave-sandbox/cache/files/md5' to '/Users/dave/Code/download-dvc-dir/.dvc/cache/files/md5'
2023-10-26 08:06:48,586 DEBUG: Preparing to collect status from '/Users/dave/Code/download-dvc-dir/.dvc/cache/files/md5'
2023-10-26 08:06:48,586 DEBUG: Collecting status from '/Users/dave/Code/download-dvc-dir/.dvc/cache/files/md5'
2023-10-26 08:06:48,858 DEBUG: failed to load ('test2014',) from storage local (/Users/dave/Code/download-dvc-dir/.dvc/cache) - [Errno 2] No such file or directory: '/Users/dave/Code/download-dvc-dir/.dvc/cache/5d/2fabe8cfc3f4246724d34bb9791f84.dir'
Traceback (most recent call last):
  File "/Users/dave/Code/dvc-data/src/dvc_data/index/index.py", line 552, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/Users/dave/Code/dvc-data/src/dvc_data/index/index.py", line 488, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc-data/src/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/local.py", line 136, in open
    return open(path, mode=mode, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/Users/dave/Code/download-dvc-dir/.dvc/cache/5d/2fabe8cfc3f4246724d34bb9791f84.dir'

Fetching

efiop commented 12 months ago

For the record: I can reproduce this even with a small dataset from dvc-bench. Investigating further.

dberenbaum commented 12 months ago

@efiop Does the example above work for you? I'm seeing it get a little further, but it still gets stuck on fetching:

$ dvc pull -vv test2014.dvc
2023-11-17 13:15:41,213 DEBUG: v3.30.1, CPython 3.11.5 on macOS-14.1-arm64-arm-64bit
2023-11-17 13:15:41,213 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc pull -vv test2014.dvc
2023-11-17 13:15:41,213 TRACE: Namespace(quiet=0, verbose=2, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='pull', jobs=None, targets=['test2014.dvc'], remote=None, all_branches=False, all_tags=False, all_commits=False, force=False, with_deps=False, recursive=False, run_cache=False, glob=False, allow_missing=False, func=<class 'dvc.commands.data_sync.CmdDataPull'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-11-17 13:15:41,404 TRACE: params.yaml does not exist, it won't be used in parametrization
2023-11-17 13:15:41,406 TRACE:    16.60 ms in collecting stages from /private/tmp/download-dvc-dir
2023-11-17 13:15:41,414 DEBUG: Creating external repo git@github.com:dberenbaum/coco-sample.git@ad247281096a07d3c3ea417617bf68ba491d16cb
2023-11-17 13:15:41,414 DEBUG: erepo: git clone 'git@github.com:dberenbaum/coco-sample.git' to a temporary dir
2023-11-17 13:15:43,050 TRACE:     2.18 ms in collecting stages from /
2023-11-17 13:15:43,051 TRACE:     6.13 mks in collecting stages from /annotations
2023-11-17 13:15:43,062 DEBUG: failed to load ('test2014',) from storage local (/private/tmp/download-dvc-dir/.dvc/cache/files/md5) - [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/files/md5/5d/2fabe8cfc3f4246724d34bb9791f84.dir'
Traceback (most recent call last):
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 582, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 518, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/local.py", line 136, in open
    return open(path, mode=mode, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/files/md5/5d/2fabe8cfc3f4246724d34bb9791f84.dir'

2023-11-17 13:15:43,068 DEBUG: failed to load ('test2014',) from storage local (/private/tmp/download-dvc-dir/.dvc/cache) - [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/5d/2fabe8cfc3f4246724d34bb9791f84.dir'
Traceback (most recent call last):
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 582, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 518, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/local.py", line 136, in open
    return open(path, mode=mode, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/5d/2fabe8cfc3f4246724d34bb9791f84.dir'

2023-11-17 13:15:53,634 DEBUG: Creating external repo git@github.com:iterative/lstm_seq2seq@8aa13ed31971eae16e4148cc0cd2c62fa65c38d0
2023-11-17 13:15:53,635 DEBUG: erepo: git clone 'git@github.com:iterative/lstm_seq2seq' to a temporary dir
2023-11-17 13:15:55,713 TRACE: Context during resolution of stage download:
{'model': {'batch_size': 512, 'latent_dim': 8, 'duration': '00:00:30:00', 'max_epochs': 2, 'optim': {'lr': 0.01}}, 'data_path': 'fra.txt', 'num_samples': 1013, 'seed': 423}
2023-11-17 13:15:55,724 TRACE: Context during resolution of stage train:
{'model': {'batch_size': 512, 'latent_dim': 8, 'duration': '00:00:30:00', 'max_epochs': 2, 'optim': {'lr': 0.01}}, 'data_path': 'fra.txt', 'num_samples': 1013, 'seed': 423}
2023-11-17 13:15:55,725 TRACE:    16.33 ms in collecting stages from /
2023-11-17 13:15:55,725 TRACE:     1.87 mks in collecting stages from /.github
2023-11-17 13:15:55,726 TRACE:     1.67 mks in collecting stages from /.github/workflows
2023-11-17 13:15:55,726 TRACE:     2.25 mks in collecting stages from /conf
2023-11-17 13:15:55,726 TRACE:     1.87 mks in collecting stages from /conf/model
2023-11-17 13:15:55,726 TRACE:     2.62 mks in collecting stages from /results
Collecting                                                   |40.8k [00:14, 2.85kentry/s]
2023-11-17 13:15:56,344 DEBUG: Preparing to transfer data from 's3://dave-sandbox/cache' to '/private/tmp/download-dvc-dir/.dvc/cache'
2023-11-17 13:15:56,344 DEBUG: Preparing to collect status from '/private/tmp/download-dvc-dir/.dvc/cache'
2023-11-17 13:15:56,345 DEBUG: Collecting status from '/private/tmp/download-dvc-dir/.dvc/cache'
2023-11-17 13:15:56,823 DEBUG: Preparing to transfer data from 's3://dave-sandbox/cache/files/md5' to '/private/tmp/download-dvc-dir/.dvc/cache/files/md5'
2023-11-17 13:15:56,823 DEBUG: Preparing to collect status from '/private/tmp/download-dvc-dir/.dvc/cache/files/md5'
2023-11-17 13:15:56,824 DEBUG: Collecting status from '/private/tmp/download-dvc-dir/.dvc/cache/files/md5'
Fetching

efiop commented 12 months ago

I modified it to work with dvc-bench to make it quicker for me, but it looks like I might have missed something. Let me try again.

efiop commented 12 months ago

So I was testing with a slightly different setup: the dataset in my data registry (not dvc-bench itself, but a local one derived from it) was a new one with the `hash: md5` field, while your coco-sample is an old-style one without it. Because of that, Meta didn't know how to load md5-dos2unix properly. So what we ran into here is a 3.x migration problem, in addition to the one that already got fixed. Working on a fix.
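
To make the difference between the two kinds of entries concrete, here is a minimal sketch. It is not DVC's actual code: `hash_name_for` and `cache_path` are hypothetical helpers, and the rule they encode (entries with `hash: md5` live under `files/md5`, entries without it are treated as legacy md5-dos2unix and live directly under the cache root) is an assumption inferred from the two paths that appear in the debug logs above.

```python
from pathlib import PurePosixPath


def hash_name_for(out: dict) -> str:
    """Hypothetical rule: 3.x entries carry `hash: md5`, older ones do not."""
    return "md5" if out.get("hash") == "md5" else "md5-dos2unix"


def cache_path(cache_dir: str, out: dict) -> str:
    """Build the object path for an output entry (illustration only)."""
    digest = out["md5"]
    root = PurePosixPath(cache_dir)
    if hash_name_for(out) == "md5":
        # Assumed 3.x layout, as seen in the log: <cache>/files/md5/<2>/<rest>
        root = root / "files" / "md5"
    # Assumed legacy layout, as seen in the log: <cache>/<2>/<rest>
    return str(root / digest[:2] / digest[2:])


# Digest taken from the paths in the logs above.
legacy_out = {"md5": "5d2fabe8cfc3f4246724d34bb9791f84.dir", "path": "test2014"}
new_out = {**legacy_out, "hash": "md5"}

print(hash_name_for(legacy_out), cache_path(".dvc/cache", legacy_out))
print(hash_name_for(new_out), cache_path(".dvc/cache", new_out))
```

The two printed paths match the two locations that the `failed to load` messages report, which is why an old-style import and a new-style import of the same data can behave differently.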

dberenbaum commented 11 months ago

@efiop Any status update on this?

efiop commented 11 months ago

We've discussed this, but for the record: the only thing left here is cross-hash compatibility, which I'm in no rush to implement, as I'm still waiting for user feedback on whether this was enough to fix it for them or not (I can't find a link yet, but will post it if I do).
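
As a rough illustration of why cross-hash compatibility needs separate work, the sketch below compares a plain MD5 with a line-ending-normalized one. It assumes the legacy md5-dos2unix scheme hashes text content after converting CRLF to LF; the function names are made up for this example, and text-file detection is left out.

```python
import hashlib


def md5_raw(data: bytes) -> str:
    """Plain MD5 over the bytes exactly as stored."""
    return hashlib.md5(data).hexdigest()


def md5_dos2unix(data: bytes) -> str:
    """MD5 after CRLF -> LF normalization (assumed legacy behaviour)."""
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


crlf = b"a,b\r\n1,2\r\n"
lf = b"a,b\n1,2\n"

# Same logical content, but the two schemes disagree for the CRLF file.
print(md5_raw(crlf) == md5_dos2unix(crlf))
# The normalized digest of the CRLF file equals the plain digest of the LF file.
print(md5_dos2unix(crlf) == md5_raw(lf))
```

Under that assumption, an object recorded under one hash name cannot always be looked up by the digest computed under the other, so some mapping between the two would be needed.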

dberenbaum commented 2 months ago

Closing as stale