iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

[dvc pull version >= 3.24] ERROR: unexpected error - failed to load directory #10030

Closed themaikelman closed 7 months ago

themaikelman commented 10 months ago

Bug Report

pull: ERROR, doesn't create the cache directory and crashes.

Description

Collecting |0.00 [00:00, ?entry/s]

Fetching

ERROR: unexpected error - failed to load directory ('60', '36c8668869290419aec048f26f8deb.dir'): [Errno 2] No such file or directory: '/mnt/data2/users/myuser/project/.dvc/cache/files/md5/60/36c8668869290419aec048f26f8deb.dir'
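The missing path in this error follows from the directory hash: the first two hex characters of the MD5 become a subdirectory, and the remainder plus the `.dir` suffix becomes the file name. A minimal Python sketch of that mapping, inferred only from the paths in this report (the helper name `cache_object_path` is hypothetical and not part of DVC's API):

```python
from pathlib import PurePosixPath


def cache_object_path(cache_root: str, md5: str, is_dir: bool = False) -> PurePosixPath:
    """Build the object path as it appears in the error message.

    Hypothetical helper: the layout (files/md5/<2 chars>/<rest>) is inferred
    from the reported path, not taken from DVC's source code.
    """
    suffix = ".dir" if is_dir else ""
    return PurePosixPath(cache_root) / "files" / "md5" / md5[:2] / (md5[2:] + suffix)


path = cache_object_path(
    "/mnt/data2/users/myuser/project/.dvc/cache",
    "6036c8668869290419aec048f26f8deb",
    is_dir=True,
)
print(path)
```

The key tuple `('60', '36c8668869290419aec048f26f8deb.dir')` in the message is the same split: prefix directory first, object file name second.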

With DVC version 3.23.0 it works OK! DVC >= 3.24 fails

Environment information

Output of dvc doctor:

$ dvc doctor

DVC version: 3.24.0 (pip)

-------------------------

Platform: Python 3.10.12 on Linux-5.15.0-84-generic-x86_64-with-glibc2.35

Subprojects:
        dvc_data = 2.18.1
        dvc_objects = 1.0.1
        dvc_render = 0.6.0
        dvc_task = 0.3.0
        scmrepo = 1.4.0

Supports:
        http (aiohttp = 3.8.6, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.6, aiohttp-retry = 2.8.3)

Config:
        Global: /mnt/data2/users/myuser/.config/dvc
        System: /etc/xdg/dvc

Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme2n1
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/nvme2n1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/fb1190580eb02ee00297b6211011d5af

efiop commented 10 months ago

@atekoa Please run dvc pull -v and post full log.

themaikelman commented 10 months ago

$ dvc pull -v

2023-10-18 12:16:18,044 DEBUG: v3.24.0 (pip), CPython 3.10.12 on Linux-5.15.0-84-generic-x86_64-with-glibc2.35
2023-10-18 12:16:18,044 DEBUG: command: /mnt/data2/users/myuser/project/env/bin/dvc pull -v
Collecting                                                                                                                                                                                  |0.00 [00:00,    ?entry/s]
2023-10-18 12:16:20,871 DEBUG: failed to load ('60', '36c8668869290419aec048f26f8deb.dir') from storage https (https://my-http-remote?remote=5332/files/md5) - https://my-http-remote?remote=5332/files/md5/60/36c8668869290419aec048f26f8deb.dir: 404, message='Not Found', url=URL('https://my-http-remote?remote=5332/files/md5/60/36c8668869290419aec048f26f8deb.dir')
Traceback (most recent call last):
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/fsspec/implementations/http.py", line 414, in _info
    await _file_info(
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/fsspec/implementations/http.py", line 849, in _file_info
    r.raise_for_status()
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1011, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found', url=URL('https://my-http-remote?remote=5332/files/md5/60/36c8668869290419aec048f26f8deb.dir')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 552, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 488, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/fsspec/spec.py", line 1297, in open
    self.open(
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/fsspec/spec.py", line 1309, in open
    f = self._open(
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/fsspec/implementations/http.py", line 353, in _open
    size = size or self.info(path, **kwargs)["size"]
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/fsspec/implementations/http.py", line 427, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://my-http-remote?remote=5332/files/md5/60/36c8668869290419aec048f26f8deb.dir

2023-10-18 12:16:20,873 DEBUG: failed to load ('60', '36c8668869290419aec048f26f8deb.dir') from storage local (/mnt/data2/users/myuser/project/.dvc/cache/files/md5) - [Errno 2] No such file or directory: '/mnt/data2/users/myuser/project/.dvc/cache/files/md5/60/36c8668869290419aec048f26f8deb.dir'
Traceback (most recent call last):
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 552, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 488, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 136, in open
    return open(path, mode=mode, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data2/users/myuser/project/.dvc/cache/files/md5/60/36c8668869290419aec048f26f8deb.dir'

Fetching
2023-10-18 12:16:20,874 ERROR: unexpected error - failed to load directory ('60', '36c8668869290419aec048f26f8deb.dir'): [Errno 2] No such file or directory: '/mnt/data2/users/myuser/project/.dvc/cache/files/md5/60/36c8668869290419aec048f26f8deb.dir'
Traceback (most recent call last):
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 552, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 488, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 136, in open
    return open(path, mode=mode, encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data2/users/myuser/project/.dvc/cache/files/md5/60/36c8668869290419aec048f26f8deb.dir'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc/cli/__init__.py", line 209, in main
    ret = cmd.do_run()
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc/cli/command.py", line 26, in do_run
    return self.run()
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 35, in run
    stats = self.repo.pull(
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc/repo/__init__.py", line 61, in wrapper
    return f(repo, *args, **kwargs)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc/repo/pull.py", line 31, in pull
    processed_files_count = self.fetch(
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc/repo/__init__.py", line 61, in wrapper
    return f(repo, *args, **kwargs)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc/repo/fetch.py", line 164, in fetch
    fetch_transferred, fetch_failed = ifetch(
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/fetch.py", line 65, in fetch
    [
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/fetch.py", line 65, in <listcomp>
    [
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 691, in iteritems
    self._load(key, entry)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 647, in _load
    self.onerror(entry, exc)
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 579, in _onerror
    raise exc
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 645, in _load
    _load_from_storage(self._trie, entry, self.storage_map[key])
  File "/mnt/data2/users/myuser/project/env/lib/python3.10/site-packages/dvc_data/index/index.py", line 567, in _load_from_storage
    raise DataIndexDirError(f"failed to load directory {entry.key}") from last_exc
dvc_data.index.index.DataIndexDirError: failed to load directory ('60', '36c8668869290419aec048f26f8deb.dir')
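
Read top to bottom, the log shows the directory object being looked up in each storage in turn (first the https remote, then the local cache). Only when every lookup has failed is DataIndexDirError raised, chained to the last underlying error. A simplified Python sketch of that pattern follows; `load_dir` and the storage callables are stand-ins invented for illustration, not dvc_data's real code, which may catch a broader set of errors:

```python
from typing import Callable, Iterable, Optional


class DataIndexDirError(Exception):
    """Stand-in for the exception class named in the traceback."""


def load_dir(key: tuple, storages: Iterable[Callable[[tuple], dict]]) -> dict:
    """Try each storage in order; raise only if all of them fail."""
    last_exc: Optional[Exception] = None
    for storage in storages:
        try:
            return storage(key)
        except FileNotFoundError as exc:
            last_exc = exc  # remember the failure and try the next storage
    raise DataIndexDirError(f"failed to load directory {key}") from last_exc


def missing(key: tuple) -> dict:
    """Hypothetical storage that never has the object."""
    raise FileNotFoundError(f"No such file or directory: {'/'.join(key)}")


try:
    load_dir(("60", "36c8668869290419aec048f26f8deb.dir"), [missing, missing])
except DataIndexDirError as err:
    message, cause = str(err), err.__cause__
```

Because of the `from last_exc` chaining, the final error reports only the local cache path, even though the first failure was the 404 from the remote.
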
kcchu commented 10 months ago

I encountered the same issue. Running dvc pull results in a similar error, but pulling an individual directory with dvc pull FolderName works.

Output of dvc doctor:

DVC version: 3.27.0 (brew)
--------------------------
Platform: Python 3.11.6 on macOS-13.6-arm64-arm-64bit
Subprojects:
    dvc_data = 2.18.1
    dvc_objects = 1.0.1
    dvc_render = 0.6.0
    dvc_task = 0.3.0
    scmrepo = 1.3.1
Supports:
    azure (adlfs = 2023.10.0, knack = 0.11.0, azure-identity = 1.14.1),
    gdrive (pydrive2 = 1.17.0),
    gs (gcsfs = 2023.9.1),
    http (aiohttp = 3.8.6, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.6, aiohttp-retry = 2.8.3),
    oss (ossfs = 2021.8.0),
    s3 (s3fs = 2023.9.1, boto3 = 1.28.17),
    ssh (sshfs = 2023.7.0),
    webdav (webdav4 = 0.9.8),
    webdavs (webdav4 = 0.9.8),
    webhdfs (fsspec = 2023.9.1)
Config:
    Global: /Users/kc/Library/Application Support/dvc
    System: /opt/homebrew/share/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: gs
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /opt/homebrew/var/cache/dvc/repo/45bf98205c5909c54f3e921731589b5f

dberenbaum commented 10 months ago

I'm not able to reproduce it. @kcchu @atekoa Are either of you able to generate a simple reproducible example of the problem?

dberenbaum commented 10 months ago

Also, could you each try to drop everything in the Repo.site_cache_dir location and see if the problem persists?

davebulaval commented 10 months ago

I had the same problem. In my case, it was an error 18: I forgot to run dvc push from one server. The message was "explicit": it was trying to pull some data that was not on the remote.

kcchu commented 10 months ago

Also, could you each try to drop everything in the Repo.site_cache_dir location and see if the problem persists?

Deleting the directory of Repo.site_cache_dir does resolve the problem.

dberenbaum commented 10 months ago

@efiop Should we try to bump the path for site_cache_dir so it forces a new cache for existing repos?

efiop commented 10 months ago

Need to look into why it is happening first, otherwise it will only obscure it till the next report.

kcchu commented 10 months ago

I didn't have a chance to reproduce it, but here is what I have done. I hope it helps the investigation.

PythonFZ commented 10 months ago

I've encountered the same issue. I tried to pull data but forgot to set the credentials for the S3 remote in config.local. After setting the credentials, I encountered the same error. The suggested fix resolved it.

EDIT: I only saw this issue and commented on it. This is my dvc doctor output, which is not from the newest version:

DVC version: 3.27.0 (pip)
-------------------------
Platform: Python 3.10.13 on Linux-6.2.0-36-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 2.18.1
        dvc_objects = 1.0.1
        dvc_render = 0.6.0
        dvc_task = 0.3.0
        scmrepo = 1.4.0
Supports:
        http (aiohttp = 3.8.6, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.6, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.10.0, boto3 = 1.28.64)
Config:
        Global: /tikhome/fzills/.config/dvc
        System: /etc/xdg/dvc

dberenbaum commented 10 months ago

Need to look into why it is happening first, otherwise it will only obscure it till the next report.

@efiop I think it's an index issue we fixed already (maybe in https://github.com/iterative/dvc/issues/9785?) but for people who have an index that predates that change, they will still run into this issue, so forcing the index to be regenerated should resolve it. WDYT?

efiop commented 10 months ago

Hm, we've changed the index cache key a few times since then, so it shouldn't pick it up in newer versions. It is possible we are handling this particular case badly: directories that failed to load should not be marked as loaded, so that we attempt to load them again next time. Need to check, maybe there is another bug there somewhere...
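
The intended behaviour can be illustrated with a toy lazy loader: an entry is recorded as loaded only after the load succeeds, so a failed attempt (for example, one made before credentials were configured) is retried on the next access. This is an illustration of the idea only; `LazyDirs` and its attributes are invented for the example and do not mirror dvc_data's index.

```python
from typing import Callable, Dict, Set


class LazyDirs:
    """Toy model: mark a directory as loaded only when loading succeeds."""

    def __init__(self, loader: Callable[[str], list]):
        self._loader = loader
        self._loaded: Set[str] = set()
        self._entries: Dict[str, list] = {}

    def get(self, key: str) -> list:
        if key not in self._loaded:
            # If the loader raises, the lines below are never reached,
            # so the key stays unloaded and is retried on the next call.
            self._entries[key] = self._loader(key)
            self._loaded.add(key)
        return self._entries[key]


attempts = []


def flaky_loader(key: str) -> list:
    """Hypothetical loader that fails on its first call, then succeeds."""
    attempts.append(key)
    if len(attempts) == 1:
        raise FileNotFoundError(key)
    return ["file_a", "file_b"]


dirs = LazyDirs(flaky_loader)
try:
    dirs.get("data")
except FileNotFoundError:
    pass  # first attempt fails and nothing is cached
result = dirs.get("data")  # second attempt retries and succeeds
```

A buggy variant would add the key to `_loaded` before calling the loader; the failed directory would then never be retried until the cached state is deleted, which matches the symptom reported above.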

PythonFZ commented 10 months ago

This also happens in our CI https://github.com/zincware/IPSuite/actions/runs/6825380182/job/18563548993?pr=218 see prepare repo.

dberenbaum commented 10 months ago

I'm still unclear and would like to hear from users watching this issue:

  1. Does deleting site_cache_dir fix the problem for everyone?
  2. Does the problem come back after that?

PythonFZ commented 10 months ago

I'm still unclear and would like to hear from users watching this issue:

  1. Does deleting site_cache_dir fix the problem for everyone?
  2. Does the problem come back after that?

For me, this is also happening in a GitHub CI runner. The site_cache_dir should only be created when the runner starts. The CI was using url = https://dagshub.com/PythonFZ/IPS-Examples.dvc. Furthermore, I've seen this on a local machine with an S3 remote, but I can't tell you exactly how I fixed it then. For now I've just pinned the DVC version to 3.23.

dberenbaum commented 8 months ago

The CI was using url = https://dagshub.com/PythonFZ/IPS-Examples.dvc.

Could you give more detail on what's happening? How does that url get used in your CI script?

PythonFZ commented 8 months ago

This also happens in our CI https://github.com/zincware/IPSuite/actions/runs/6825380182/job/18563548993?pr=218 see prepare repo.

https://dagshub.com/PythonFZ/IPS-Examples contains some example scripts for our package IPSuite, which we test as follows

  run: |
    git clone https://github.com/PythonFZ/IPS-Examples
    cd IPS-Examples
    git fetch origin
    git checkout ${{ matrix.branch }}
    poetry run dvc pull
- name: run notebook
  run: |
    cd IPS-Examples
    poetry run jupyter nbconvert --to notebook --execute main.ipynb
- name: dvc repro
  run: |
    cd IPS-Examples
    poetry run dvc repro -f

I've updated the remote to S3, but prior to that it was using

['remote "origin"']
    url = https://dagshub.com/PythonFZ/IPS-Examples.dvc

efiop commented 8 months ago

@PythonFZ @themaikelman Mind giving 3.38.0 a try?

tonycusackData commented 7 months ago

@PythonFZ @themaikelman Mind giving 3.38.0 a try?

I was having the same issue as above. Deleting everything in Repo.site_cache_dir didn't work, but this finally fixed it: pip install dvc[gs]==3.38.0

Thank you!

dberenbaum commented 7 months ago

Thanks for the feedback @tonycusackData! Let's close this then and we can reopen if needed.

sherwoac commented 4 months ago

WDYT

Having the same issue. How do you force the index to rebuild?

thanks.

dberenbaum commented 4 months ago

@sherwoac dvc doctor will show you the location of the site_cache_dir. You can safely delete everything there, which will force dvc to rebuild the index and related temporary data.
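
A small Python sketch of that cleanup, assuming `dvc doctor` prints a `Repo.site_cache_dir: <path>` line as in the outputs earlier in this thread. The helper names are made up for illustration, and the deleting call is left commented out so that nothing is removed by accident:

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def find_site_cache_dir(doctor_output: str) -> Optional[Path]:
    """Extract the Repo.site_cache_dir path from `dvc doctor` output."""
    prefix = "Repo.site_cache_dir:"
    for line in doctor_output.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            return Path(stripped[len(prefix):].strip())
    return None


def clear_directory(path: Path) -> int:
    """Delete everything inside `path` (not `path` itself); return the count."""
    removed = 0
    for child in path.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
        removed += 1
    return removed


def clear_site_cache() -> int:
    """Run `dvc doctor` in the current repo and empty its site_cache_dir."""
    out = subprocess.run(
        ["dvc", "doctor"], capture_output=True, text=True, check=True
    ).stdout
    cache_dir = find_site_cache_dir(out)
    if cache_dir is None or not cache_dir.is_dir():
        return 0
    return clear_directory(cache_dir)


sample = (
    "Repo: dvc, git\n"
    "Repo.site_cache_dir: /var/tmp/dvc/repo/fb1190580eb02ee00297b6211011d5af\n"
)
print(find_site_cache_dir(sample))
# clear_site_cache()  # uncomment to actually delete the cached index data
```

Run it from inside the affected repository, since the reported site_cache_dir is specific to each repo.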