iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0
13.44k stars 1.17k forks source link

The whole metrics show fails if a single dir with metrics is missing #10018

Closed shcheklein closed 4 months ago

shcheklein commented 9 months ago

Take the recent example-get-started-experiments. Change in dvc.lock md5 for results/train directory that among other things has a metrics.json specified in it.

Run dvc metrics show --json:

WARNING: failed to load metrics in revision '', failed to load directory ('results', 'train')
DVC failed to load some metrics for following revisions: ''.
{
  "": {
    "error": {
      "type": "DataIndexDirError",
      "msg": "failed to load directory ('results', 'train')"
    }
  }
}

While there other metrics, and only one of them should be failing, not the whole command / collection.

It's throwing this exception:

2023-10-12 15:29:34,081 DEBUG: failed to load directory ('results', 'train'): https://remote.dvc.org/get-started-pools/files/md5/9e/33e5beef5ce777574de8a1f9aa6c71.dir: 404, message='Not Found', url=URL('https://s3-us-east-2.amazonaws.com/dvc-public/remote/get-started-pools/files/md5/9e/33e5beef5ce777574de8a1f9aa6c71.dir')
Traceback (most recent call last):
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/fsspec/implementations/http.py", line 417, in _info
    await _file_info(
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/fsspec/implementations/http.py", line 837, in _file_info
    r.raise_for_status()
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found', url=URL('https://s3-us-east-2.amazonaws.com/dvc-public/remote/get-started-pools/files/md5/9e/33e5beef5ce777574de8a1f9aa6c71.dir')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_data/index/index.py", line 552, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_data/index/index.py", line 488, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/fsspec/spec.py", line 1229, in open
    self.open(
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/fsspec/spec.py", line 1241, in open
    f = self._open(
        ^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/fsspec/implementations/http.py", line 356, in _open
    size = size or self.info(path, **kwargs)["size"]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/fsspec/implementations/http.py", line 430, in _info
    raise FileNotFoundError(url) from exc
FileNotFoundError: https://remote.dvc.org/get-started-pools/files/md5/9e/33e5beef5ce777574de8a1f9aa6c71.dir

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc/repo/experiments/collect.py", line 66, in collect_rev
    data = _collect_rev(
           ^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc/repo/experiments/collect.py", line 98, in _collect_rev
    return SerializableExp.from_repo(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc/repo/experiments/serialize.py", line 57, in from_repo
    params = _gather_params(repo, deps_only=param_deps, on_error="return")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc/repo/params/show.py", line 128, in _gather_params
    files_keypaths = _collect_params(
                     ^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc/repo/params/show.py", line 80, in _collect_params
    ret.update({file: _params for file in expand_paths(fs, [repo_path])})
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc/repo/params/show.py", line 80, in <dictcomp>
    ret.update({file: _params for file in expand_paths(fs, [repo_path])})
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc/utils/__init__.py", line 423, in expand_paths
    if fs.isdir(path):
       ^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 191, in isdir
    return self.fs.isdir(path)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/fsspec/spec.py", line 689, in isdir
    return self.info(path)["type"] == "directory"
           ^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc/fs/dvc.py", line 371, in info
    return self._info(key, path, ignore_subrepos=ignore_subrepos)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc/fs/dvc.py", line 381, in _info
    dvc_info = dvc_fs.fs.index.info(subkey)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_data/index/index.py", line 477, in info
    entry = self[key]
            ~~~~^^^^^
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_data/index/index.py", line 621, in __getitem__
    self._load(dir_key, dir_entry)
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_data/index/index.py", line 647, in _load
    self.onerror(entry, exc)
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_data/index/index.py", line 579, in _onerror
    raise exc
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_data/index/index.py", line 645, in _load
    _load_from_storage(self._trie, entry, self.storage_map[key])
  File "/Users/ivan/Projects/studio/.venv/lib/python3.11/site-packages/dvc_data/index/index.py", line 567, in _load_from_storage
    raise DataIndexDirError(f"failed to load directory {entry.key}") from last_exc
dvc_data.index.index.DataIndexDirError: failed to load directory ('results', 'train')

In Studio it's causing an unexpected error. In VS Code it also means we show the whole commit broken or even worse.

DVC Version

DVC version: 3.25.0 (pip)
-------------------------
Platform: Python 3.11.4 on macOS-13.3.1-arm64-arm-64bit
Subprojects:
    dvc_data = 2.18.1
    dvc_objects = 1.0.1
    dvc_render = 0.6.0
    dvc_task = 0.3.0
    scmrepo = 1.3.1
Supports:
    azure (adlfs = 2023.8.0, knack = 0.11.0, azure-identity = 1.14.0),
    gdrive (pydrive2 = 1.17.0),
    gs (gcsfs = 2023.6.0),
    http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2023.6.0, boto3 = 1.26.76),
    ssh (sshfs = 2023.7.0)
Config:
    Global: /Users/ivan/Library/Application Support/dvc
    System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/55ddcdf6b41fadae3bcb27bf424ab29b
shcheklein commented 9 months ago

Getting the same for the gdrive fixture (fixing customer's issue).

Screenshot 2023-10-15 at 9 48 34 AM Screenshot 2023-10-15 at 9 48 55 AM
skshetry commented 4 months ago

This is kinda tricky to fix correctly in the DataFileSystem.

We know it's a directory due to .dir in the hash, so isdir should return True. But when you go to list it, it will fail since there is no actual .dir file.

The ideal fix here would be to try to load .dir file during isdir() calls, and raise FileNotFoundError if they are not found. But that might make everything slower, I assume. That fix would be in dvc_data.index.info()/dvc_data.index.ls().

On dvc side, we do need to protect ourselves from random errors, so we need to be careful when expanding paths. We can never be sure what we'll get during path expansion, so guarding makes sense to me.