iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

get / import: No storage files available #10572

Closed mbspng closed 1 week ago

mbspng commented 1 month ago

Bug Report

Description

I have tracked files in repo-a under data. dvc import and dvc get both fail when trying to get files from repo-a in repo-b.

Reproduce

I cloned my own repo (repo-a) under /tmp to test whether dvc pull works. It does. Then I checked status and remote:

[/tmp/repo-a] [master *]
-> % uv run dvc status -c
Cache and remote 'azure-blob' are in sync.      

[/tmp/repo-a] [master *]
-> % uv run dvc list --dvc-only .
data

So that is all correct.

Then I go to my repo-b. I configured its remote to be the same as that of repo-a. Here is the check:

[repo-b] [master *]
-> % diff .dvc/config.local /tmp/repo-a/.dvc/config.local | wc -l
0

Then I try to get the data from repo-a. It fails:

[repo-b] [master *]
-> % uv run dvc list "git@gitlab.com:<org>/repo-a.git" --dvc-only
data      

[repo-b] [master *]
-> % uv run dvc get "git@gitlab.com:<org>/repo-a.git" "data" -v
2024-09-30 13:41:19,905 DEBUG: v3.55.2 (pip), CPython 3.10.14 on Linux-6.8.0-45-generic-x86_64-with-glibc2.35
2024-09-30 13:41:19,906 DEBUG: command: /.../repo-b/.venv/bin/dvc get git@gitlab.com:<org>/repo-a.git data -v
2024-09-30 13:41:19,985 DEBUG: Creating external repo git@gitlab.com:<org>/repo-a.git@None
2024-09-30 13:41:19,985 DEBUG: erepo: git clone 'git@gitlab.com:<org>/repo-a.git' to a temporary dir
2024-09-30 13:41:42,394 DEBUG: failed to load ('data', 'cvat', 'datumaro-dataset') from storage local (/tmp/tmpsuoa_qcgdvc-cache/files/md5) - [Errno 2] No such file or directory: '/tmp/tmpsuoa_qcgdvc-cache/files/md5/8a/6de34918ed22935e97644bf465f920.dir'
Traceback (most recent call last):
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 611, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 547, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 324, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 131, in open
    return open(path, mode=mode, encoding=encoding)  # noqa: SIM115
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpsuoa_qcgdvc-cache/files/md5/8a/6de34918ed22935e97644bf465f920.dir'

2024-09-30 13:41:42,401 ERROR: unexpected error - failed to load directory ('data', 'cvat', 'datumaro-dataset'): [Errno 2] No such file or directory: '/tmp/tmpsuoa_qcgdvc-cache/files/md5/8a/6de34918ed22935e97644bf465f920.dir'                           
Traceback (most recent call last):
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 611, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 547, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 324, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 131, in open
    return open(path, mode=mode, encoding=encoding)  # noqa: SIM115
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpsuoa_qcgdvc-cache/files/md5/8a/6de34918ed22935e97644bf465f920.dir'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc/cli/command.py", line 41, in do_run
    return self.run()
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc/commands/get.py", line 30, in run
    return self._get_file_from_repo()
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc/commands/get.py", line 37, in _get_file_from_repo
    Repo.get(
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc/repo/get.py", line 64, in get
    download(fs, fs_path, os.path.abspath(out), jobs=jobs)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc/fs/__init__.py", line 67, in download
    return fs._get(fs_path, to, batch_size=jobs, callback=cb)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc/fs/dvc.py", line 692, in _get
    return self.fs._get(
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc/fs/dvc.py", line 543, in _get
    for root, dirs, files in self.walk(rpath, maxdepth=maxdepth, detail=True):
  File "/.../repo-b/.venv/lib/python3.10/site-packages/fsspec/spec.py", line 468, in walk
    yield from self.walk(
  File "/.../repo-b/.venv/lib/python3.10/site-packages/fsspec/spec.py", line 468, in walk
    yield from self.walk(
  File "/.../repo-b/.venv/lib/python3.10/site-packages/fsspec/spec.py", line 427, in walk
    listing = self.ls(path, detail=True, **kwargs)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc/fs/dvc.py", line 382, in ls
    for info in dvc_fs.ls(dvc_path, detail=True):
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 519, in ls
    return self.fs.ls(path, detail=detail, **kwargs)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/fs.py", line 164, in ls
    for key, info in self.index.ls(root_key, detail=True):
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 764, in ls
    self._ensure_loaded(root_key)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 761, in _ensure_loaded
    self._load(prefix, entry)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 710, in _load
    self.onerror(entry, exc)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 638, in _onerror
    raise exc
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 708, in _load
    _load_from_storage(self._trie, entry, storage_info)
  File "/.../repo-b/.venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 626, in _load_from_storage
    raise DataIndexDirError(f"failed to load directory {entry.key}") from last_exc
dvc_data.index.index.DataIndexDirError: failed to load directory ('data', 'cvat', 'datumaro-dataset')

2024-09-30 13:41:42,432 DEBUG: Version info for developers:
DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.10.14 on Linux-6.8.0-45-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.16.6
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.8
Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.18.0),
        http (aiohttp = 3.10.8, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.8, aiohttp-retry = 2.8.3)
Config:
        Global: /.../.config/dvc
        System: /.../.config/kdedefaults/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/mapper/vgkubuntu-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/bdf5f37be5108aada94933a567e64744

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2024-09-30 13:41:42,433 DEBUG: Analytics is enabled.
2024-09-30 13:41:42,458 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpwllf0ijo', '-v']
2024-09-30 13:41:42,465 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpwllf0ijo', '-v'] with pid 111408
2024-09-30 13:41:42,466 DEBUG: Removing '/tmp/tmp5s8bt4cedvc-clone'
2024-09-30 13:41:42,495 DEBUG: Removing '/tmp/tmpsuoa_qcgdvc-cache'

Then I tested whether I can push from repo-b. I can:

[repo-b] [master *]
-> % touch test

-> % uv run dvc push
Collecting
|1.00 [00:00,  234entry/s]
Pushing
1 file pushed

Same problem when I target a specific file:

[repo-b] [master *]

-> % uv run dvc get "git@gitlab.com:<org>/repo-a.git" "data/master-table.csv"
ERROR: unexpected error - [Errno 2] No storage files available: 'data/master-table.csv' 

But the file IS on the remote. I can pull it in the cloned repo-a.

Also, see this:

-> % uv run dvc get git@gitlab.com:<org>/repo-a.git data
ERROR: unexpected error - failed to load directory ('data', 'cvat', 'datumaro-dataset'): [Errno 2] No such file or directory: '/tmp/tmp_tgyr2ymdvc-cache/files/md5/8a/6de34918ed22935e97644bf465f920.dir'   

This file (files/md5/8a/6de34918ed22935e97644bf465f920.dir) DOES exist on the remote!

Environment information

-> % uv pip list G dvc
dvc                           3.55.2
dvc-data                      3.16.5
dvc-http                      2.32.0
dvc-objects                   5.1.0
dvc-render                    1.0.2
dvc-studio-client             0.21.0
dvc-task                      0.4.0

-> % uname -a
Linux <name> 6.8.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Sep 11 15:25:05 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

-> % python --version
Python 3.10.13

Output of dvc doctor:

DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.10.14 on Linux-6.8.0-45-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.16.6
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.8
Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.18.0),
        http (aiohttp = 3.10.8, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.8, aiohttp-retry = 2.8.3)
Config:
        Global: /home/mbs/.config/dvc
        System: /home/mbs/.config/kdedefaults/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/mapper/vgkubuntu-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/bdf5f37be5108aada94933a567e64744

I already deleted /var/tmp/dvc/. Did not help.

shcheklein commented 1 month ago

Do you use local config in repo_a to define the remote storage?

from storage local (/tmp/tmpsuoa_qcgdvc-cache/files/md5)

is it a correct path? is it defined somewhere?

mbspng commented 1 month ago

Do you use local config in repo_a to define the remote storage?

Hi, yes, I use .dvc/config.local. But I tried with .dvc/config also, and it is the same problem.

from storage local (/tmp/tmpsuoa_qcgdvc-cache/files/md5)

is it a correct path? is it defined somewhere?

Hm, I don't understand your question. That path comes from within the dvc module. I suppose it downloads things into a temporary directory.

shcheklein commented 1 month ago

But I tried with .dvc/config also, and it is the same problem.

was it committed into Git? did it alone have enough information to download objects from the remote, or did you still rely on some local config or some other way of providing credentials, for example?

Hm, I don't understand your question. That path comes from within the dvc module. I suppose it downloads things into a temporary directory.

👍

mbspng commented 1 month ago

I nuked the blob store container to make sure. I made a clean one and re-added the data with dvc push from repo-a. The data is clean and the .dvc files are on the GitLab server on a feature branch. dvc pull works without issues in the cloned version of repo-a (the one under /tmp). I am providing an azure blob storage connection string in .dvc/config(.local) of both repos. There are no other credentials used to access the storage.
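This matters because `dvc get` only sees what a fresh clone of repo-a contains. A minimal sketch of the issue, assuming the stock `dvc init` layout (where `.dvc/.gitignore` excludes `config.local`, so a remote defined only there never reaches the Git history that `dvc get` clones):

```shell
set -eu
tmp=$(mktemp -d)
git init -q "$tmp/repo-a"
mkdir "$tmp/repo-a/.dvc"
# dvc init writes this ignore rule, so config.local stays untracked:
printf 'config.local\n' > "$tmp/repo-a/.dvc/.gitignore"
# Remote defined only locally, as in the report:
printf '[core]\n    remote = azure-blob\n' > "$tmp/repo-a/.dvc/config.local"
git -C "$tmp/repo-a" add -A
git -C "$tmp/repo-a" -c user.name=x -c user.email=x@x commit -qm init
# The committed tree is all that a clone (and hence `dvc get`) will see;
# it lists .dvc/.gitignore but no config.local, i.e. no remote definition:
git -C "$tmp/repo-a" ls-tree -r --name-only HEAD
```

If the remote only lives in `.dvc/config.local`, the temporary clone has a cache but no remote to fall back to, which would explain the behavior below.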

mbspng commented 1 month ago

I added log statements into the dvc code:

    def _get_fs_path(self, path: "AnyFSPath", info=None) -> FileInfo:
        from .index import StorageKeyError

        info = info or self.info(path)
        if info["type"] == "directory":
            raise IsADirectoryError(errno.EISDIR, os.strerror(errno.EISDIR), path)

        entry: Optional[DataIndexEntry] = info["entry"]

        assert entry
        hash_info: Optional[HashInfo] = entry.hash_info

        for typ in ["cache", "remote", "data"]:
            try:
                info = self.index.storage_map[entry.key]
                storage = getattr(info, typ)
                if not storage:
                    logger.error("No %s storage for %s", typ, entry)  # <-- added
                    continue
                data = storage.get(entry)
            except (ValueError, StorageKeyError) as err:
                logger.error("Failed to get %s file from %s: %s", typ, storage, err)  # <-- added
                continue
            if data:
                fs, fs_path = data
                if fs.exists(fs_path):
                    return FileInfo(typ, storage, info.cache, hash_info, fs, fs_path)

        raise FileNotFoundError(errno.ENOENT, "No storage files available", path)

This is dvc_data.fs.DataFileSystem._get_fs_path.

I get

2024-10-01 11:52:21,512 ERROR: No remote storage for DataIndexEntry(key=('data', 'master-table.csv'), meta=Meta(isdir=False, size=639, nfiles=None, isexec=False, version_id=None, etag=None, checksum=None, md5='24c548ad6dc838a396dd928cbb1a01b7', inode=None, mtime=None, remote=None, is_link=False, destination=None, nlink=1), hash_info=HashInfo(name='md5', value='24c548ad6dc838a396dd928cbb1a01b7', obj_name=None), loaded=None)
2024-10-01 11:52:21,512 ERROR: No data storage for DataIndexEntry(key=('data', 'master-table.csv'), meta=Meta(isdir=False, size=639, nfiles=None, isexec=False, version_id=None, etag=None, checksum=None, md5='24c548ad6dc838a396dd928cbb1a01b7', inode=None, mtime=None, remote=None, is_link=False, destination=None, nlink=1), hash_info=HashInfo(name='md5', value='24c548ad6dc838a396dd928cbb1a01b7', obj_name=None), loaded=None)
2024-10-01 11:52:21,519 ERROR: unexpected error - [Errno 2] No storage files available: 'data/master-table.csv'

for

uv run dvc get git@gitlab.com:<org>/repo-a.git data/master-table.csv -vvv

So it says remote=None. I do not know if that is significant here.

mbspng commented 1 month ago

So when I pause the dvc program with an IPython.embed() like this:


    def _get_fs_path(self, path: "AnyFSPath", info=None) -> FileInfo:
        from .index import StorageKeyError

        info = info or self.info(path)
        if info["type"] == "directory":
            raise IsADirectoryError(errno.EISDIR, os.strerror(errno.EISDIR), path)

        entry: Optional[DataIndexEntry] = info["entry"]

        assert entry
        hash_info: Optional[HashInfo] = entry.hash_info

        for typ in ["cache", "remote", "data"]:
            try:
                info = self.index.storage_map[entry.key]
                storage = getattr(info, typ)
                if not storage:
                    logger.error("No %s storage for %s", typ, entry)
                    continue
                else:
                    logger.debug("%s storage for %s", typ, entry)
                data = storage.get(entry)
                print("data", data)
            except (ValueError, StorageKeyError) as err:

                logger.error("Failed to get %s file from %s: %s", typ, storage, err)

                continue
            if data:
                fs, fs_path = data
                print("fs_path", fs_path)
                print("fs", fs)
                import IPython; IPython.embed()
                if fs.exists(fs_path):
                    return FileInfo(typ, storage, info.cache, hash_info, fs, fs_path)

and I check on the temporary directory, I see that it is empty.

The printed path is fs_path /tmp/tmpdv7tn0ngdvc-cache/files/md5/24/c548ad6dc838a396dd928cbb1a01b7

But

-> % tree /tmp/tmpdv7tn0ngdvc-cache/
/tmp/tmpdv7tn0ngdvc-cache/

0 directories, 0 files

More prints and logs I added:

2024-10-01 12:06:02,860 DEBUG: cache storage for DataIndexEntry(key=('data', 'master-table.csv'), meta=Meta(isdir=False, size=639, nfiles=None, isexec=False, version_id=None, etag=None, checksum=None, md5='24c548ad6dc838a396dd928cbb1a01b7', inode=None, mtime=None, remote=None, is_link=False, destination=None, nlink=1), hash_info=HashInfo(name='md5', value='24c548ad6dc838a396dd928cbb1a01b7', obj_name=None), loaded=None)
data (<dvc_objects.fs.local.LocalFileSystem object at 0x700c1c19ff10>, '/tmp/tmpdv7tn0ngdvc-cache/files/md5/24/c548ad6dc838a396dd928cbb1a01b7')
fs_path /tmp/tmpdv7tn0ngdvc-cache/files/md5/24/c548ad6dc838a396dd928cbb1a01b7
fs <dvc_objects.fs.local.LocalFileSystem object at 0x700c1c19ff10>

It says remote=None.

mbspng commented 1 month ago

storage = getattr(info, typ) only results in a non-None assignment for typ=="cache". Where does that come from? I suppose it should also assign one for typ=="remote"? But then again, the entry record does not specify a remote either, as shown above.
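The loop quoted above can be boiled down to a standalone sketch (hypothetical `StorageInfo` and `pick_storage` names, not DVC's API): with only `cache` populated, the loop settles on the temp cache, and when that cache is empty there is nothing left to fall back to, matching the "No storage files available" error.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StorageInfo:
    # Mirrors the three storages _get_fs_path probes, in order.
    cache: Optional[str] = None
    remote: Optional[str] = None
    data: Optional[str] = None

def pick_storage(info: StorageInfo) -> str:
    """Return the first configured storage type, like the loop in _get_fs_path."""
    for typ in ("cache", "remote", "data"):
        if getattr(info, typ):
            return typ
    raise FileNotFoundError("No storage files available")

# Only the (empty) temp cache is configured, as in the logs above: the loop
# returns "cache", and a cache miss then has no remote to fall back to.
print(pick_storage(StorageInfo(cache="/tmp/tmp...dvc-cache/files/md5")))  # cache
```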

shcheklein commented 1 month ago

I am providing an azure blob storage connection string in .dvc/config(.local)

what does it exactly mean? Is it .dvc/config or .dvc/config.local?

can you try dvc get, but before that run export AZURE_STORAGE_CONNECTION_STRING='mysecret'?

mbspng commented 1 month ago

I ran export AZURE_STORAGE_CONNECTION_STRING=<the connection string>; same error.

what does it exactly mean? Is it .dvc/config or .dvc/config.local?

I tried both: first the one, then the other. Same error.

shcheklein commented 1 month ago

okay, just to make sure we are running this on a clean env - have you tried to drop site_cache_dir for both repos (e.g. /var/tmp/dvc/repo/bdf5f37be5108aada94933a567e64744)?

also, when dvc get runs and clones to a temp dir, can you go into that dir and try running dvc fetch -v there? Also dvc version and dvc config --list --show-origin.