iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

dvc.api.open: Fails with paths relative to repository root directory #8682

Open francoispichot1 opened 1 year ago

francoispichot1 commented 1 year ago

Bug Report

Description

When the current working directory is a subdirectory of a DVC repository, opening a file with dvc.api.open('path_relative_to_root_directory', repo='path_to_repository') fails with a FileMissingError.
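
One plausible reading of the failure, based on the key reported by the KeyError in the Error section, is that the working directory's offset from the repository root gets prepended to the given path. The sketch below only models that reading with pathlib; `expected_key`, `observed_key`, and the `/repo` location are hypothetical names for illustration and are not DVC's actual code.

```python
from pathlib import PurePosixPath


def expected_key(path: str) -> tuple:
    """Key for `path` when it is taken relative to the repository root."""
    return PurePosixPath(path).parts


def observed_key(repo_root: str, cwd: str, path: str) -> tuple:
    """Hypothetical model of the failure: the cwd's offset from the
    repository root is prepended to the requested path."""
    offset = PurePosixPath(cwd).relative_to(PurePosixPath(repo_root))
    return (offset / path).parts


root = "/repo"       # hypothetical repository location
cwd = "/repo/test"   # working directory after `cd test`

print(expected_key("test/titi.txt"))             # ('test', 'titi.txt')
print(observed_key(root, cwd, "test/titi.txt"))  # ('test', 'test', 'titi.txt')
```

Under this model the two keys coincide only when the working directory is the repository root, which would match the symptom described above.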

Reproduce

  1. Within a dvc repository, create a new directory test
  2. Create test/titi.txt
  3. dvc add test && dvc push
  4. cd test
  5. Run the following Python script:

    from dvc.api import open

    with open(path='test/titi.txt', repo='path_to_repository') as file:
        print(file.read())
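
To check whether the working directory is really the trigger, the same call can be repeated after temporarily switching to the repository root. This is a diagnostic sketch, not a confirmed fix: `working_directory` and `read_from_repo_root` are hypothetical helpers, and `repo` is assumed to be a local file system path rather than a URL.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory, restoring it on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


def read_from_repo_root(path, repo):
    """Call dvc.api.open with the cwd set to the repository root (diagnostic only)."""
    import dvc.api  # imported lazily so working_directory is usable without DVC

    # Resolve the repository path first, so a relative `repo` is still valid
    # after the working directory changes. Assumes `repo` is a local path.
    repo = os.path.abspath(repo)
    with working_directory(repo), dvc.api.open(path, repo=repo) as file:
        return file.read()


# Hypothetical usage, with the same placeholder values as the script above:
# print(read_from_repo_root('test/titi.txt', 'path_to_repository'))
```

If this variant succeeds while the original script fails from inside test, that points at the working directory rather than at the remote or the cache.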



Error

```python
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc_data/index/index.py:134, in BaseDataIndex.info(self, key)
    133 try:
--> 134     entry = self[key]
    135     isdir = entry.meta and entry.meta.isdir

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc_data/index/index.py:179, in DataIndex.__getitem__(self, key)
    178 self._load(dir_key, dir_entry)
--> 179 return self._trie[key]

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/pygtrie.py:937, in Trie.__getitem__(self, key_or_slice)
    936     return self.itervalues(key_or_slice.start)
--> 937 node, _ = self._get_node(key_or_slice)
    938 if node.value is _EMPTY:

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/pygtrie.py:630, in Trie._get_node(self, key)
    629 if node is None:
--> 630     raise KeyError(key)
    631 trace.append((step, node))

KeyError: ('test', 'test', 'titi.txt')

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)
File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc/repo/__init__.py:524, in Repo.open_by_relpath(self, path, remote, mode, encoding)
    522     fs_path = remote_odb.oid_to_path(oid)
--> 524 with fs.open(
    525     fs_path,
    526     mode=mode,
    527     encoding=encoding,
    528 ) as fobj:
    529     yield fobj

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc_objects/fs/base.py:197, in FileSystem.open(self, path, mode, **kwargs)
    196     kwargs.pop("encoding", None)
--> 197 return self.fs.open(path, mode=mode, **kwargs)

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/fsspec/spec.py:1094, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1088     text_kwargs = {
   1089         k: kwargs.pop(k)
   1090         for k in ["encoding", "errors", "newline"]
   1091         if k in kwargs
   1092     }
   1093     return io.TextIOWrapper(
-> 1094         self.open(
   1095             path,
   1096             mode,
   1097             block_size=block_size,
   1098             cache_options=cache_options,
   1099             compression=compression,
   1100             **kwargs,
   1101         ),
   1102         **text_kwargs,
   1103     )
   1104 else:

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/fsspec/spec.py:1106, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1105 ac = kwargs.pop("autocommit", not self._intrans)
-> 1106 f = self._open(
   1107     path,
   1108     mode=mode,
   1109     block_size=block_size,
   1110     autocommit=ac,
   1111     cache_options=cache_options,
   1112     **kwargs,
   1113 )
   1114 if compression is not None:

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc/fs/dvc.py:264, in _DVCFileSystem._open(self, path, mode, **kwargs)
    263 dvc_path = _get_dvc_path(dvc_fs, subkey)
--> 264 return dvc_fs.open(dvc_path, mode=mode)

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc_objects/fs/base.py:197, in FileSystem.open(self, path, mode, **kwargs)
    196     kwargs.pop("encoding", None)
--> 197 return self.fs.open(path, mode=mode, **kwargs)

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc_data/fs.py:70, in DataFileSystem.open(self, path, mode, encoding, **kwargs)
     67 def open(  # type: ignore
     68     self, path: str, mode="r", encoding=None, **kwargs
     69 ):  # pylint: disable=arguments-renamed, arguments-differ
---> 70     fs, fspath = self._get_fs_path(path, **kwargs)
     71     return fs.open(fspath, mode=mode, encoding=encoding)

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc_data/fs.py:46, in DataFileSystem._get_fs_path(self, path)
     45 def _get_fs_path(self, path: "AnyFSPath"):
---> 46     info = self.info(path)
     47     if info["type"] == "directory":

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc_data/fs.py:109, in DataFileSystem.info(self, path, **kwargs)
    108 key = self._get_key(path)
--> 109 info = self.index.info(key)
    110 info["name"] = path

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc_data/index/index.py:162, in BaseDataIndex.info(self, key)
    161 except KeyError as exc:
--> 162     raise FileNotFoundError from exc

FileNotFoundError: 

The above exception was the direct cause of the following exception:

FileMissingError                          Traceback (most recent call last)
Cell In [4], line 1
----> 1 with open(path='test/titi.txt', repo='/Users/utilisateur/Documents/data-science') as file:
      2     print(file.read())

File ~/.pyenv/versions/3.10.5/lib/python3.10/contextlib.py:135, in _GeneratorContextManager.__enter__(self)
    133 del self.args, self.kwds, self.func
    134 try:
--> 135     return next(self.gen)
    136 except StopIteration:
    137     raise RuntimeError("generator didn't yield") from None

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc/api/data.py:198, in _open(path, repo, rev, remote, mode, encoding)
    196 def _open(path, repo=None, rev=None, remote=None, mode="r", encoding=None):
    197     with Repo.open(repo, rev=rev, subrepos=True, uninitialized=True) as _repo:
--> 198         with _repo.open_by_relpath(
    199             path, remote=remote, mode=mode, encoding=encoding
    200         ) as fd:
    201             yield fd

File ~/.pyenv/versions/3.10.5/lib/python3.10/contextlib.py:135, in _GeneratorContextManager.__enter__(self)
    133 del self.args, self.kwds, self.func
    134 try:
--> 135     return next(self.gen)
    136 except StopIteration:
    137     raise RuntimeError("generator didn't yield") from None

File ~/.pyenv/versions/3.10.5/envs/dvc-issue/lib/python3.10/site-packages/dvc/repo/__init__.py:531, in Repo.open_by_relpath(self, path, remote, mode, encoding)
    529         yield fobj
    530 except FileNotFoundError as exc:
--> 531     raise FileMissingError(path) from exc
    532 except IsADirectoryError as exc:
    533     raise DvcIsADirectoryError(f"'{path}' is a directory") from exc

FileMissingError: Can't find 'test/titi.txt' neither locally nor on remote
```

Expected

Because I pass the repository path to the open method, I expect the given path to be interpreted relative to the repository root directory, not to the current working directory. This is also what the documentation states:

path (str): location and file name of the target to open,
        relative to the root of `repo`.
repo (str, optional): location of the DVC project or Git Repo.
            Defaults to the current DVC project (found by walking up from the
            current working directory tree).
            It can be a URL or a file system path.
            Both HTTP and SSH protocols are supported for online Git repos
            (e.g. [user@]server:project.git).

Overall, I believe path handling should be reworked, as the API is unclear on this at the moment. There is no documentation on how the open method handles paths depending on their nature. To my mind, it could be something like this:

Environment information

Output of dvc doctor:

DVC version: 2.37.0 (pip)
---------------------------------
Platform: Python 3.10.5 on macOS-13.0.1-x86_64-i386-64bit
Subprojects:
    dvc_data = 0.28.4
    dvc_objects = 0.14.0
    dvc_render = 0.0.15
    dvc_task = 0.1.6
    dvclive = 1.0.1
    scmrepo = 0.1.4
Supports:
    http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

efiop commented 1 year ago

Not able to reproduce with older or newer versions. @francoispichot1 Could you give a new version a try and see if you still have the same problem?

efiop commented 1 year ago

Correction: I can indeed reproduce this, but I got confused by the description.

johan-sightic commented 1 year ago

We are also experiencing this problem.