Do you use local config in `repo_a` to define the remote storage?

> from storage local (/tmp/tmpsuoa_qcgdvc-cache/files/md5)

is it a correct path? is it defined somewhere?
> Do you use local config in `repo_a` to define the remote storage?

Hi, yes, I use `.dvc/config.local`. But I tried with `.dvc/config` also, and it is the same problem.
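
For reference, the split between the two files looks like this (a sketch; the remote name `azremote` and the container path are placeholders):

```
# remote definition, committed to Git (written to .dvc/config)
dvc remote add -d azremote azure://<container>/dvc-store

# the secret, kept out of Git (--local writes to .dvc/config.local)
dvc remote modify --local azremote connection_string '<connection string>'
```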
> from storage local (/tmp/tmpsuoa_qcgdvc-cache/files/md5)
> is it a correct path? is it defined somewhere?

Hm, I don't understand your question. That path comes from within the dvc module. I suppose it downloads things into a temporary directory.
> But I tried with `.dvc/config` also, and it is the same problem.

Was it committed into Git? Did it have enough information alone to download objects from the remote, or did you still have some local config or some other way to provide credentials, for example?

> Hm, I don't understand your question. That path comes from within the dvc module. I suppose it downloads things into a temporary directory.

👍
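
A quick way to check what a fresh clone gets from Git alone (a sketch, run inside `repo-a`):

```
git ls-files .dvc/          # which config files are actually committed
git show HEAD:.dvc/config   # the config a clone will see; .dvc/config.local is never cloned
```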
I nuked the blob store container to make sure. I made a clean one and re-added the data with `dvc push` from `repo-a`. The data is clean and the `.dvc` files are on the GitLab server on a feature branch. `dvc pull` works without issues in the cloned version of `repo-a` (the one under `/tmp`). I am providing an Azure blob storage connection string in `.dvc/config(.local)` of both repos. There are no other credentials used to access the storage.
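
Roughly, the round trip that works looks like this (a sketch; the clone path is a placeholder):

```
# in repo-a, after recreating the container
dvc push -v       # re-upload the local cache to the Azure remote
dvc status -c     # compare the cache against the remote

# in the fresh clone under /tmp
cd /tmp/repo-a
dvc pull -v       # succeeds here
```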
I added log statements into the dvc code:
```python
def _get_fs_path(self, path: "AnyFSPath", info=None) -> FileInfo:
    from .index import StorageKeyError

    info = info or self.info(path)
    if info["type"] == "directory":
        raise IsADirectoryError(errno.EISDIR, os.strerror(errno.EISDIR), path)

    entry: Optional[DataIndexEntry] = info["entry"]
    assert entry
    hash_info: Optional[HashInfo] = entry.hash_info

    for typ in ["cache", "remote", "data"]:
        try:
            info = self.index.storage_map[entry.key]
            storage = getattr(info, typ)
            if not storage:
                logger.error("No %s storage for %s", typ, entry)  # <-- added
                continue
            data = storage.get(entry)
        except (ValueError, StorageKeyError) as err:
            logger.error("Failed to get %s file from %s: %s", typ, storage, err)  # <-- added
            continue
        if data:
            fs, fs_path = data
            if fs.exists(fs_path):
                return FileInfo(typ, storage, info.cache, hash_info, fs, fs_path)

    raise FileNotFoundError(errno.ENOENT, "No storage files available", path)
```
This is `dvc_data.fs.DataFileSystem._get_fs_path`.
I get

```
2024-10-01 11:52:21,512 ERROR: No remote storage for DataIndexEntry(key=('data', 'master-table.csv'), meta=Meta(isdir=False, size=639, nfiles=None, isexec=False, version_id=None, etag=None, checksum=None, md5='24c548ad6dc838a396dd928cbb1a01b7', inode=None, mtime=None, remote=None, is_link=False, destination=None, nlink=1), hash_info=HashInfo(name='md5', value='24c548ad6dc838a396dd928cbb1a01b7', obj_name=None), loaded=None)
2024-10-01 11:52:21,512 ERROR: No data storage for DataIndexEntry(key=('data', 'master-table.csv'), meta=Meta(isdir=False, size=639, nfiles=None, isexec=False, version_id=None, etag=None, checksum=None, md5='24c548ad6dc838a396dd928cbb1a01b7', inode=None, mtime=None, remote=None, is_link=False, destination=None, nlink=1), hash_info=HashInfo(name='md5', value='24c548ad6dc838a396dd928cbb1a01b7', obj_name=None), loaded=None)
2024-10-01 11:52:21,519 ERROR: unexpected error - [Errno 2] No storage files available: 'data/master-table.csv'
```
for

```
uv run dvc get git@gitlab.com:<org>/repo-a.git data/master-table.csv -vvv
```
So it says `remote=None`. I do not know if that is significant here.
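
(For completeness: `dvc get` works off the default branch of the repo it clones unless told otherwise, and the `.dvc` files here live on a feature branch, so a variant worth trying would be the following, with the branch name as a placeholder.)

```
uv run dvc get git@gitlab.com:<org>/repo-a.git data/master-table.csv \
    --rev <feature-branch> -vvv
```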
So when I pause the dvc program with an `IPython.embed()` like this:
```python
def _get_fs_path(self, path: "AnyFSPath", info=None) -> FileInfo:
    from .index import StorageKeyError

    info = info or self.info(path)
    if info["type"] == "directory":
        raise IsADirectoryError(errno.EISDIR, os.strerror(errno.EISDIR), path)

    entry: Optional[DataIndexEntry] = info["entry"]
    assert entry
    hash_info: Optional[HashInfo] = entry.hash_info

    for typ in ["cache", "remote", "data"]:
        try:
            info = self.index.storage_map[entry.key]
            storage = getattr(info, typ)
            if not storage:
                logger.error("No %s storage for %s", typ, entry)
                continue
            else:
                logger.debug("%s storage for %s", typ, entry)
            data = storage.get(entry)
            print("data", data)
        except (ValueError, StorageKeyError) as err:
            logger.error("Failed to get %s file from %s: %s", typ, storage, err)
            continue
        if data:
            fs, fs_path = data
            print("fs_path", fs_path)
            print("fs", fs)
            import IPython; IPython.embed()
            if fs.exists(fs_path):
                return FileInfo(typ, storage, info.cache, hash_info, fs, fs_path)
```
and then check the temporary directory, I see that it is empty.
The print is

```
fs_path /tmp/tmpdv7tn0ngdvc-cache/files/md5/24/c548ad6dc838a396dd928cbb1a01b7
```

But

```
-> % tree /tmp/tmpdv7tn0ngdvc-cache/
/tmp/tmpdv7tn0ngdvc-cache/

0 directories, 0 files
```
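
(To rule out the remote side, the blob can also be checked directly with the Azure CLI; a sketch, with the container name as a placeholder:)

```
# adjust --name if the remote URL adds a path prefix inside the container
az storage blob exists \
    --container-name <container> \
    --name files/md5/24/c548ad6dc838a396dd928cbb1a01b7 \
    --connection-string "$AZURE_STORAGE_CONNECTION_STRING"
```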
More prints and logs I added:

```
2024-10-01 12:06:02,860 DEBUG: cache storage for DataIndexEntry(key=('data', 'master-table.csv'), meta=Meta(isdir=False, size=639, nfiles=None, isexec=False, version_id=None, etag=None, checksum=None, md5='24c548ad6dc838a396dd928cbb1a01b7', inode=None, mtime=None, remote=None, is_link=False, destination=None, nlink=1), hash_info=HashInfo(name='md5', value='24c548ad6dc838a396dd928cbb1a01b7', obj_name=None), loaded=None)
data (<dvc_objects.fs.local.LocalFileSystem object at 0x700c1c19ff10>, '/tmp/tmpdv7tn0ngdvc-cache/files/md5/24/c548ad6dc838a396dd928cbb1a01b7')
fs_path /tmp/tmpdv7tn0ngdvc-cache/files/md5/24/c548ad6dc838a396dd928cbb1a01b7
fs <dvc_objects.fs.local.LocalFileSystem object at 0x700c1c19ff10>
```
Again it says `remote=None`. `storage = getattr(info, typ)` only results in a non-None assignment for `typ == "cache"`. Where is that coming from? I suppose it should also assign for `typ == "remote"`? But then again, the `entry` record does not specify a remote either, as shown above.
> I am providing an Azure blob storage connection string in `.dvc/config(.local)`

What does it exactly mean? Is it `.dvc/config` or `.dvc/config.local`?

Can you try to do `dvc get`, but before that do `export AZURE_STORAGE_CONNECTION_STRING='mysecret'`?
```
export AZURE_STORAGE_CONNECTION_STRING=<the connection string>
```

Same error.
> What does it exactly mean? Is it `.dvc/config` or `.dvc/config.local`?

I tried with both. First used the one, then the other. Same error.
Okay, just to make sure we are running this on a clean env: have you tried to drop `site_cache_dir` for both repos (e.g. `/var/tmp/dvc/repo/bdf5f37be5108aada94933a567e64744`)?

Also, when `dvc get` runs and does the clone to a temp dir, can you get to that dir and try to run `dvc fetch -v` there? Also `dvc version` and `dvc config --list --show-origin`.
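
Spelled out, something like this (a sketch; the hashes and the temp clone path are placeholders):

```
# drop DVC's site cache for both repos
rm -rf /var/tmp/dvc/repo/<hash-for-repo-a> /var/tmp/dvc/repo/<hash-for-repo-b>

# inside the temp dir that dvc get cloned to
cd /tmp/<tmp-clone-dir>
dvc fetch -v
dvc version
dvc config --list --show-origin
```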
# Bug Report

## Description

I have tracked files in `repo-a` under `data`. `dvc import` and `dvc get` both fail when trying to get files from `repo-a` in `repo-b`.
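
The two failing invocations look roughly like this (the org name elided as elsewhere in this report):

```
uv run dvc get git@gitlab.com:<org>/repo-a.git data/master-table.csv
uv run dvc import git@gitlab.com:<org>/repo-a.git data/master-table.csv -o data/master-table.csv
```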
## Reproduce

I cloned my own repo (`repo-a`) under `/tmp` to test whether `dvc pull` works. It does. Then I checked the status and the remote; all of that is correct.

Then I go to my `repo-b`. I configured the remote to be the same as the one of `repo-a` and checked it. Then I try to get the data from `repo-a`. It fails.

Then I tried whether I can push from `repo-b`. I can.

Same problem when I target a specific file. But the file IS on the remote; I can pull it in the cloned `repo-a`. This file (`files/md5/8a/6de34918ed22935e97644bf465f920.dir`) DOES exist on the remote!

## Environment information
Output of `dvc doctor`:

I already deleted `/var/tmp/dvc/`. Did not help.