Closed. johnyaku closed this issue 1 year ago.
@efiop looks like imports are not being considered at all in the `DataIndex` and thus in `dvc data status`.
Is there a plan to include info about an entry being part of an import stage in the data index, or should this "filter" happen at the UI level (i.e. post-process `repo.data_status()` output to account for import stages)?
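For illustration, a UI-level filter could look roughly like the sketch below. It assumes the status report is a plain dict mapping category names to lists of paths and that the import output paths are collected separately. The helper `drop_imports`, the choice of categories, and the sample data are all hypothetical and not DVC's actual API.

```python
from typing import Dict, Iterable, List


def drop_imports(
    status: Dict[str, List[str]],
    import_paths: Iterable[str],
    categories: Iterable[str] = ("not_in_remote",),
) -> Dict[str, List[str]]:
    """Return a copy of `status` with import outputs removed from `categories`."""
    imports = {p.rstrip("/") for p in import_paths}
    wanted = set(categories)

    def is_import(path: str) -> bool:
        clean = path.rstrip("/")
        # Match the import output itself or anything nested under it.
        return any(clean == imp or clean.startswith(imp + "/") for imp in imports)

    filtered = {}
    for key, paths in status.items():
        if key in wanted:
            filtered[key] = [p for p in paths if not is_import(p)]
        else:
            # Other categories are copied through untouched.
            filtered[key] = list(paths)
    return filtered


# Hypothetical data, shaped like a status report (assumed structure).
sample_status = {
    "not_in_remote": ["data/imported/", "data/own.csv"],
    "committed": ["data/imported/a.txt"],
}
result = drop_imports(sample_status, ["data/imported"])
```

The downside of this approach is that every consumer of the status output would need to apply the same filter, which is part of why handling it in the index may be preferable.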
@dberenbaum I am putting p1 here as it creates quite some noise in repos using `dvc import`, which is an important scenario, and we are pushing for `dvc data status`, but feel free to re-assign a different priority.
Imports are recorded in the index, but as a source for the outputs, which they really are.
The issue here is really that in data status we check against a particular remote, which, as expected, doesn't have imports. We should check against all corresponding remotes instead (e.g. if there are per-output remotes), which for imports might also mean that we should skip them.
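A minimal sketch of that idea, assuming a simplified model in which each output carries an optional per-output remote and an import flag. The `Output` class, `missing_in_remotes`, and the remote names are hypothetical and only illustrate the grouping logic, not how DVC implements it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Output:
    path: str
    md5: str
    remote: Optional[str] = None  # per-output remote, if configured
    is_import: bool = False  # data lives in the source repo's remote


def missing_in_remotes(
    outputs: List[Output],
    remote_contents: Dict[str, Set[str]],
    default_remote: str,
) -> Dict[str, List[str]]:
    """Map each remote name to the paths whose hash is absent from it."""
    missing = defaultdict(list)
    for out in outputs:
        if out.is_import:
            # Imports are not expected in this repo's remotes, so skip them.
            continue
        remote = out.remote or default_remote
        if out.md5 not in remote_contents.get(remote, set()):
            missing[remote].append(out.path)
    return dict(missing)


outputs = [
    Output("own.csv", "aaa"),
    Output("big.bin", "bbb", remote="archive"),
    Output("test.txt", "ccc", is_import=True),
]
remote_contents = {"origin": {"aaa"}, "archive": set()}
report = missing_in_remotes(outputs, remote_contents, default_remote="origin")
```

Here only `big.bin` is reported, against the `archive` remote. The imported `test.txt` is skipped rather than flagged.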
We've updated to v2.58.2 and we are no longer seeing "not in remote".
Instead, the status of imports is "deleted", which again feels misleading.
@dlroden
@johnyaku when does it show as "deleted"? When they are missing locally, or even when they exist locally?
The only common denominator in the spurious "deleted" reports is that all of these files have been imported from a dvc data registry.
The files reported as "deleted" all exist in the local workspace, as links to a shared external cache.
The checksums in the link paths match the checksums in the `.dvc` files.
The files all exist on the remote for the source registry, but not in the remote of the destination dataset (as we would expect).
So this looks to me like another twist on the "not in remote" message, which has been fixed by no longer using this as the default message, but I suspect that the basic problem is the same. Namely, `dvc data status` does not seem to distinguish "imports" from "indigenous" data.
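One way such a distinction could be drawn is from the `.dvc` file itself. The sketch below assumes the file has already been parsed into a dict and that imports from another DVC repo carry a `repo` entry under `deps`. The helper `is_repo_import` and the placeholder values are illustrative, not taken from DVC's code.

```python
from typing import Any, Dict


def is_repo_import(dvcfile: Dict[str, Any]) -> bool:
    """Guess whether a parsed .dvc file describes an import from another repo."""
    deps = dvcfile.get("deps") or []
    # Assumption: an import dependency is a mapping that has a "repo" section.
    return any(isinstance(dep, dict) and "repo" in dep for dep in deps)


# Placeholder contents; hashes and revisions are made up for the example.
imported = {
    "deps": [
        {
            "path": "test.txt",
            "repo": {"url": "git@github.com:johnyaku/test_reg.git", "rev_lock": "abc123"},
        }
    ],
    "outs": [{"md5": "d41d8cd9", "path": "test.txt"}],
}
indigenous = {"outs": [{"md5": "0cc175b9", "path": "own.csv"}]}

flags = (is_repo_import(imported), is_repo_import(indigenous))
```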
@johnyaku Are you still able to reproduce the issues? These days `dvc data status` checks against the relevant cache/remote. I can't reproduce your problem so far though. If you could come up with a reproducible script, that would help.
@johnyaku Or, if you are still able to reproduce, I'm happy to jump on a quick call to figure it out on the spot.
Apologies for the slow response on this one. Thanks to your help the other day we have our index mirroring sorted out now and I can confirm that I can reproduce what my colleague was seeing. @dlroden
I tried to create a reprex using DVC v3.15.2 to check if this had been fixed since v2.58.2.
I made a toy registry here: https://github.com/johnyaku/test_reg
This contains one file (`test.txt`), which I have pushed to a local "remote" at `../test_reg_remote`.
I then created a toy dataset here: git@github.com:johnyaku/imp_test.git
Nothing to see there yet, because `dvc import` failed:
dvc import git@github.com:johnyaku/test_reg.git test.txt
Importing 'test.txt (git@github.com:johnyaku/test_reg.git)' -> 'test.txt'
ERROR: unexpected error - [Errno 2] No storage files available: 'test.txt'
This seems reminiscent of this issue: https://github.com/iterative/dvc-gdrive/issues/29
I can paste a full stack trace if you like, but the main take-aways are as follows:
File "/home/johree/miniconda3/envs/dmdb/lib/python3.11/site-packages/dvc_data/fs.py", line 73, in _get_fs_path
raise FileNotFoundError(
FileNotFoundError: [Errno 2] No storage files available: 'test.txt'
DVC version: 3.15.2 (conda)
---------------------------
Platform: Python 3.11.4 on Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Subprojects:
dvc_data = 2.13.1
dvc_objects = 0.25.0
dvc_render = 0.5.3
dvc_task = 0.3.0
scmrepo = 1.2.1
Supports:
http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
ssh (sshfs = 2023.7.0)
Config:
Global: /home/johree/.config/dvc
System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: ssh, local
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/b0f5a2f92faec80920e3bf13e3e8daa
I know we have veered off into a new issue here, but I'll need to work thru this in order to create a reprex.
@johnyaku Can't reproduce original issue anymore with newest dvc.
Regarding `import`, there was probably something else wrong there, as I also can't reproduce it.
Feel free to create a new issue if you run into anything still not working.
Bug Report
Description
`dvc data status` reports imported directories as "not in remote". Technically, this is correct, as the data is in the remote for the source repo, not the current repo. But this is a bit confusing.
Reproduce
Expected
Either no message about not being in the remote or, perhaps more helpfully, `dvc data status` could look up the remote for the source repo and check if the data is there, and only report a problem if it is not found.
Environment information
dvc 2.53