iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

dvc data status reports imported directories as "not in remote" #9346

Closed by johnyaku 1 year ago

johnyaku commented 1 year ago

Bug Report

Description

dvc data status reports imported directories as "not in remote".

Technically, this is correct, as the data is in the remote for the source repo, not the current repo. But this is a bit confusing.

Reproduce

dvc import <source-repo> <directory>
dvc data status

Expected

Either no message about the data not being in the remote or, perhaps more helpfully, dvc data status could look up the remote for the source repo, check whether the data is there, and report a problem only if it is not found.

Environment information

dvc 2.53

daavoo commented 1 year ago

@efiop looks like imports are not being considered at all in the DataIndex and thus in dvc data status.

Is there a plan to include info about an entry being part of an import stage in the data index or should this "filter" happen at the UI level (i.e. post-process repo.data_status() output to account for import stages)?
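For illustration, the "filter at the UI level" option could look roughly like the sketch below. It is not DVC's implementation. It assumes the status is a plain dict with a "not_in_remote" list of paths, that each .dvc file has already been parsed into a dict, that output paths are relative to the repo root, and that an import stage can be recognised by a `repo` entry in its deps. The helper names `import_out_paths` and `drop_imports` are hypothetical.

```python
from pathlib import PurePosixPath


def import_out_paths(dvc_files):
    """Collect output paths of stages whose deps carry a `repo` entry (imports)."""
    paths = set()
    for dvc_file in dvc_files:
        deps = dvc_file.get("deps", [])
        if any("repo" in dep for dep in deps):
            for out in dvc_file.get("outs", []):
                paths.add(PurePosixPath(out["path"]))
    return paths


def drop_imports(status, imported, key="not_in_remote"):
    """Return a copy of `status` with imported paths removed from `status[key]`."""

    def is_imported(entry):
        # PurePosixPath drops a trailing slash, so "data/" compares equal to "data".
        path = PurePosixPath(entry)
        return any(path == imp or imp in path.parents for imp in imported)

    filtered = dict(status)
    filtered[key] = [e for e in status.get(key, []) if not is_imported(e)]
    return filtered
```

With this shape, an imported directory and anything beneath it would be dropped from the "not in remote" list, while locally added data would still be reported.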

daavoo commented 1 year ago

@dberenbaum I am putting p1 here as it creates quite a bit of noise in repos using dvc import, which is an important scenario, and we are pushing for dvc data status. Feel free to re-assign a different priority.

efiop commented 1 year ago

Imports are recorded in the index, but as a source for the outputs, which they really are.

The issue here is really that in data status we check against a particular remote, which, as expected, doesn't have imports. We should check against all corresponding remotes instead (e.g. if there are per-output remotes), which for imports might also mean that we should skip them.
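A minimal sketch of that idea, under stated assumptions: each output is modelled by a hypothetical `Out` record carrying an optional per-output remote name and an import flag, and each remote is modelled as a set of hashes it holds. None of these names come from DVC itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Out:
    path: str
    hash: str
    remote: Optional[str] = None  # per-output remote name, if one is configured
    is_import: bool = False       # True when the output comes from an import


def missing_from_remotes(outs, remote_contents, default_remote):
    """List paths whose hash is absent from the remote that output should use."""
    missing = []
    for out in outs:
        if out.is_import:
            # Imported data lives in the source repo's remote, so skip it here.
            continue
        name = out.remote or default_remote
        if out.hash not in remote_contents.get(name, set()):
            missing.append(out.path)
    return missing
```

The point of the sketch is only that the check is made per output, against the remote that output belongs to, rather than once against a single remote for everything.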

johnyaku commented 1 year ago

We've updated to v2.58.2 and we are no longer seeing "not in remote".

Instead, the status of imports is "deleted", which again feels misleading.

@dlroden

dberenbaum commented 1 year ago

@johnyaku when does it show as "deleted"? When they are missing locally, or even when they exist locally?

johnyaku commented 1 year ago

The only common denominator in the spurious "deleted" reports is that all of these files have been imported from a dvc data registry.

The files reported as "deleted" all exist in the local workspace, as links to a shared external cache. The checksums in the link paths match the checksums in the .dvc files. The files all exist on the remote for the source registry, but not in the remote of the destination dataset (as we would expect).
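The checksum comparison described above can be sketched as follows, assuming the cache stores each object under a two-character directory followed by the remainder of the hash (so the last two path components, joined, give the checksum). The function names are hypothetical and the example path is made up.

```python
from pathlib import PurePosixPath


def checksum_from_cache_path(cache_path):
    """Rebuild the checksum from a cache path of the form .../<2 chars>/<rest>."""
    p = PurePosixPath(cache_path)
    return p.parent.name + p.name


def link_matches(link_target, recorded_md5):
    """Compare the checksum implied by a link target with the recorded md5."""
    return checksum_from_cache_path(link_target) == recorded_md5
```

For a symlinked workspace file, the link target could be obtained with `os.readlink(path)` and the recorded value read from the `md5` field of the corresponding .dvc file.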

So this looks to me like another twist on the "not in remote" message, which has been fixed by no longer using this as the default message, but I suspect that the basic problem is the same. Namely, dvc data status does not seem to distinguish "imports" from "indigenous" data.

efiop commented 1 year ago

@johnyaku Are you still able to reproduce the issues? These days dvc data status checks against the relevant cache/remote. I can't reproduce your problem so far though. If you could come up with a script that reproduces it, that would help.

efiop commented 1 year ago

@johnyaku Or, if you are still able to reproduce, I'm happy to maybe jump on a quick call to figure it out on the spot.

johnyaku commented 1 year ago

Apologies for the slow response on this one. Thanks to your help the other day we have our index mirroring sorted out now and I can confirm that I can reproduce what my colleague was seeing. @dlroden

johnyaku commented 1 year ago

I tried to create a reprex using DVC v3.15.2 to check if this had been fixed since v2.58.2.

I made a toy registry here: https://github.com/johnyaku/test_reg

This contains one file (test.txt) which I have pushed to a local "remote" at ../test_reg_remote.

I then created a toy dataset here: git@github.com:johnyaku/imp_test.git

Nothing to see there yet, because dvc import failed:

dvc import git@github.com:johnyaku/test_reg.git test.txt
Importing 'test.txt (git@github.com:johnyaku/test_reg.git)' -> 'test.txt'
ERROR: unexpected error - [Errno 2] No storage files available: 'test.txt'

This seems reminiscent of this issue: https://github.com/iterative/dvc-gdrive/issues/29

I can paste a full stack trace if you like, but the main take-aways are as follows:

  File "/home/johree/miniconda3/envs/dmdb/lib/python3.11/site-packages/dvc_data/fs.py", line 73, in _get_fs_path
    raise FileNotFoundError(
FileNotFoundError: [Errno 2] No storage files available: 'test.txt'

DVC version: 3.15.2 (conda)
---------------------------
Platform: Python 3.11.4 on Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 2.13.1
        dvc_objects = 0.25.0
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.2.1
Supports:
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2023.7.0)
Config:
        Global: /home/johree/.config/dvc
        System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: ssh, local
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/b0f5a2f92faec80920e3bf13e3e8daa

I know we have veered off into a new issue here, but I'll need to work thru this in order to create a reprex.

efiop commented 1 year ago

@johnyaku I can't reproduce the original issue anymore with the newest dvc.

Regarding import, there was probably something else wrong there, as I also can't reproduce it.

Feel free to create a new issue if you run into anything still not working.