iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

pull: clones repositories for imported files #9738

Open peper0 opened 1 year ago

peper0 commented 1 year ago

Description

dvc pull clones the repositories from which files were imported, even though the imported files are cached (they have cache: true, implicitly or explicitly).

Reproduce

  1. dvc init
  2. dvc import any file from a different git repository
  3. dvc push
  4. clear the local cache
  5. dvc pull

At step 5, the source repository is cloned.
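The steps above can be sketched as a small script. This is only an illustration under stated assumptions: the repository URL and file path are placeholders (the report does not name them), `git init` is added because `dvc init` normally expects a Git repository, a default DVC remote is assumed to be configured before `dvc push`, and the cache is assumed to live in the default `.dvc/cache` directory. The script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical placeholders, not taken from the report.
SRC_REPO="${SRC_REPO:-https://example.com/some/repo.git}"
SRC_PATH="${SRC_PATH:-data/file.csv}"

# Emit the reproduction steps, one command per line.
repro_steps() {
    cat <<EOF
git init
dvc init
dvc import $SRC_REPO $SRC_PATH
dvc push
rm -rf .dvc/cache
dvc pull
EOF
}

# Print the steps for review.
repro_steps
```

To actually run them, pipe the output to `sh -ex` inside a scratch directory that already has a default remote configured; the clone described in the report would be expected during the final `dvc pull`.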

Expected

I expect dvc push to push the data to the remote and dvc pull to pull it from the remote, without accessing the git repository it was imported from (unless dvc update is called), since the data is cached by default.

This is a big problem, since the git repo may not be accessible when dvc pull is called (e.g. when it is called by a CI server). Moreover, it takes a lot of time if data is imported from several repositories and some of them are large.

In my understanding, outputs are synced with the source repository only in dvc update and dvc import, not in dvc pull or dvc repro. Therefore I don't see why the repo would need to be accessible when calling dvc pull.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.58.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.4.0-150-generic-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 0.51.0
        dvc_objects = 0.23.0
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.0.4
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2023.4.1)
Config:
        Global: /home/tlakota/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1
Caches: local
Remotes: ssh, ssh
Workspace directory: ext4 on /dev/sdc
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/9d372b24e0a6ee54ffae81f6983b321a

dberenbaum commented 1 year ago

Thanks for the issue @peper0. The current behavior of dvc import is to always download from source and never push imported data to the remote. So the dvc push in your example should not have any impact. The idea is that in most cases users would rather access the source git repo than have an entire extra copy of the dvc-tracked data in remote storage. There's some related discussion about being able to push imports in #4527.

peper0 commented 1 year ago

@dberenbaum What about the push option of the outputs? Shouldn't it decide whether to push the file to the remote?

dberenbaum commented 1 year ago

Yes, ideally import would set push: false and you could change it to push: true to get the behavior you want. Unfortunately, I don't think it's that simple today because import predates the introduction of the push option.

cc @efiop

dberenbaum commented 1 year ago

If you are open to a hacky workaround for now, you could make a dvc stage that does dvc get, which would track it as a normal output that gets pushed.
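A rough sketch of that workaround, assuming a placeholder repository URL, file path, and a hypothetical stage name `fetch_file` (none of these come from the thread). It composes the `dvc get` command the stage would run and prints the matching `dvc stage add` invocation for review rather than executing it.

```shell
# Hypothetical placeholders, not taken from the thread.
SRC_REPO="https://example.com/some/repo.git"
SRC_PATH="data/file.csv"
OUT="file.csv"

# Command the stage would run: a plain download, with no import metadata.
STAGE_CMD="dvc get $SRC_REPO $SRC_PATH -o $OUT"

# Print the stage definition command. Declaring $OUT with -o makes it a
# regular tracked output of the stage.
echo "dvc stage add -n fetch_file -o $OUT \"$STAGE_CMD\""
```

Running the printed command inside a DVC repository, followed by `dvc repro` and `dvc push`, should leave the file tracked as an ordinary stage output, which is the property the workaround relies on.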

peper0 commented 1 year ago

@dberenbaum Yes, that's the direction I'm planning to migrate in. But it has considerable drawbacks, like no support for dvc update.

dberenbaum commented 1 year ago

@peper0 If your stage cmd looks like dvc get --rev some_rev repo_url path, then you can update the --rev field to get update-like functionality, which AFAIK is more or less what import does. I don't plan to close this issue since it's a legitimate request to have all this included in import, but hopefully that at least makes it usable for now since I don't think it's something we can fix that quickly or can prioritize right now.
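As an illustration of pinning the revision, here is a minimal sketch with a hypothetical helper `pinned_get_cmd`; the repository URL, path, and the tags `v1.0` and `v1.1` are placeholders, not values from the thread. It prints the stage command for two revisions to show that an "update" amounts to editing the `--rev` value.

```shell
# Hypothetical helper: build the stage command pinned to a given revision.
# The repo URL, path, and output name are placeholders.
pinned_get_cmd() {
    rev="$1"
    printf 'dvc get --rev %s %s %s -o %s\n' \
        "$rev" "https://example.com/some/repo.git" "data/file.csv" "file.csv"
}

pinned_get_cmd v1.0   # cmd stored in the stage initially
pinned_get_cmd v1.1   # edited cmd after bumping the revision
```

Because the stage's cmd changes when the revision is bumped, the next `dvc repro` should treat the stage as changed and re-run it, fetching the file at the new revision.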