iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

`dvc import-url`: can't pull data if using `--no-download` #10594

Open DGrady opened 1 month ago

DGrady commented 1 month ago

Bug Report

Description

The documentation for `import-url` explains that running this command:

dvc import-url --no-download s3://mybucket/data.csv

should create a DVC metadata file with the pointer and hash information for the source data file, and that it should not download the data immediately. That works as expected.

The documentation also states that if I later run

dvc pull data.csv

at that point, it will download the data and place it in my work tree. (I guess it's not clear whether the data will be added to the cache?) This doesn't work; instead, the pull fails at the checkout step:

> dvc pull data.csv
Collecting
Fetching
Building workspace index
Comparing indexes
Applying changes
Everything is up to date.
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
data.csv
Is your cache up to date?
<https://error.dvc.org/missing-files>

Reproduce
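
The two commands from the description can be wrapped in a small script. This is only a sketch of those steps: `DVC` and `SRC` are hypothetical override variables (not part of DVC), and it assumes it is run inside an initialized DVC repository with read access to the source bucket.

```shell
#!/bin/sh
# Reproduction sketch for the steps in the description.
# Assumptions (not from the original report): DVC and SRC are hypothetical
# override variables, and the current directory is an initialized DVC repo.
DVC="${DVC:-dvc}"
SRC="${SRC:-s3://mybucket/data.csv}"

repro() {
    # Step 1: create the import metadata file without downloading the data.
    "$DVC" import-url --no-download "$SRC" || return 2

    # Step 2: ask DVC to fetch the data into the work tree.
    status=0
    "$DVC" pull data.csv || status=$?

    # Report whether the file actually arrived in the work tree.
    if [ "$status" -eq 0 ] && [ -f data.csv ]; then
        echo "OK: data.csv was pulled"
        return 0
    fi
    present=no
    [ -f data.csv ] && present=yes
    echo "FAIL: dvc pull exited with $status; data.csv present: $present"
    return 1
}
```

Calling `repro` after sourcing the script runs both steps and prints a single `OK` or `FAIL` line, which should make it easier to compare behavior across DVC versions.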

Expected

Based on the documentation, my expectation is that

dvc import-url --no-download s3://mybucket/data.csv
dvc pull data.csv

should copy data.csv from S3 to my local work tree, should not add data.csv to the cache, and should leave the import set up so that any change to data.csv in S3 causes local pipelines that use data.csv as a dependency to be flagged as out of date.

This expected behavior is explained in a couple of places in the documentation:

Environment information

Output of `dvc doctor`:

$ dvc doctor
DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.12.7 on macOS-15.0.1-x86_64-i386-64bit
Subprojects:
    dvc_data = 3.16.6
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.40.2
    scmrepo = 3.3.8
Supports:
    http (aiohttp = 3.10.10, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.10.10, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2024.9.0, boto3 = 1.35.36)
Config:
    Global: /Users/dan/Library/Application Support/dvc
    System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/a60ba71a9b85f3c8d2c283884465924b

Additional Information (if any):