iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

pull: breaks if imported dir ends with "/" #10426

Closed afaul closed 3 weeks ago

afaul commented 1 month ago

Bug Report

Description

When a directory is imported with dvc import and the source path ends with a trailing /, dvc pull is unable to fetch the imported files in a clean clone of the repository.

Reproduce

Run this bash script to reproduce the bug.

rm -rf dvc-test
mkdir dvc-test
cd dvc-test

mkdir repoA
cd repoA
python3 -m venv env
source env/bin/activate
pip install -q dvc
pip install -q dvc-s3
git init
dvc init
mkdir data
dvc import https://github.com/iterative/dataset-registry.git tutorials/nlp/  -o data/   ## broken
# dvc import https://github.com/iterative/dataset-registry.git tutorials/nlp  -o data/   ## working
git add data/nlp.dvc data/.gitignore
git commit -m "commit"
deactivate

cd ..
git clone repoA repoB

cd repoB
python3 -m venv env
source env/bin/activate
pip install -q dvc
pip install -q dvc-s3
dvc pull
deactivate

cd ..

ls -l repoA/data
ls -l repoB/data

Expected

dvc pull in the clean clone should fetch the same data that dvc import fetched originally, whether or not the imported path ends with /.

Environment information

Output of dvc doctor:

DVC version: 3.50.1 (pip)
-------------------------
Platform: Python 3.12.3 on Linux-6.8.9-arch1-2-x86_64-with-glibc2.39
Subprojects:
    dvc_data = 3.15.1
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.4.0
    scmrepo = 3.3.3
Supports:
    http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2024.3.1, boto3 = 1.34.69)
Config:
    Global: /home/afaul/.config/dvc
    System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vg_ssd-lv_home
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/mapper/vg_ssd-lv_home
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/69d484a87eba683f3683324f5c8f57f4

Additional Information (if any):

% dvc pull --verbose
2024-05-13 19:55:34,755 DEBUG: v3.50.1 (pip), CPython 3.12.3 on Linux-6.8.9-arch1-2-x86_64-with-glibc2.39
2024-05-13 19:55:34,755 DEBUG: command: /home/afaul/Downloads/dvc-test/repoB/env/bin/dvc pull --verbose
2024-05-13 19:55:35,660 DEBUG: Creating external repo https://github.com/iterative/dataset-registry.git@f59388cd04276e75d70b2136597aaa27e7937cc3
2024-05-13 19:55:35,660 DEBUG: erepo: git clone 'https://github.com/iterative/dataset-registry.git' to a temporary dir              
Collecting                                                                                                |4.00 [00:01, 3.50entry/s]
Fetching                                                                                                                            
Building workspace index                                                                                  |1.00 [00:00,  379entry/s]
Comparing indexes                                                                                        |7.00 [00:00, 1.25kentry/s]
2024-05-13 19:55:36,999 WARNING: No file hash info found for '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./.gitignore'. It won't be created.
2024-05-13 19:55:36,999 DEBUG: failed to create '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./.gitignore' from 'None'            
Traceback (most recent call last):
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/checkout.py", line 94, in _create_files
    src_fs, src_path = storage_obj.get(entry)
                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/index.py", line 198, in get
    raise ValueError
ValueError

2024-05-13 19:55:37,002 WARNING: No file hash info found for '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./Posts.xml.zip'. It won't be created.
2024-05-13 19:55:37,002 DEBUG: failed to create '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./Posts.xml.zip' from 'None'         
Traceback (most recent call last):
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/checkout.py", line 94, in _create_files
    src_fs, src_path = storage_obj.get(entry)
                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/index.py", line 198, in get
    raise ValueError
ValueError

2024-05-13 19:55:37,003 WARNING: No file hash info found for '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./pipeline.zip'. It won't be created.
2024-05-13 19:55:37,003 DEBUG: failed to create '/home/afaul/Downloads/dvc-test/repoB/data/nlp/./pipeline.zip' from 'None'          
Traceback (most recent call last):
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/checkout.py", line 94, in _create_files
    src_fs, src_path = storage_obj.get(entry)
                       ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc_data/index/index.py", line 198, in get
    raise ValueError
ValueError

Applying changes                                                                                          |0.00 [00:00,     ?file/s]
2024-05-13 19:55:37,004 DEBUG: Removing '/home/afaul/Downloads/dvc-test/repoB/data/nlp'
No remote provided and no default remote set.
Everything is up to date.
2024-05-13 19:55:37,005 ERROR: failed to pull data from the cloud - Checkout failed for following targets:
data/nlp
Is your cache up to date?
<https://error.dvc.org/missing-files>
Traceback (most recent call last):
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc/commands/data_sync.py", line 35, in run
    stats = self.repo.pull(
            ^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc/repo/pull.py", line 42, in pull
    stats = self.checkout(
            ^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/afaul/Downloads/dvc-test/repoB/env/lib/python3.12/site-packages/dvc/repo/checkout.py", line 184, in checkout
    raise CheckoutError([relpath(out_path) for out_path in failed], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
data/nlp
Is your cache up to date?
<https://error.dvc.org/missing-files>

2024-05-13 19:55:37,011 DEBUG: Analytics is enabled.
2024-05-13 19:55:37,073 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpmpwaa9dc', '-v']
2024-05-13 19:55:37,083 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpmpwaa9dc', '-v'] with pid 3404
2024-05-13 19:55:37,085 DEBUG: Removing '/tmp/tmpr14fojvgdvc-clone'
2024-05-13 19:55:37,089 DEBUG: Removing '/tmp/tmpt_4flv1tdvc-cache'

dberenbaum commented 1 month ago

The difference is that the dependency path is saved as tutorials/nlp/ instead of tutorials/nlp. We should either strip the trailing / there or treat the two forms as equivalent in the dvc-data index and everywhere else.
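
To illustrate the first option, here is a minimal sketch of stripping the trailing / before the dependency path is saved. It is not DVC's actual code: normalize_dep_path and as_key are hypothetical helpers, and the idea that the index keys paths by their "/"-separated parts is an assumption made only to show why the two spellings could fail to match.

```python
def normalize_dep_path(path: str) -> str:
    """Strip trailing slashes from a repo-relative POSIX path (hypothetical helper)."""
    stripped = path.rstrip("/")
    # Leave a bare "/" or an empty string unchanged instead of collapsing it.
    return stripped if stripped else path


def as_key(path: str) -> tuple[str, ...]:
    """Split a path into a tuple key, as a trie-like index might (assumption)."""
    return tuple(path.split("/"))


# With the trailing slash, the key gains an empty last component,
# so a lookup under the slash-less key would not find it.
print(as_key("tutorials/nlp/"))                      # ('tutorials', 'nlp', '')
print(as_key("tutorials/nlp"))                       # ('tutorials', 'nlp')
print(as_key(normalize_dep_path("tutorials/nlp/")))  # ('tutorials', 'nlp')
```

Until something along these lines is in place, the commented-out line in the reproduction script (importing tutorials/nlp without the trailing /) is reported to work.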