iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

push: dvc incorrectly parses .gitignore on push but not on add #10461

Closed chrisdonlan closed 3 months ago

chrisdonlan commented 3 months ago

Bug Report

push: dvc push fails to push but WILL ADD and PULL (destroying existing data) under certain valid .gitignore configurations

Description

I use DVC to handle large files in directories that I otherwise ignore in Git:

$ ls
data/

$ cat .gitignore
data/
!data/*.dvc
!data/**/*.dvc

Under these conditions, I ignore the data/ directory, except for any *.dvc files within that directory. DVC will not add any .dvc files if they are ignored by .gitignore, but will add them if I allow them via the everything-except pattern, !data/*.dvc.

However, even if those files are added and cached, dvc will not push them unless I remove data/ from .gitignore.

Reproduce

  1. dvc init
  2. mkdir data
  3. touch data/foo.txt
  4. create the following gitignore file:
data/
!data/*.txt
  5. dvc add data/foo.txt
  6. dvc push -r some-remote

Expected

I expected dvc push to upload the files to the remote. Instead, pushing adds nothing to the remote, while dvc pull destroys every file that is tracked by dvc but shadowed by the data/ line in .gitignore.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.51.2 (brew)
--------------------------
Platform: Python 3.12.4 on macOS-14.5-arm64-arm-64bit
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.5
Supports:
        azure (adlfs = 2024.4.1, knack = 0.11.0, azure-identity = 1.16.1),
        gdrive (pydrive2 = 1.19.0),
        gs (gcsfs = 2024.6.0),
        hdfs (fsspec = 2024.6.0, pyarrow = 16.1.0),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.6.0, boto3 = 1.34.106),
        ssh (sshfs = 2024.6.0),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2024.6.0)
Config:
        Global: /Users/cdonlan3/Library/Application Support/dvc
        System: /Users/cdonlan3/.homebrew/share/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Users/cdonlan3/.homebrew/var/cache/dvc/repo/88246aebc73ddb72e6fa74e23134abbd

dberenbaum commented 3 months ago

Under these conditions, I ignore the data/ directory, except for any *.dvc files within that directory.

Is there a reason you want to manually manage the gitignore file? In the scenario you describe, dvc should automatically update what gets gitignored for you, and it should match what you want here (the dvc-tracked data will be ignored but the .dvc files will be tracked).

Also, is there a reason you don't track all of data/?

data/
!data/*.dvc
!data/**/*.dvc

You are completely ignoring the data/ directory here, including .dvc files. See https://stackoverflow.com/a/67243109/3127500. You probably want something like this:

data/*
!data/*/
!data/*.dvc
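
To illustrate why the first set of rules hides the .dvc files while the second does not, here is a minimal Python sketch of the relevant gitignore behavior: rules are applied in order with the last match winning, and once a parent directory is excluded, nothing inside it can be re-included. This is a simplified model written for this thread, not Git's or DVC's implementation. The names is_ignored, ORIGINAL_RULES, REPRO_RULES and SUGGESTED_RULES are hypothetical, and only the pattern forms that appear in this issue are handled. For real repositories, git check-ignore -v is the authoritative check.

```python
import re


def _to_regex(pattern):
    """Translate a simplified gitignore glob into a regex (illustrative only)."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:[^/]+/)*")  # '**/' spans zero or more directories
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")  # '*' never crosses a '/'
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def _matches(pattern, path, is_dir):
    """Check one pattern (without its leading '!') against one path."""
    dir_only = pattern.endswith("/")
    pattern = pattern.rstrip("/")
    if dir_only and not is_dir:
        return False  # a trailing '/' restricts the pattern to directories
    if "/" in pattern:
        # A slash inside the pattern anchors it to the repository root.
        return _to_regex(pattern.lstrip("/")).fullmatch(path) is not None
    # Without a slash, the pattern is compared with the last path component.
    return _to_regex(pattern).fullmatch(path.rsplit("/", 1)[-1]) is not None


def is_ignored(rules, path):
    """Walk the path from the top; an excluded parent hides everything below."""
    parts = path.split("/")
    ignored = False
    for depth in range(1, len(parts) + 1):
        prefix = "/".join(parts[:depth])
        is_dir = depth < len(parts)  # assumption: the last component is a file
        ignored = False
        for rule in rules:  # later rules override earlier ones
            negated = rule.startswith("!")
            if _matches(rule[1:] if negated else rule, prefix, is_dir):
                ignored = not negated
        if ignored and is_dir:
            return True  # negations below an excluded directory are never seen
    return ignored


ORIGINAL_RULES = ["data/", "!data/*.dvc", "!data/**/*.dvc"]
REPRO_RULES = ["data/", "!data/*.txt"]
SUGGESTED_RULES = ["data/*", "!data/*/", "!data/*.dvc"]

if __name__ == "__main__":
    for name, rules in [
        ("original", ORIGINAL_RULES),
        ("reproduce", REPRO_RULES),
        ("suggested", SUGGESTED_RULES),
    ]:
        print(name, is_ignored(rules, "data/foo.txt.dvc"))
```

Under this model, data/foo.txt.dvc counts as ignored with both rule sets that start with data/, because the walk stops at the excluded data directory before any negation is consulted. With data/* the directory itself stays visible, so !data/*.dvc can take effect while data/foo.txt remains ignored.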

Reproduce

  1. dvc init
  2. mkdir data
  3. touch data/foo.txt
  4. create the following gitignore file:
data/
!data/*.txt
  5. dvc add data/foo.txt
  6. dvc push -r some-remote

I'm unable to reproduce this. dvc add data/foo.txt fails with the error ERROR: bad DVC file name 'data/foo.txt.dvc' is git-ignored. This is the expected result, since as explained above, all of data/ is still ignored.