iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

dvc pull: does not pull out folder but says "everything is up to date", `dvc push` "pushes" them over and over again #10492

Closed Luux closed 3 months ago

Luux commented 3 months ago

Bug Report

Description

I want to pull the results of a given stage. dvc pull claims everything is up to date, but the folder is not created on my local machine.

Our setup:

What I've tried so far:

It's even worse: dvc pull run_28/outdir from an older branch works. After switching to the new branch, dvc pull run_28/outdir says "everything is up-to-date", even though the files should have changed.

Reproduce

The data affected is customer data, so I cannot provide the files. For the other my_stage entries, everything seems to work...

Expected

dvc pull should work as expected or at least show a meaningful error message

Environment information

Output of dvc doctor: Local machine:

DVC version: 3.52.0 (pip)
-------------------------
Platform: Python 3.12.2 on Linux-6.5.0-10043-tuxedo-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.6
Supports:
        azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.17.1),
        gdrive (pydrive2 = 1.20.0),
        gs (gcsfs = 2024.6.1),
        hdfs (fsspec = 2024.6.1, pyarrow = 17.0.0),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.6.1, boto3 = 1.34.131),
        ssh (sshfs = 2024.6.0),
        webdav (webdav4 = 0.10.0),
        webdavs (webdav4 = 0.10.0),
        webhdfs (fsspec = 2024.6.1)
Config:
        Global: /home/[redacted]/.config/dvc
        System: /home/[redacted]/.config/kdedefaults/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/system-root
Caches: local
Remotes: ssh
Workspace directory: ext4 on /dev/mapper/system-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/97d71e2ea2ce50da29b2deb40f96d2c6

Worker machine:

DVC version: 3.52.0 (pip)
-------------------------
Platform: Python 3.11.4 on Linux-5.15.0-113-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.1
        dvc_task = 0.3.0
        scmrepo = 3.3.6
Supports:
        azure (adlfs = 2024.2.0, knack = 0.11.0, azure-identity = 1.13.0),
        gdrive (pydrive2 = 1.19.0),
        gs (gcsfs = 2024.2.0),
        hdfs (fsspec = 2024.2.0, pyarrow = 12.0.1),
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2024.2.0, boto3 = 1.28.17),
        ssh (sshfs = 2023.7.0),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2024.2.0)
Config:
        Global: /home/[redacted]/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sdb
Caches: local
Remotes: ssh
Workspace directory: ext4 on /dev/sdb
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/5066c556c44c1c76389054d0aa58ba82

Additional Information (if any):

dvc push run_28/outdir -v:

2024-07-29 14:13:15,515 DEBUG: v3.52.0 (pip), CPython 3.11.4 on Linux-5.15.0-113-generic-x86_64-with-glibc2.35
2024-07-29 14:13:15,515 DEBUG: command: /home/[redacted]/miniconda3/envs/dvc/bin/dvc push run_28/outdir -v
2024-07-29 14:17:55,825 DEBUG: Checking if stage `run_28/outdir' is in 'dvc.yaml'                                                                                                                    
Collecting                                                                                                                                                                                                        |0.00 [00:00,    ?entry/s]
2024-07-29 14:17:56,691 DEBUG: Preparing to transfer data from '/home/[redacted]/2d-plan/.dvc/cache/files/md5' to 'ssh://[redacted]/files/md5'
2024-07-29 14:17:56,691 DEBUG: Preparing to collect status from '[redacted]/files/md5'
2024-07-29 14:17:56,691 DEBUG: Collecting status from '[redacted]/files/md5'
2024-07-29 14:17:56,698 DEBUG: Preparing to transfer data from '[redacted]/.dvc/cache' to 'ssh://[redacted]'
2024-07-29 14:17:56,699 DEBUG: Preparing to collect status from '[redacted]'
2024-07-29 14:17:56,699 DEBUG: Collecting status from '[redacted]'
Pushing
1155 files pushed
2024-07-29 14:17:56,705 DEBUG: Analytics is enabled.
2024-07-29 14:17:56,742 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp6twqobzz', '-v']
2024-07-29 14:17:56,761 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp6twqobzz', '-v'] with pid 979393

dvc pull is similar:

2024-07-29 16:13:05,576 DEBUG: v3.52.0 (pip), CPython 3.12.2 on Linux-6.5.0-10043-tuxedo-x86_64-with-glibc2.35
2024-07-29 16:13:05,576 DEBUG: command: /home/[redacted]/.local/bin/dvc pull run_28/outdir -v
/home/[redacted]/.local/share/pipx/venvs/dvc/lib/python3.12/site-packages/asyncssh/crypto/cipher.py:29: CryptographyDeprecationWarning: ARC4 has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.ARC4 and will be removed from this module in 48.0.0.
  from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4
/home/[redacted]/.local/share/pipx/venvs/dvc/lib/python3.12/site-packages/asyncssh/crypto/cipher.py:30: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  from cryptography.hazmat.primitives.ciphers.algorithms import TripleDES
2024-07-29 16:17:47,447 DEBUG: Checking if stage 'run_28/outdir' is in 'dvc.yaml'                                                                                                                    
2024-07-29 16:17:47,895 DEBUG: Creating external repo git@[redacted].git@ea8233250ff2dd93c5f14c19ba1fb501afa0381f                                                                                                   
2024-07-29 16:17:47,896 DEBUG: erepo: git clone 'git@[redacted].git' to a temporary dir                                                                                                                             
2024-07-29 16:17:58,906 DEBUG: Creating external repo git@[redacted].git@ea8233250ff2dd93c5f14c19ba1fb501afa0381f                                                                                                   
2024-07-29 16:17:58,910 DEBUG: Creating external repo git@[redacted].git@a892961f0112e4a653f5955034e95c54f924a34d                                                                                                   
2024-07-29 16:17:58,914 DEBUG: Creating external repo git@[redacted].git@ea8233250ff2dd93c5f14c19ba1fb501afa0381f                                                                                                   
2024-07-29 16:17:58,918 DEBUG: Creating external repo git@[redacted].git@ea8233250ff2dd93c5f14c19ba1fb501afa0381f                                                                                                   
Collecting                                                                                                                                                                                                        |0.00 [00:11,    ?entry/s]
2024-07-29 16:17:59,392 DEBUG: Preparing to transfer data from 'ssh://[redacted]/files/md5' to '/home/[redacted]/.dvc/cache/files/md5'
2024-07-29 16:17:59,393 DEBUG: Preparing to collect status from '[redacted]/.dvc/cache/files/md5'
2024-07-29 16:17:59,393 DEBUG: Collecting status from '[redacted]/.dvc/cache/files/md5'
2024-07-29 16:17:59,877 DEBUG: Preparing to transfer data from 'ssh://[redacted]' to '/[redacted]/.dvc/cache'
2024-07-29 16:17:59,878 DEBUG: Preparing to collect status from '[redacted]/.dvc/cache'
2024-07-29 16:17:59,878 DEBUG: Collecting status from '[redacted]/.dvc/cache'
Fetching
Building workspace index                                                                                                                                                                                          |0.00 [00:00,    ?entry/s]
Comparing indexes                                                                                                                                                                                                |1.00 [00:00, 1.21kentry/s]
Applying changes                                                                                                                                                                                                  |0.00 [00:00,     ?file/s]
Everything is up to date.
2024-07-29 16:18:01,497 DEBUG: Analytics is enabled.
2024-07-29 16:18:01,522 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp4u2bxf0j', '-v']
2024-07-29 16:18:01,527 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp4u2bxf0j', '-v'] with pid 201447
2024-07-29 16:18:01,528 DEBUG: Removing '/tmp/tmpe3ia6n7wdvc-clone'
2024-07-29 16:18:01,529 DEBUG: Removing '/tmp/tmptyolnarmdvc-cache'

griai commented 3 months ago

Just to add my two cents, here ...

I actually work on the same team as the OP. We also tried from different machines, which also did not change anything.

However, there is one piece of information that I can add: when I tried to pull the directory in question, I could verify that at least one file from the original directory was actually fetched to my local cache. (I manually searched my cache for the corresponding entry, using the hash that I knew from the original.) However, the file did not appear in my working directory after checkout (or pull). What we could still do is compare the caches and see whether all of the files in question have been transferred to my local cache. That could mean that something goes wrong on checkout.
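The manual cache lookup described above could be automated roughly as follows. This is only a sketch: it assumes the DVC 3.x cache layout (files/md5/<first two hex characters>/<remaining characters>, matching the files/md5 paths in the debug logs) and that the hash is a plain MD5 of the file contents. The function names are made up for illustration and are not part of DVC.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file's raw bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_entry_path(cache_dir: Path, md5: str) -> Path:
    # Assumed DVC 3.x layout: <cache>/files/md5/<first 2 hex chars>/<rest>
    return Path(cache_dir) / "files" / "md5" / md5[:2] / md5[2:]


def missing_from_cache(original_dir: Path, cache_dir: Path) -> list[Path]:
    """List files under original_dir that have no matching cache entry."""
    missing = []
    for path in sorted(Path(original_dir).rglob("*")):
        if not path.is_file():
            continue
        if not cache_entry_path(cache_dir, file_md5(path)).exists():
            missing.append(path)
    return missing
```

Calling missing_from_cache with a copy of the original directory and the local .dvc/cache would return an empty list if every file was fetched, which would point towards checkout rather than fetch as the failing step.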

Luux commented 3 months ago

As I stated above, when pushing the directory, only the runs folder is created, and it contains only ~11 MB. No files folder is present.

If I add the same data (with the same hashes) via dvc add, the files folder is created and takes up ~1 GB. dvc pull of the manually added files works; pulling the original stage output folder does not.

So I can indeed confirm that it is not just a checkout bug.

Luux commented 3 months ago

Update: we found the cause. In one of the commits of the current branch, the out folder was set to cache: false. So naturally, the files were not uploaded to the cache anymore.

As the old state was indeed cached, this caused weird behaviour of dvc. A warning or similar could be useful in this case instead of dvc just silently behaving this way. Note that this affects all the outs of all my_stages, but only my_stage@28 indicated that something was going wrong at all.

As a suggestion, a warning or error when trying to dvc push or dvc pull an out where cache: false is set would be helpful.
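As a rough illustration of the kind of check suggested here, the sketch below scans an already parsed dvc.yaml mapping for outs that set cache: false. It is not DVC code. It assumes outs are either plain path strings or single-key mappings of path to options, and that foreach stages keep their definition under do. The example stage, command and paths are hypothetical and only mirror the names used in this thread. Loading the file itself would need a YAML parser, which is left out here.

```python
def uncached_outs(dvc_yaml: dict) -> list[tuple[str, str]]:
    """Return (stage name, out path) pairs whose out sets cache: false."""
    found = []
    for name, stage in (dvc_yaml.get("stages") or {}).items():
        # foreach stages are assumed to keep their definition under "do"
        definition = stage.get("do", stage)
        for out in definition.get("outs") or []:
            # An out is either a plain path string or {path: {options}}
            if isinstance(out, dict):
                for path, options in out.items():
                    if (options or {}).get("cache", True) is False:
                        found.append((name, path))
    return found


# Hypothetical parsed dvc.yaml; the structure is an assumption for illustration
example = {
    "stages": {
        "my_stage": {
            "foreach": [27, 28],
            "do": {
                "cmd": "python run.py ${item}",
                "outs": [
                    {"run_${item}/outdir": {"cache": False}},
                    "run_${item}/log.txt",
                ],
            },
        }
    }
}

for stage, path in uncached_outs(example):
    print(f"warning: out '{path}' of stage '{stage}' has cache: false")
```

Run as a pre-push check, something like this would have flagged every my_stage out at once, rather than the problem surfacing only through run_28.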

shcheklein commented 3 months ago

Closing this for now. Please feel free to create a separate issue for the suggested warning/error when trying to dvc push or dvc pull an out where cache: false is set.