iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

Dataset delete when using 'dvc add' due to low memory on pc #10413

Closed by Soeren09 4 months ago

Soeren09 commented 4 months ago

Bug Report

Issue name

dvc add /path: original data files disappeared after running 'dvc add' on a PC with too little disk space.

Description

I was adding some newly recorded dataset files to my repository when the 'dvc add' command crashed midway because the disk ran out of storage space. After the crash I noticed that my original data files had disappeared. I could reproduce the issue by running the same command on another data file. (Screenshot attached.)

Reproduce

Fill up your disk storage. Locate a large data file and run 'dvc add' on it. The file then disappears from your directory.

Example:

  1. Locate dataset.csv
  2. dvc add dataset.csv

Expected

I would expect an error message saying "out of storage" and then my directory should be restored to the original state.
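
Until such a check exists in the tool, a pre-flight test before running 'dvc add' might look like the sketch below. It is only an illustration, not part of DVC: the helper names `enough_space` and `can_add`, the 10% safety margin, and the assumption that the cache may need up to the full file size in free space are all hypothetical. It handles a single file, not a directory tree.

```python
import os
import shutil


def enough_space(file_size: int, free_bytes: int, margin: float = 0.1) -> bool:
    """Return True if free_bytes covers file_size plus a relative safety margin."""
    # Assumption: the cache may need up to one full copy of the file.
    return free_bytes >= file_size * (1 + margin)


def can_add(path: str, cache_dir: str) -> bool:
    """Check whether the drive holding cache_dir has room for a copy of path."""
    file_size = os.path.getsize(path)
    free_bytes = shutil.disk_usage(cache_dir).free
    return enough_space(file_size, free_bytes)
```

For example, `can_add("dataset.csv", ".dvc/cache")` could be called first, and 'dvc add dataset.csv' run only if it returns True. The cache path shown here is the default location inside a repository and may differ on your setup.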

Environment information

Output of dvc doctor:


DVC version: 3.37.0 (exe)
-------------------------
Platform: Python 3.10.11 on Windows-10-10.0.19045-SP0
Subprojects:

Supports:
        azure (adlfs = 2023.12.0, knack = 0.11.0, azure-identity = 1.15.0),
        gdrive (pydrive2 = 1.18.1),
        gs (gcsfs = 2023.12.2.post1),
        hdfs (fsspec = 2023.12.2, pyarrow = 14.0.2),
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        oss (ossfs = 2023.12.0),
        s3 (s3fs = 2023.12.2, boto3 = 1.33.13),
        ssh (sshfs = 2023.10.0),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2023.12.2)
Config:
        Global: C:\Users\sggr\AppData\Local\iterative\dvc
        System: C:\ProgramData\iterative\dvc
Cache types: hardlink
Cache directory: NTFS on C:\
Caches: local
Remotes: local
Workspace directory: NTFS on C:\
Repo: dvc, git
Repo.site_cache_dir: C:\ProgramData\iterative\dvc\Cache\repo\e053a3fa44d04a3fc6a37bf82409b061

Additional Information (if any):

dberenbaum commented 4 months ago

The data should never be deleted. It is likely stored in the cache and needs to be checked out back to the workspace.

In the screenshot you provided, it notes that some targets could not be linked from the cache to the workspace and directs you to https://dvc.org/doc/user-guide/troubleshooting#cache-types. It also provides instructions on how to resolve the problem: reconfigure your cache types and then run dvc checkout.

From the dvc doctor output you provided, you have set up hardlinks, so it seems dvc was unable to hardlink the data from the cache back to the workspace.
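
The two recovery steps described above could be scripted roughly as follows. This is a sketch under assumptions: the function names are made up for illustration, and `copy` is just one possible cache type to fall back to (DVC also accepts `reflink`, `hardlink`, and `symlink`); the troubleshooting page linked above is the reference for which one suits a given filesystem.

```python
import subprocess


def recovery_commands(cache_types: str = "copy") -> list[list[str]]:
    """Build the two DVC commands: set cache.type, then check out from cache."""
    return [
        ["dvc", "config", "cache.type", cache_types],
        ["dvc", "checkout"],
    ]


def run_recovery(repo_dir: str, cache_types: str = "copy") -> None:
    """Run the commands inside repo_dir, stopping at the first failure."""
    for cmd in recovery_commands(cache_types):
        # check=True raises CalledProcessError if a dvc command exits non-zero.
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Calling `run_recovery(".")` from the repository root is equivalent to typing the two commands by hand in a terminal where `dvc` is on the PATH.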

Soeren09 commented 4 months ago

Thank you for the reply. I managed to resolve the cache problem by switching from my "git bash" terminal to my "vs code bash" terminal. I think everything is in order again now.