iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

DVC is filling my files with zeros #9542

Closed JohnAtl closed 1 year ago

JohnAtl commented 1 year ago

Bug Report

dvc add: I restored files from a backup (because dvc had previously filled them with zeros). I ran my check program, and the files were not filled with zeros after being restored. I then ran dvc status, which showed md5 being calculated for large files, then:

data.dvc:
    changed outs:
        modified:           data

So I did dvc add data, and the progress bar showed files being added. I then ran my program to check for files filled with zeros, and the 325 files I previously restored from a backup have been filled with zeros again.
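The check program mentioned above is not shown in the thread. A minimal sketch of such a check, assuming "filled with zeros" means a non-empty file in which every byte is 0x00, might look like the following. The names `is_zero_filled` and `find_zero_filled` are hypothetical and are not part of dvc.

```python
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large files are never loaded whole


def is_zero_filled(path: Path) -> bool:
    """Return True if the file is non-empty and every byte is 0x00."""
    saw_data = False
    with path.open("rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            saw_data = True
            # A chunk is all zeros only if the zero-byte count equals its length.
            if chunk.count(0) != len(chunk):
                return False
    # Empty files are reported as not zero-filled (assumption: they are a separate problem).
    return saw_data


def find_zero_filled(root: Path) -> list[Path]:
    """Walk root recursively and return the zero-filled regular files, sorted by path."""
    return sorted(p for p in root.rglob("*") if p.is_file() and is_zero_filled(p))
```

Running `find_zero_filled(Path("data"))` before and after `dvc add data` would show whether the set of zero-filled files changes at that step.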

Output of dvc doctor:

❯ dvc doctor
DVC version: 2.58.2 (pip)
-------------------------
Platform: Python 3.10.10 on Linux-6.1.0-9-amd64-x86_64-with-glibc2.36
Subprojects:
    dvc_data = 0.51.0
    dvc_objects = 0.22.0
    dvc_render = 0.5.3
    dvc_task = 0.2.1
    scmrepo = 1.0.3
Supports:
    http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
    ssh (sshfs = 2023.4.1)
Config:
    Global: /home/john/.config/dvc
    System: /etc/xdg/dvc
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/mapper/homevg-root
Caches: local
Remotes: ssh
Workspace directory: btrfs on /dev/mapper/homevg-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/148934283df74966750dca7c8ef8acf5

efiop commented 1 year ago

Hi @JohnAtl, most likely you corrupted your cache at some point: you had a hardlink or symlink and didn't properly unprotect it before modifying the file, as suggested in https://dvc.org/doc/user-guide/how-to/update-tracked-data

Though having files filled with literal zeros is even odder. Maybe something else corrupted the files on the remote?

JohnAtl commented 1 year ago

My destination repo drive (sftp to another computer) filled, and dvc wasn't able to push. I've since remedied that. The filling with zeros is happening locally. According to Getting Started, one just adds files, then modifies them, dvc adding changes at will. How would one know to unlink file(s)? dvc doctor does not indicate any problems.

And all that aside, I would consider silently filling files with zeros to be a bug, no matter the cause.

How do I go about fixing this?

JohnAtl commented 1 year ago

Also, thanks for the quick response!

efiop commented 1 year ago

@JohnAtl Could you check the files in the remote to see if they are corrupted? Note that we also have a verify option for the remote: with it enabled, dvc will not trust the hashes from the remote and will recalculate them when downloading (e.g. dvc remote modify myremote verify true). This option is off by default for ssh.

"And all that aside, I would consider silently filling files with zeros to be a bug, no matter the cause."

We are not doing that explicitly anywhere. So far I'm not sure where specifically it is happening.

JohnAtl commented 1 year ago

Thanks.

The verify command returns immediately:

❯ dvc remote modify imac verify true -v -v -v
2023-06-05 15:43:37,161 DEBUG: v2.58.2 (pip), CPython 3.10.10 on Linux-6.1.0-9-amd64-x86_64-with-glibc2.36
2023-06-05 15:43:37,161 DEBUG: command: /home/john/miniconda3/bin/dvc remote modify imac verify true -v -v -v
2023-06-05 15:43:37,161 TRACE: Namespace(cprofile=False, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, show_stack=False, quiet=0, verbose=3, cd='.', cmd='modify', level=None, name='imac', option='verify', value='true', unset=False, func=<class 'dvc.commands.remote.CmdRemoteModify'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-06-05 15:43:37,213 DEBUG: Writing '/home/john/Work/Neurogram/Sleep/.dvc/config'.
2023-06-05 15:43:37,214 DEBUG: Analytics is enabled.
2023-06-05 15:43:37,222 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpunuquvc2']'
2023-06-05 15:43:37,223 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpunuquvc2']'

efiop commented 1 year ago

@JohnAtl Sorry for the confusion. That option only affects a fresh dvc fetch/pull (meaning you don't have the files in your .dvc/cache). Could you try either using a fresh clone of your project, or deleting the existing .dvc/cache and running dvc fetch now?

JohnAtl commented 1 year ago

Sure, let me do a quick backup...

JohnAtl commented 1 year ago

Okay, 100GiB, so that took a bit. Here's the output of dvc status:

❯ dvc status
data.dvc:
    changed outs:
        not in cache:       data

Should I just dvc add now, then try a push? Thanks

efiop commented 1 year ago

@JohnAtl Make sure you've set that verify config option and try dvc fetch. If that's what you already did and that's what dvc status is showing, it means that the files on the remote are corrupted: dvc fetch tried to verify the hash, found that it was incorrect, and thus didn't put the files in the cache. Sorry, but it looks like the remote files are corrupted :( Did you run out of space locally or on the remote?

JohnAtl commented 1 year ago

On the remote.

efiop commented 1 year ago

@JohnAtl Are those files 0-byte sized or literally filled with zeros? I'm not sure if this is a problem with the remote FS (are you using some fancy filesystem there?), with sftp, or with how we use it. This will need deep investigation.

JohnAtl commented 1 year ago

The files are literally filled with zeros, and seem to be the same size as they were. E.g. 1.2GiB, or what have you. I haven’t verified exact numbers.

The remote is on my iMac Pro. The drives are two Sabrent Rocket XTRM 2TiB Thunderbolt 3 drives using macOS RAID0 to create a striped 4TiB drive. The file system is APFS. The sshd is standard-issue macOS, and the macOS version is Ventura. I can get version numbers for everything tomorrow if you’d like.

JohnAtl commented 1 year ago

How do you recommend I proceed? Just empty and init the repo on the remote? I’ll lose history, but honestly, don’t really need it. Also, really appreciate your support through this difficult time :-)

efiop commented 1 year ago

@JohnAtl If you don't care about the history - then yeah, I would delete the remote and start over. Otherwise I would prune corrupted files manually or with some kind of script (unfortunately there is no dedicated command for that in dvc right now, but we do plan on introducing dvc cache check and probably dvc remote check in the future).
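No such script is given in the thread. A minimal sketch of the idea follows, under a stated assumption that is not confirmed here: that the cache or remote directory uses the DVC 2.x layout, where each object lives at `<root>/<first 2 hex chars of md5>/<remaining 30 hex chars>`. The names `md5_of` and `find_mismatched_objects` are hypothetical. The sketch only reports mismatches and deletes nothing, and it skips `.dir` entries. As far as I know, DVC 2.x also normalised line endings in text files before hashing, so a mismatch on a text file with CRLF line endings may be a false positive.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 of the file's raw bytes, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_objects(store_root: Path) -> list[Path]:
    """Return objects whose content md5 differs from the md5 encoded in their path.

    Assumes the layout <store_root>/<2 hex chars>/<30 hex chars> (an assumption
    about DVC 2.x caches and remotes, not something stated in the thread).
    """
    mismatched = []
    for prefix_dir in sorted(store_root.iterdir()):
        if not prefix_dir.is_dir() or len(prefix_dir.name) != 2:
            continue  # not an object prefix directory
        for obj in sorted(prefix_dir.iterdir()):
            if not obj.is_file() or obj.suffix == ".dir":
                continue  # directory listings are skipped in this sketch
            expected = prefix_dir.name + obj.name
            if md5_of(obj) != expected:
                mismatched.append(obj)
    return mismatched
```

Reviewing the returned list by hand before removing anything seems prudent, since a zero-filled object and a false positive look the same to this check.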

JohnAtl commented 1 year ago

That's what I wound up doing. I think I'm all set now. Thanks again for the help!

efiop commented 1 year ago

@JohnAtl It seems like there is an extension for sftp that can check available space, but it is not universally available, so we would not be able to rely on it. Unfortunately it looks like there is nothing much we can do from our side here right now :(

Glad to hear that reset helped and you can continue using it! Thank you for the feedback!

Closing for now, since there doesn't seem to be much we can do from our side.