iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

Malformed files when importing directory #9070

Open jbogar opened 1 year ago

jbogar commented 1 year ago

Bug Report

import: malformed files

Description

  1. In the repository some_repository, I added each file in some_directory to DVC individually (so each file in the directory has its own .dvc file).
  2. In another repository, I imported the whole directory with dvc import git@github.com/some_repository some_directory -o another_directory, so that there is a single .dvc file for the whole directory (see the sketch below).
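
In shell terms, the setup amounts to roughly the following (a minimal sketch; the file names are placeholders and the repository URL is the redacted one from above):

# in some_repository: track each file individually, so each gets its own .dvc file
dvc add some_directory/file_1.jsonl
dvc add some_directory/file_2.jsonl
git add some_directory/*.dvc some_directory/.gitignore
git commit -m "track files individually"
dvc push

# in the other repository: import the whole directory as a single .dvc file
dvc import git@github.com/some_repository some_directory -o another_directory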

However, some of the files were malformed (.jsonl files with some lines cut off in the middle). After deleting those files from the cache and running dvc update, they were downloaded correctly, but other files, at random, became malformed.

I manually checked the hashes of the files. The hash of each malformed file was different from the corresponding hash in .dvc/cache/, so they were likely corrupted during download.
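
For reference, that manual check amounts to comparing the workspace file's MD5 with the cache object it should correspond to; roughly (macOS md5 shown; the file name and hash path are placeholders, and the two-level .dvc/cache/<xx>/<rest> layout is the DVC 2.x one):

# hash of the file as imported into the workspace
md5 another_directory/some_file.jsonl

# hash of the cached object; its path under .dvc/cache/ spells out the expected MD5
md5 .dvc/cache/ab/cdef0123...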

A colleague observed the same behavior (with different files) with dvc==2.45.0; my dvc version is 2.43.2.

Expected

Files should not be malformed, or an error should be raised if they are. A command that recalculates hashes and checks whether files were downloaded correctly would also be useful.

Output of dvc doctor:

DVC version: 2.43.2 (brew)
--------------------------
Platform: Python 3.11.1 on macOS-12.6.3-x86_64-i386-64bit
Subprojects:
        dvc_data = 0.37.3
        dvc_objects = 0.19.1
        dvc_render = 0.1.0
        dvc_task = 0.1.11
        dvclive = 1.4.0
        scmrepo = 0.1.7
Supports:
        azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
        gdrive (pydrive2 = 1.15.0),
        gs (gcsfs = 2023.1.0),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = 2023.1.0, boto3 = 1.24.59),
        ssh (sshfs = 2023.1.0),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2023.1.0)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

efiop commented 1 year ago

@jbogar Are you using real s3 or some s3-compatible storage?

jbogar commented 1 year ago

Real s3

efiop commented 1 year ago

@jbogar Do you use hardlink/symlink link types? Please show your dvc config --list

jbogar commented 1 year ago

Output of dvc config --list (redacted with <...>):

remote.<remote>.url=s3://<bucket>/<remote>/
core.autostage=true
core.remote=<remote>
remote.<remote>.url=s3://<bucket>/<remote>
remote.<remote>.profile=<profile>
core.remote=<remote>

efiop commented 1 year ago

@jbogar Are you able to reliably reproduce this? Does this only happen with import git@github.com/some_repository but not with clone git@github.com/some_repository & pull?

Also, could you try modifying config in some_repository to set verify=true for the default remote? That will make dvc not trust the remote and verify file hashes locally.
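
For reference, that option can be set from the CLI; something along these lines, with the remote name being whatever core.remote points to (placeholder below):

# in some_repository: recompute hashes after download instead of trusting the remote
dvc remote modify <remote> verify true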

johan-sightic commented 1 year ago

I also have this exact problem

efiop commented 1 year ago

@johan-sightic Please post more info. E.g. dvc doctor output.

johan-sightic commented 1 year ago

@efiop I just broke my entire dvc setup trying to fix another problem but I will come back if I solve it

jbogar commented 1 year ago

@efiop I tried to reproduce the issue. Here are the steps (without verify=true):

And three out of 40 files had malformed content (not empty, just malformed).

So one of these may be the reason: an interrupted update, or this timing issue.

EDIT: I tried again, this time without interrupting. The same timing error appeared for multiple files again, but this time the data is correct. So in the previous case it might have been the interrupt.

EDIT2: I reproduced it again, with no timing errors this time. The steps:

efiop commented 1 year ago

@jbogar That's really strange. We try to download to temporary files and then move them into place so that transfers are semi-atomic and no corruption occurs. Maybe there is a regression in the recent transfer changes; I will take a look.

@pmrowla FYI ^
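
For context, the temp-file-then-rename pattern described above looks roughly like this (an illustrative sketch, not DVC's actual code; bucket, key, and file name are placeholders):

# download into a temporary name next to the destination...
aws s3 cp s3://<bucket>/<remote>/ab/cdef0123... another_directory/some_file.jsonl.tmp
# ...then rename; on the same filesystem the rename is atomic, so a half-written
# file should never appear under the final name
mv another_directory/some_file.jsonl.tmp another_directory/some_file.jsonl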

efiop commented 1 year ago

@jbogar Btw, could you try downgrading s3fs to 2022.11.0 and see if that makes it work again? I'm not able to reproduce so far.
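
If dvc came from pip, the downgrade would be something like the line below; with a brew install it would have to go into whatever Python environment brew's dvc formula uses (an assumption, not verified here):

pip install "s3fs==2022.11.0"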