iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

DVC fetch saves items to a different cache than the dvc checkout points to #10453

Open Arob113 opened 3 weeks ago

Arob113 commented 3 weeks ago

Bug Report

  1. dvc pull -r "remote" all.csv
  2. A directory named after the first two characters of the md5, containing the cache file, shows up in SOURCE_DIR/.dvc/cache
  3. 2024-06-07 14:20:47,840 DEBUG: failed to create 'FILEPATH/all.csv' from 'SOURCE_DIR/.dvc/cache/files/md5/7e/643e15408257a8a04befacb2320ecd' - [Errno 2] No such file or directory: 'SOURCE_DIR/.dvc/cache/files/md5/7e/643e15408257a8a04befacb2320ecd'

Expected

  1. dvc pull -r "remote" all.csv
  2. A directory named after the first two characters of the md5, containing the cache file, shows up in SOURCE_DIR/.dvc/cache/files/md5, and the file gets added to my repo
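
For reference, the two cache locations this report contrasts can be sketched in Python. This is only an illustration: it assumes the fetched object lands at .dvc/cache/XX/... while the checkout looks under .dvc/cache/files/md5/XX/..., as the log line above suggests. The function names are hypothetical and are not part of the DVC API.

```python
from pathlib import PurePosixPath


def legacy_cache_path(cache_dir: str, md5: str) -> PurePosixPath:
    """Older layout (assumed): <cache>/<first 2 chars>/<remaining chars>."""
    return PurePosixPath(cache_dir) / md5[:2] / md5[2:]


def v3_cache_path(cache_dir: str, md5: str) -> PurePosixPath:
    """Layout seen in the log: <cache>/files/md5/<first 2 chars>/<remaining chars>."""
    return PurePosixPath(cache_dir) / "files" / "md5" / md5[:2] / md5[2:]


md5 = "7e643e15408257a8a04befacb2320ecd"  # hash taken from the log line in this report
print(legacy_cache_path("SOURCE_DIR/.dvc/cache", md5))
print(v3_cache_path("SOURCE_DIR/.dvc/cache", md5))
```

If the object exists only at the first path, a checkout that reads from the second path would fail with the "No such file or directory" error shown in the log.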

Environment information

DVC version: 3.50.0 (pip)

Platform: Python 3.10.9 on Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Subprojects:
    dvc_data = 3.15.1
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.4.0
    scmrepo = 3.3.5
Supports:
    http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2024.5.0, boto3 = 1.34.106)

Additional Information (if any):

shcheklein commented 3 weeks ago

I'm not sure I understand the description. @Arob113, could you please explain it in more detail and/or suggest a better title for the issue?

A directory named after the first two characters of the md5, containing the cache file, shows up in SOURCE_DIR/.dvc/cache

What does this mean? Why is it a problem?

Arob113 commented 2 weeks ago

@shcheklein, this means that it fetches the file to the cache but doesn't add the file to my repo. I would need to manually move the cache dir and re-pull.

shcheklein commented 2 weeks ago

I would need to manually move the cache dir and re-pull.

Can you share the exact command, please?

Arob113 commented 2 weeks ago

dvc pull -r aws-legacy all.csv

shcheklein commented 2 weeks ago

dvc pull -r aws-legacy all.csv

@Arob113, which part of this command manually moves the cache dir? Could you please share all the steps / commands / details?

Arob113 commented 2 weeks ago

@shcheklein, I run that pull command and it saves the cache file in .dvc/cache/XX/.... I then manually copy that cache file and its folder into .dvc/cache/files/md5 (resulting in .dvc/cache/files/md5/XX/...) and rerun the pull, and the proper csv is checked out.
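
The manual step described in this comment could be sketched in Python roughly as follows. This is a hedged illustration of the workaround only, not a recommended fix and not DVC's own migration logic; `copy_legacy_object` is a hypothetical helper, and the two directory layouts are assumed from the paths quoted in this thread.

```python
import shutil
from pathlib import Path


def copy_legacy_object(cache_dir: Path, md5: str) -> Path:
    """Copy one object from <cache>/XX/... to <cache>/files/md5/XX/... (hypothetical helper)."""
    src = cache_dir / md5[:2] / md5[2:]
    dst = cache_dir / "files" / "md5" / md5[:2] / md5[2:]
    if not src.is_file():
        raise FileNotFoundError(f"no cache object at {src}")
    dst.parent.mkdir(parents=True, exist_ok=True)
    if not dst.exists():
        # Copy rather than move, so the originally fetched object stays in place.
        shutil.copy2(src, dst)
    return dst
```

For the file in this report, the call would be `copy_legacy_object(Path(".dvc/cache"), "7e643e15408257a8a04befacb2320ecd")`, followed by rerunning the pull.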

shcheklein commented 2 weeks ago

@Arob113 did you use DVC 2 before? E.g. did you push into aws-legacy with DVC 2? Can you also share the content of the all.csv.dvc (at least the structure of it)?

Arob113 commented 2 weeks ago

Some of the affected files were in DVC 2.xx before, but others were only ever saved with DVC 3. all.csv specifically was originally DVC 2. I have tried the local cache migration fix, but that didn't seem to work either.

all.csv.dvc:

outs:
- hash: md5
  md5: 7e643e15408257a8a04befacb2320ecd
  path: all.csv
  size: 322629732
  cloud:
    aws-legacy:
      etag: caae4444654e35810c55e708d31a7304-7
      version_id: rVDuvJLrfq.iIsfiET25izm3901gzwtf

shcheklein commented 2 weeks ago

The file structure is already DVC 3.0. Is the bug you described happening with this file, or with the previous version of it (DVC 2)? Was cloud versioning enabled for the remote storage before?

Did you run the migration with --dvc-files?

(I'm still trying to understand the full picture to reproduce this, otherwise it's quite hard to guess / understand what is happening)