iterative / dvc

πŸ¦‰ Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

dvc pull/fetch: corrupted cache with GDrive #10525

Open ermolaev94 opened 2 months ago

ermolaev94 commented 2 months ago

Bug Report

Description

I've run into corrupted files pulled from a GDrive remote. The corruption is not reliably reproducible: on one machine it happens, while on another it doesn't. I'll try to describe it in more detail below.

File Description

The artifact is a folder consisting of 4 files, and DVC fails to download only one of them.

Error Scenario β„–1

$ ssh <server>
$ cd /path/to/repo
$ git checkout develop
$ dvc pull ribs/data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 --no-run-cache -v                                     
2024-08-15 12:49:02,036 DEBUG: v3.53.2 (pip), CPython 3.10.12 on Linux-5.15.0-71-generic-x86_64-with-glibc2.35
2024-08-15 12:49:02,037 DEBUG: command: /home/ermolaev/projects/radml/venv/bin/dvc pull ../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 --no-run-cache -v
2024-08-15 12:49:02,481 DEBUG: Lockfile '../05_unf_dt/dvc.lock' needs to be updated.
2024-08-15 12:49:02,773 DEBUG: Lockfile for '../06_cage_sgm/dvc.yaml' not found
2024-08-15 12:49:04,241 DEBUG: Checking if stage '../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5' is in 'dvc.yaml'
Collecting                                                                                                                        |2.00 [00:00, 4.05entry/s]
2024-08-15 12:49:07,769 DEBUG: Preparing to transfer data from 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5' to '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-15 12:49:07,770 DEBUG: Preparing to collect status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-15 12:49:07,770 DEBUG: Collecting status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-15 12:49:07,775 DEBUG: Preparing to collect status from '1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5'
2024-08-15 12:49:07,775 DEBUG: Collecting status from '1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5'                                                         
2024-08-15 12:49:07,783 DEBUG: Querying 19 oids via object_exists
2024-08-15 12:49:11,906 DEBUG: Querying 0 oids via object_exists              
Fetching                               
  0%|          |Fetching from gdrive                                                                                              0/1 [00:00<?,     ?file/s]
  1%|▏         |1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5/0e/d51826c555065daf92319e2c7a56d2                          100M/7.31G [00:02<03:13,    40.0MB/s]
Computing md5 for a large file '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/0e/d51826c555065daf92319e2c7a56d2'. This is only done once.                                                                                                                                                        
2024-08-15 12:52:16,408 DEBUG: corrupted cache file '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/0e/d51826c555065daf92319e2c7a56d2'.                                                                                                                                                           
2024-08-15 12:52:16,408 DEBUG: Removing '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5/0e/d51826c555065daf92319e2c7a56d2'            
2024-08-15 12:52:17,292 DEBUG: Preparing to transfer data from 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL' to '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'                                                                                                                                             
2024-08-15 12:52:17,292 DEBUG: Preparing to collect status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'
2024-08-15 12:52:17,292 DEBUG: Collecting status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'
Fetching
2024-08-15 12:52:17,805 DEBUG: Lockfile for '../06_cage_sgm/dvc.yaml' not found
2024-08-15 12:52:19,269 DEBUG: Checking if stage '../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5' is in 'dvc.yaml'
Building workspace index                                                                                                          |7.00 [00:00, 7.96entry/s]
Comparing indexes                                                                                                                |8.00 [00:00, 1.14kentry/s]
2024-08-15 12:52:20,318 DEBUG: Removing '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/ribs/data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5'
Applying changes                                                                                                                  |1.00 [00:00, 1.14kfile/s]
M       ../../data/full_datasets/fractures_1123_seg/h5-corrected/test/
1 file modified and 1 file fetched
2024-08-15 12:52:20,358 DEBUG: Analytics is enabled.
2024-08-15 12:52:20,441 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpvgz885uf', '-v']
2024-08-15 12:52:20,449 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpvgz885uf', '-v'] with pid 5035

So DVC detected corrupted data and only reported it in the debug output: no error, no warning. I saw this message only because I ran the command with the -v flag. At first I assumed the data really was corrupted, i.e. that the cached copy on the remote did not match its hash.

Manual Cache Check

I decided to check what the md5 of the file reported as corrupted actually is. I navigated to "1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5/0e/d51826c555065daf92319e2c7a56d2" in Google Drive and downloaded it.

[screenshot: the cache file in the Google Drive web UI]

$ md5sum d51826c555065daf92319e2c7a56d2 
0ed51826c555065daf92319e2c7a56d2  d51826c555065daf92319e2c7a56d2

And the copy of the file in the remote cache is fine: its md5 matches the expected hash.
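The same kind of check applies to the local cache, since DVC encodes the expected digest in the object path (the two-character directory name plus the file name under files/md5). A minimal sketch using the paths from the log above:

# Re-hash the cached object; the digest should equal "0e" + the file name.
$ md5sum .dvc/cache/files/md5/0e/d51826c555065daf92319e2c7a56d2

As far as I understand, a mismatch in this check is what DVC reports as a "corrupted cache file" in the debug output.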

Using my own account instead of the service account

I noticed that the owner is not the same for the files that were pulled and the file that wasn't. This is because I switched to a service account several months ago. I tried the old auth flow, authenticating on the server by forwarding port 8080, but Google blocks it (#10516):

[screenshot: Google's "app is blocked" page]

Error Scenario β„–2

I decided to pull the same data on my local machine, and there the pull succeeds without errors:

$ git clone <...>
$ git checkout develop
$ dvc pull ribs/data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 -v --no-run-cache
2024-08-15 13:13:47,731 DEBUG: v3.53.1 (pip), CPython 3.10.14 on Linux-6.8.0-31-generic-x86_64-with-glibc2.39
2024-08-15 13:13:47,731 DEBUG: command: /home/ermolaev/projects/radml/venv/bin/dvc pull ribs/data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 -v --no-run-cache
2024-08-15 13:13:48,649 DEBUG: Lockfile for 'ribs/pipelines/06_cage_sgm/dvc.yaml' not found
2024-08-15 13:13:48,678 DEBUG: Lockfile 'ribs/pipelines/05_unf_dt/dvc.lock' needs to be updated.
2024-08-15 13:13:48,836 DEBUG: Checking if stage 'ribs/data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5' is in 'dvc.yaml'
2024-08-15 13:13:49,609 DEBUG: failed to load ('ribs', 'data', 'full_datasets', 'fractures_1123_seg', 'h5-corrected', 'test') from storage local (/tmp/cvl-cvisionrad-ml/.dvc/cache/files/md5) - [Errno 2] No such file or directory: '/tmp/cvl-cvisionrad-ml/.dvc/cache/files/md5/68/cfceac0911615ed2e552b1e52c0eaa.dir'
Traceback (most recent call last):
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 611, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_data/index/index.py", line 547, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 324, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 131, in open
    return open(path, mode=mode, encoding=encoding)  # noqa: SIM115
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/cvl-cvisionrad-ml/.dvc/cache/files/md5/68/cfceac0911615ed2e552b1e52c0eaa.dir'

Collecting                                                                                                                        |2.00 [00:04, 2.24s/entry]
2024-08-15 13:13:54,091 DEBUG: Preparing to transfer data from 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5' to '/tmp/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-15 13:13:54,091 DEBUG: Preparing to collect status from '/tmp/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-15 13:13:54,091 DEBUG: Collecting status from '/tmp/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-15 13:13:54,093 DEBUG: Preparing to collect status from '1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5'
2024-08-15 13:13:54,093 DEBUG: Collecting status from '1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5'                                                         
2024-08-15 13:13:54,093 DEBUG: Querying 1 oids via object_exists
2024-08-15 13:13:57,453 DEBUG: Indexing new .dir '68cfceac0911615ed2e552b1e52c0eaa.dir' with '4' nested files                                               
2024-08-15 13:13:59,302 DEBUG: transfer dir: md5: 68cfceac0911615ed2e552b1e52c0eaa.dir with 1 files                                                         
Fetching                               
  0%|          |Fetching from gdrive                                                                                              0/1 [00:01<?,     ?file/s]
  4%|▍         |1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL/files/md5/0e/d51826c555065daf92319e2c7a56d2                          300M/7.31G [00:10<04:11,    29.9MB/s]
Computing md5 for a large file '/tmp/cvl-cvisionrad-ml/.dvc/cache/files/md5/0e/d51826c555065daf92319e2c7a56d2'. This is only done once.                     
2024-08-15 13:18:56,911 DEBUG: Preparing to transfer data from 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL' to '/tmp/cvl-cvisionrad-ml/.dvc/cache'          
2024-08-15 13:18:56,912 DEBUG: Preparing to collect status from '/tmp/cvl-cvisionrad-ml/.dvc/cache'                                                         
2024-08-15 13:18:56,913 DEBUG: Collecting status from '/tmp/cvl-cvisionrad-ml/.dvc/cache'                                                                   
Fetching
2024-08-15 13:18:57,664 DEBUG: Lockfile for 'ribs/pipelines/06_cage_sgm/dvc.yaml' not found
2024-08-15 13:18:57,859 DEBUG: Checking if stage 'ribs/data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5' is in 'dvc.yaml'
Building workspace index                                                                                                         |5.00 [00:00, 12.9kentry/s]
Comparing indexes                                                                                                                 |8.00 [00:00,  627entry/s]
Applying changes                                                                                                                  |1.00 [00:00,   550file/s]
A       ribs/data/full_datasets/fractures_1123_seg/h5-corrected/test/
1 file added and 2 files fetched
2024-08-15 13:18:58,499 DEBUG: Analytics is enabled.
2024-08-15 13:18:58,533 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp3cojpnij', '-v']
2024-08-15 13:18:58,540 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp3cojpnij', '-v'] with pid 173652

Error Scenario β„–3

I also tried one more approach, on the machine that wasn't able to download this file correctly before:

$ dvc get ../../../ ../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 -v
2024-08-15 13:15:32,242 DEBUG: v3.53.2 (pip), CPython 3.10.12 on Linux-5.15.0-71-generic-x86_64-with-glibc2.35
2024-08-15 13:15:32,242 DEBUG: command: /home/ermolaev/projects/radml/venv/bin/dvc get ../../../ ../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 -v
2024-08-15 13:15:32,696 DEBUG: Lockfile '../05_unf_dt/dvc.lock' needs to be updated.
2024-08-15 13:15:32,989 DEBUG: Lockfile for '../06_cage_sgm/dvc.yaml' not found
2024-08-15 13:19:55,415 DEBUG: Analytics is enabled.                                                                                                        
2024-08-15 13:19:55,482 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp_5_14ube', '-v']                                                            
2024-08-15 13:19:55,489 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp_5_14ube', '-v'] with pid 5162
(venv) ermolaev@ed943d457e9c:~/projects/radml/cvl-cvisionrad-ml/ribs/pipelines/01_ds_gen_and_analysis$ l
bin.h5  correction.json  dvc.lock  dvc.yaml  params.yaml  README.md
(venv) ermolaev@ed943d457e9c:~/projects/radml/cvl-cvisionrad-ml/ribs/pipelines/01_ds_gen_and_analysis$ md5sum bin.h5 
b053f9713f406497bbe6881d926718f3  bin.h5

The same command, on the same rev, with the same dependencies, but on the other machine:

$ dvc get . ribs/data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 -v
2024-08-15 13:25:02,189 DEBUG: v3.53.1 (pip), CPython 3.10.14 on Linux-6.8.0-31-generic-x86_64-with-glibc2.39
2024-08-15 13:25:02,189 DEBUG: command: /home/ermolaev/projects/radml/venv/bin/dvc get . ribs/data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 -v
2024-08-15 13:25:03,052 DEBUG: Lockfile for 'ribs/pipelines/06_cage_sgm/dvc.yaml' not found
2024-08-15 13:25:03,081 DEBUG: Lockfile 'ribs/pipelines/05_unf_dt/dvc.lock' needs to be updated.
2024-08-15 13:29:30,956 DEBUG: Analytics is enabled.                                                                                                        
2024-08-15 13:29:30,984 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp96wu6zyl', '-v']                                                            
2024-08-15 13:29:30,990 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp96wu6zyl', '-v'] with pid 174332
(venv) ermolaev@alamak:/tmp/cvl-cvisionrad-ml$ md5sum bin.h5 
0ed51826c555065daf92319e2c7a56d2  bin.h5

Now I can see the file and its hash, and for some reason it differs from the one on the first machine.

Comparing file contents

I compared the actual array data and found 3 places with byte differences, which is why the files and their hashes do not match exactly. Apart from those bytes the content is identical. It's not clear why this happens.
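To locate the byte differences without custom tooling, cmp is enough; a minimal sketch, where the file names are placeholders for the corrupted and correct copies:

# cmp -l prints, for every mismatch, the 1-based byte offset and the two
# differing byte values (in octal).
$ cmp -l bin_corrupted.h5 bin_correct.h5

With only a few differing bytes, the output is short enough to check by eye whether the offsets fall near any obvious block boundary.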

RClone

I tried downloading the same file using rclone to rule out a system-level error, i.e. Ubuntu changing some bytes while flushing data to disk.

$ rclone copy -P 'radml:/Y_DVC_CACHE_(DO_NOT_MODIFY!!!)/files/md5/0e/d51826c555065daf92319e2c7a56d2' .                                
Transferred:        4.567 GiB / 7.313 GiB, 62%, 42.477 MiB/s, ETA 1m6s
Transferred:            0 / 1, 0%
Elapsed time:      1m17.3s
Transferring:
 *                d51826c555065daf92319e2c7a56d2: 62% /7.313Gi, 42.418Mi/s, 1m6s
$ md5sum d51826c555065daf92319e2c7a56d2 
0ed51826c555065daf92319e2c7a56d2  d51826c555065daf92319e2c7a56d2

No error via rclone, i.e. the file hash is correct.

S3

I also tried downloading the same file from the S3 remote that I have configured:

$ dvc pull ../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 --no-run-cache -v -r yadrive
2024-08-15 14:16:56,230 DEBUG: v3.53.2 (pip), CPython 3.10.12 on Linux-5.15.0-71-generic-x86_64-with-glibc2.35
2024-08-15 14:16:56,230 DEBUG: command: /home/ermolaev/projects/radml/venv/bin/dvc pull ../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 --no-run-cache -v -r yadrive
2024-08-15 14:16:56,685 DEBUG: Lockfile '../05_unf_dt/dvc.lock' needs to be updated.
2024-08-15 14:16:56,977 DEBUG: Lockfile for '../06_cage_sgm/dvc.yaml' not found
2024-08-15 14:16:58,429 DEBUG: Checking if stage '../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5' is in 'dvc.yaml'
Collecting                                                                                                                        |2.00 [00:00, 4.33entry/s]
2024-08-15 14:17:00,671 DEBUG: Preparing to transfer data from 's3://cvisionrad-ml-data/files/md5' to '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-15 14:17:00,671 DEBUG: Preparing to collect status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-15 14:17:00,671 DEBUG: Collecting status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache/files/md5'
2024-08-15 14:17:00,674 DEBUG: Preparing to collect status from 'cvisionrad-ml-data/files/md5'
2024-08-15 14:17:00,674 DEBUG: Collecting status from 'cvisionrad-ml-data/files/md5'                                                                        
2024-08-15 14:17:00,674 DEBUG: Querying 1 oids via object_exists
2024-08-15 14:17:00,868 DEBUG: Indexing new .dir '68cfceac0911615ed2e552b1e52c0eaa.dir' with '4' nested files                                               
Fetching                               
  0%|          |Fetching from s3                                                                                                  0/1 [00:00<?,     ?file/s]
  1%|          |cvisionrad-ml-data/files/md5/0e/d51826c555065daf92319e2c7a56d2                                        69.4M/7.31G [00:08<19:40,    6.59MB/s]
2024-08-15 14:19:51,356 DEBUG: Preparing to transfer data from 's3://cvisionrad-ml-data' to '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'    
2024-08-15 14:19:51,357 DEBUG: Preparing to collect status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'                                
2024-08-15 14:19:51,357 DEBUG: Collecting status from '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/.dvc/cache'                                          
Fetching
2024-08-15 14:19:51,928 DEBUG: Lockfile for '../06_cage_sgm/dvc.yaml' not found
2024-08-15 14:19:53,406 DEBUG: Checking if stage '../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5' is in 'dvc.yaml'
Computing md5 for a large file '/home/ermolaev/projects/radml/cvl-cvisionrad-ml/ribs/data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5'. This is only done once.
Building workspace index                                                                                                          |7.00 [00:11, 1.67s/entry]
Comparing indexes                                                                                                                |8.00 [00:00, 1.14kentry/s]
Applying changes                                                                                                                  |0.00 [00:00,     ?file/s]
1 file fetched
2024-08-15 14:20:05,358 DEBUG: Analytics is enabled.
2024-08-15 14:20:05,449 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpx2_m5qv7', '-v']
2024-08-15 14:20:05,456 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpx2_m5qv7', '-v'] with pid 5487

$ md5sum ../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5 
0ed51826c555065daf92319e2c7a56d2  ../../data/full_datasets/fractures_1123_seg/h5-corrected/test/bin.h5

We can see that the file hash is still correct, so the error happens only with GDrive, even though GDrive itself stores the correct data.

Conclusions

I see one of two possible errors:

Reproduce

I don't know how to reproduce this error. In more than 2 years of usage I have hit it only 3 times, and I don't see a reliable scenario for triggering it intentionally.

Expected

The hash of the pulled file must match the hash stored in DVC, i.e. the download must not alter any bytes.

Environment information

Ubuntu 22.04

Output of dvc doctor:

DVC version: 3.53.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.15.0-71-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.6
Supports:
        gdrive (pydrive2 = 1.20.0),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.6.1, boto3 = 1.34.131)
Config:
        Global: /home/ermolaev/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/cvg-home
Caches: local
Remotes: gdrive, gdrive, gdrive, s3
Workspace directory: ext4 on /dev/mapper/cvg-home
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/35b0ceeafaa400653846b67148221777

Additional Information (if any):

Setting "verify = false" helps, data is correct.

shcheklein commented 2 months ago

So, just to make sure I understand this correctly. DVC is catching those errors from time to time and self-corrects (by deleting those broken files). Is that correct? So there is no urgent issue in this, right? I think that's why we still keep verify true for the Google Drive remote. It would be great to understand why it is happening though, I agree.

ermolaev94 commented 2 months ago

So, just to make sure I understand this correctly. DVC is catching those errors from time to time and self-corrects (by deleting those broken files). Is that correct? So there is no urgent issue in this, right? I think that's why we still keep verify true for the Google Drive remote. It would be great to understand why it is happening though, I agree.

Not exactly. The file in GDrive has the correct md5, but DVC's download path corrupts some of its bytes, so the downloaded file ends up with a different hash. Moreover, this behavior is neither fully deterministic nor fully random: on the same machine a file may download fine in one docker run and fail in another docker run with the same image. One more interesting detail: if some bytes get corrupted during a download, the corrupted bytes are exactly the same no matter how many times I repeat the download (with the cache cleaned each time).

So I suspect the error may be in dvc-gdrive, pydrive2, or somewhere else; maybe it makes sense to move this issue to one of those repos. I'd also like to suggest emitting a warning: currently DVC silently removes the broken file from the cache without any message if the command is run without -v. Also, the symlink is not removed and becomes dangling. I think it would be better to log a warning and remove everything, both from the cache and from the target folder.

shcheklein commented 2 months ago

Also, the symlink is not removed and becomes dangling. I think it would be better to log a warning and remove everything, both from the cache and from the target folder.

yep, that's a bug (but I think it will be fixed soon, or may already be fixed in the most recent release)

I agree that we should have a better message in such cases.

It's good, though, that it still isn't silently corrupting your data: it detects and removes the broken file.

I suspect the error may be in dvc-gdrive, pydrive2, or somewhere else

yes, it seems to be on the Google Drive API or driver level.

shcheklein commented 2 months ago

And if you run rclone, are you getting any issues at all? Can you try running it from different machines, a few times?

Also, if those are the same bytes, can you share their locations within the file, just to see if there is a pattern (a block boundary or something)? Is the file size the same (does it just replace those bytes in place)? And what is the replacement then?

ermolaev94 commented 2 months ago

And if you run rclone, are you getting any issues at all? Can you try running it from different machines, a few times?

OK, I will try to run such a stress test. I tried it manually 2-5 times and the hash was OK.
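A rough sketch of such a stress test, reusing the rclone remote and object path from above (the iteration count is arbitrary):

# Download the same object 10 times into separate directories and record each copy's hash.
$ for i in $(seq 1 10); do rclone copy 'radml:/Y_DVC_CACHE_(DO_NOT_MODIFY!!!)/files/md5/0e/d51826c555065daf92319e2c7a56d2' "run_$i"; md5sum "run_$i/d51826c555065daf92319e2c7a56d2"; done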

Also, if those are the same bytes, can you share their locations within the file, just to see if there is a pattern (a block boundary or something)? Is the file size the same (does it just replace those bytes in place)? And what is the replacement then?

It's very interesting. This file is an HDF5 file with two arrays. The first is a fixed-length array, and it is always the same even when the data is corrupted. The second is a variable-length array with "uint8" data type that stores raw bytes; variable-length data in HDF5 is stored as a separate byte chunk per array entry. 3 out of 80 cases had errors, and in each case the corrupted byte differed from the correct one by 16, 128, or 240; at least, that's what I observed. Those bytes are PNG images, and after decoding them I see small highlights in some pixels when the data is corrupted. The array size looks the same, although I didn't check in detail.

I can share both files - corrupted and correct.

shcheklein commented 2 months ago

I can share both files - corrupted and correct.

That would be amazing, thanks. And please share any extra examples of this corruption you come across (for this file or other files).

Muhammad371995 commented 1 month ago

I am facing the same issue, and in my case it's reproducible. I used to pull data without any problem a couple of months ago and haven't pulled since. Now, when I try to pull data from Google Drive, I get "app is blocked" when I authenticate with my email and corrupted files when I use a service account.

2024-09-29 22:32:51,134 DEBUG: v3.55.2 (snap), CPython 3.9.5 on Linux-6.8.0-45-generic-x86_64-with-glibc2.31
2024-09-29 22:32:51,135 DEBUG: command: /snap/dvc/1479/bin/dvc pull --verbose
Collecting                                                                                                        |1.78k [00:01, 1.32kentry/s]
2024-09-29 22:32:53,630 DEBUG: Preparing to transfer data from 'gdrive://17FkMqUHt409wubemSgT1gwc1OD0UI8rZ' to '/home/ai1/imaging/vascular-ai/.dvc/cache'
2024-09-29 22:32:53,631 DEBUG: Preparing to collect status from '/home/ai1/imaging/vascular-ai/.dvc/cache'
2024-09-29 22:32:53,632 DEBUG: Collecting status from '/home/ai1/imaging/vascular-ai/.dvc/cache'
2024-09-29 22:32:53,657 DEBUG: Preparing to collect status from '17FkMqUHt409wubemSgT1gwc1OD0UI8rZ'
2024-09-29 22:32:53,658 DEBUG: Collecting status from '17FkMqUHt409wubemSgT1gwc1OD0UI8rZ'                                                     
2024-09-29 22:32:53,699 DEBUG: Querying 5 oids via object_exists
2024-09-29 22:32:55,593 DEBUG: Querying 0 oids via object_exists                                                                              
2024-09-29 22:34:17,838 DEBUG: corrupted cache file '/home/ai1/imaging/vascular-ai/.dvc/cache/25/e96f0a45d94508479caa8cf00d3bf9'.             
2024-09-29 22:34:17,838 DEBUG: Removing '/home/ai1/imaging/vascular-ai/.dvc/cache/25/e96f0a45d94508479caa8cf00d3bf9'                          
2024-09-29 22:34:18,032 DEBUG: corrupted cache file '/home/ai1/imaging/vascular-ai/.dvc/cache/40/a9b6a690043e29e0d889bec0eab952'.             
2024-09-29 22:34:18,033 DEBUG: Removing '/home/ai1/imaging/vascular-ai/.dvc/cache/40/a9b6a690043e29e0d889bec0eab952'                          
2024-09-29 22:34:18,166 DEBUG: corrupted cache file '/home/ai1/imaging/vascular-ai/.dvc/cache/51/680bc314a7436af5e91fb66fe7889c'.             
2024-09-29 22:34:18,167 DEBUG: Removing '/home/ai1/imaging/vascular-ai/.dvc/cache/51/680bc314a7436af5e91fb66fe7889c'                          
2024-09-29 22:34:18,230 DEBUG: corrupted cache file '/home/ai1/imaging/vascular-ai/.dvc/cache/fd/2e4448e27e6d3b0c0f35beff6e39a6'.             
2024-09-29 22:34:18,230 DEBUG: Removing '/home/ai1/imaging/vascular-ai/.dvc/cache/fd/2e4448e27e6d3b0c0f35beff6e39a6'                          
Fetching                                                                                                                                      
Building workspace index                                                                                            |20.0 [00:00,  576entry/s]
Comparing indexes                                                                                                 |1.78k [00:01, 1.76kentry/s]
2024-09-29 22:36:52,533 DEBUG: failed to create '/home/ai1/imaging/vascular-ai/data/input/raw/345691/SlicesRIL.12' from '/home/ai1/imaging/vascular-ai/.dvc/cache/25/e96f0a45d94508479caa8cf00d3bf9' - [Errno 2] No such file or directory: '/home/ai1/imaging/vascular-ai/.dvc/cache/25/e96f0a45d94508479caa8cf00d3bf9'
Traceback (most recent call last):
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 151, in _put_one
    return to_fs.put_file(
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 635, in put_file
    self.fs.put_file(os.fspath(from_file), to_info, callback=callback, **kwargs)
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/local.py", line 91, in put_file
    copyfile(lpath, tmp_file, callback=callback)
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/utils.py", line 168, in copyfile
    total = os.path.getsize(src)
  File "/snap/dvc/1479/bin/../usr/lib/python3.9/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/home/ai1/imaging/vascular-ai/.dvc/cache/25/e96f0a45d94508479caa8cf00d3bf9'

2024-09-29 22:36:52,537 DEBUG: failed to create '/home/ai1/imaging/vascular-ai/data/input/raw/345691/SlicesLIL.12' from '/home/ai1/imaging/vascular-ai/.dvc/cache/51/680bc314a7436af5e91fb66fe7889c' - [Errno 2] No such file or directory: '/home/ai1/imaging/vascular-ai/.dvc/cache/51/680bc314a7436af5e91fb66fe7889c'
Traceback (most recent call last):
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 151, in _put_one
    return to_fs.put_file(
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 635, in put_file
    self.fs.put_file(os.fspath(from_file), to_info, callback=callback, **kwargs)
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/local.py", line 91, in put_file
    copyfile(lpath, tmp_file, callback=callback)
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/utils.py", line 168, in copyfile
    total = os.path.getsize(src)
  File "/snap/dvc/1479/bin/../usr/lib/python3.9/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/home/ai1/imaging/vascular-ai/.dvc/cache/51/680bc314a7436af5e91fb66fe7889c'

2024-09-29 22:36:52,538 DEBUG: failed to create '/home/ai1/imaging/vascular-ai/data/input/raw/345691/SlicesAO.12' from '/home/ai1/imaging/vascular-ai/.dvc/cache/fd/2e4448e27e6d3b0c0f35beff6e39a6' - [Errno 2] No such file or directory: '/home/ai1/imaging/vascular-ai/.dvc/cache/fd/2e4448e27e6d3b0c0f35beff6e39a6'
Traceback (most recent call last):
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 151, in _put_one
    return to_fs.put_file(
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 635, in put_file
    self.fs.put_file(os.fspath(from_file), to_info, callback=callback, **kwargs)
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/local.py", line 91, in put_file
    copyfile(lpath, tmp_file, callback=callback)
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/utils.py", line 168, in copyfile
    total = os.path.getsize(src)
  File "/snap/dvc/1479/bin/../usr/lib/python3.9/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/home/ai1/imaging/vascular-ai/.dvc/cache/fd/2e4448e27e6d3b0c0f35beff6e39a6'

2024-09-29 22:36:52,540 DEBUG: failed to create '/home/ai1/imaging/vascular-ai/data/input/raw/345691/SlicesAxial.12' from '/home/ai1/imaging/vascular-ai/.dvc/cache/40/a9b6a690043e29e0d889bec0eab952' - [Errno 2] No such file or directory: '/home/ai1/imaging/vascular-ai/.dvc/cache/40/a9b6a690043e29e0d889bec0eab952'
Traceback (most recent call last):
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 151, in _put_one
    return to_fs.put_file(
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 635, in put_file
    self.fs.put_file(os.fspath(from_file), to_info, callback=callback, **kwargs)
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/local.py", line 91, in put_file
    copyfile(lpath, tmp_file, callback=callback)
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc_objects/fs/utils.py", line 168, in copyfile
    total = os.path.getsize(src)
  File "/snap/dvc/1479/bin/../usr/lib/python3.9/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/home/ai1/imaging/vascular-ai/.dvc/cache/40/a9b6a690043e29e0d889bec0eab952'

Applying changes                                                                                                   |1.61k [02:37,  10.2file/s]
2024-09-29 22:36:56,997 DEBUG: Removing '/home/ai1/imaging/vascular-ai/data'
4 files fetched
2024-09-29 22:37:00,333 ERROR: failed to pull data from the cloud - Checkout failed for following targets:
data
Is your cache up to date?
<https://error.dvc.org/missing-files>
Traceback (most recent call last):
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc/commands/data_sync.py", line 35, in run
    stats = self.repo.pull(
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc/repo/pull.py", line 42, in pull
    stats = self.checkout(
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/snap/dvc/1479/lib/python3.9/site-packages/dvc/repo/checkout.py", line 184, in checkout
    raise CheckoutError([relpath(out_path) for out_path in failed], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
data
Is your cache up to date?
<https://error.dvc.org/missing-files>

2024-09-29 22:37:00,373 DEBUG: Analytics is enabled.
2024-09-29 22:37:00,444 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp49z_b9n6', '-v']
2024-09-29 22:37:00,480 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp49z_b9n6', '-v'] with pid 60508

shcheklein commented 1 month ago

@Muhammad371995 this seems to be a different issue. Most likely your service account doesn't have access to the directory with the files. Each service account has an email address associated with it; try going to the Google Drive UI and explicitly allowing it to read (and write, if you need that) the remote storage folder.
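If it helps, the address to share the folder with is the client_email field of the service-account key file; a quick way to read it (the key path is a placeholder):

# Print the service account's e-mail address from its JSON key (requires jq).
$ jq -r .client_email path/to/gdrive-service-account.json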

Re the blocked app - please read this https://github.com/iterative/dvc/issues/10516

Muhammad371995 commented 1 month ago

DVC already succeeded in downloading the entire tracked folder into the cache except for those 4 files: it downloads them, then reports them as corrupted and eventually deletes them. To work around this I downloaded the 4 files manually, copied them to their respective locations in the cache, and then checked out the tracked folder successfully.
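Roughly, the per-file workaround looks like this; it is only a sketch, with the file name and hash taken from the log above and the legacy .dvc/cache/<xx>/<rest> layout that the log shows:

# Verify the independently downloaded file and place it at the cache path derived from its md5.
$ md5sum SlicesRIL.12                 # expect 25e96f0a45d94508479caa8cf00d3bf9
$ mkdir -p .dvc/cache/25
$ cp SlicesRIL.12 .dvc/cache/25/e96f0a45d94508479caa8cf00d3bf9
# Re-create the workspace copy from the now-complete cache.
$ dvc checkout data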

Actually, I am facing 2 different issues: the data corruption issue and the one you thankfully mentioned above, https://github.com/iterative/dvc/issues/10516.