iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

push: Some files are not pushed to aws s3 compatible storage #7303

Closed ipovalyaev closed 7 months ago

ipovalyaev commented 2 years ago

Description

Some files are not being pushed to AWS S3-compatible storage (DigitalOcean Spaces). No error is displayed, but the push does not actually happen. I could not find any useful information on how to check whether my files are actually in the bucket or how to locate them :( nor any other way to find out what is going on under the hood (other than studying the source code in the repo, which I have not gotten to).

Reproduce

I have a local remote, but want to share files with others through a DigitalOcean bucket.

  1. git clone https://github.com/lacmus-foundation/ladd-and-weights
  2. dvc pull -r local_remote
  3. dvc status -c -v -r lacmus_remote (this says Cache and remote 'local_remote' are in sync)
  4. dvc push -v -r digital_ocean (this shows no errors, but some files are not pushed; they seem to be the last directory in the listing below, the one with 6 nested files)

(I tried cleaning the bucket and setting dvc remote modify digital_ocean listobjects true, but neither changed anything.)

2022-01-24 15:11:50,636 DEBUG: Indexing pushed dir 'md5: 516af06e416d139268177166a6905ba4.dir' with '824' nested files                     
2022-01-24 15:11:50,878 DEBUG: Indexing pushed dir 'md5: e275b9d939bf83da86619b4a8a6642b3.dir' with '2709' nested files                    
2022-01-24 15:11:50,970 DEBUG: Indexing pushed dir 'md5: c128b994175a5c93301469712a71dd52.dir' with '31' nested files                      
2022-01-24 15:11:50,972 DEBUG: Indexing pushed dir 'md5: fcec47e1f47b82c26b6eb451aa5a3cce.dir' with '718' nested files                     
2022-01-24 15:11:51,001 DEBUG: Indexing pushed dir 'md5: 0021915e68e016b080e43d08c11d9c86.dir' with '88' nested files                      
2022-01-24 15:11:51,008 DEBUG: Indexing pushed dir 'md5: c2469eee880536348fb2df3175b96c36.dir' with '3949' nested files                    
2022-01-24 15:11:51,147 DEBUG: Indexing pushed dir 'md5: 6031d51ac8d53b6c4c3216bd63c74f5a.dir' with '772' nested files                     
2022-01-24 15:11:51,176 DEBUG: Indexing pushed dir 'md5: 67f97371bfa58febaca0c6400a155a16.dir' with '266' nested files                     
2022-01-24 15:11:51,189 DEBUG: Indexing pushed dir 'md5: c3d98bc7be2cb8c80333528a1fe6874a.dir' with '538' nested files                     
2022-01-24 15:11:51,212 DEBUG: Indexing pushed dir 'md5: 14bc821c674e44e47f5bab402c595b6d.dir' with '6' nested files                       
9911 files pushed                                                                                                                          
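As a sanity check on the listing above, the per-directory counts can be added up and compared with the reported total. The sketch below is only an illustration: the helper name `summarize_push_log` and the regular expression are made up for this example, and the idea that each `.dir` manifest counts as one pushed file is an assumption, not something DVC documents here.

```python
import re

# Debug lines copied from the push log above (timestamps trimmed).
PUSH_LOG = """\
DEBUG: Indexing pushed dir 'md5: 516af06e416d139268177166a6905ba4.dir' with '824' nested files
DEBUG: Indexing pushed dir 'md5: e275b9d939bf83da86619b4a8a6642b3.dir' with '2709' nested files
DEBUG: Indexing pushed dir 'md5: c128b994175a5c93301469712a71dd52.dir' with '31' nested files
DEBUG: Indexing pushed dir 'md5: fcec47e1f47b82c26b6eb451aa5a3cce.dir' with '718' nested files
DEBUG: Indexing pushed dir 'md5: 0021915e68e016b080e43d08c11d9c86.dir' with '88' nested files
DEBUG: Indexing pushed dir 'md5: c2469eee880536348fb2df3175b96c36.dir' with '3949' nested files
DEBUG: Indexing pushed dir 'md5: 6031d51ac8d53b6c4c3216bd63c74f5a.dir' with '772' nested files
DEBUG: Indexing pushed dir 'md5: 67f97371bfa58febaca0c6400a155a16.dir' with '266' nested files
DEBUG: Indexing pushed dir 'md5: c3d98bc7be2cb8c80333528a1fe6874a.dir' with '538' nested files
DEBUG: Indexing pushed dir 'md5: 14bc821c674e44e47f5bab402c595b6d.dir' with '6' nested files
"""

# Hypothetical pattern written for this log format only.
DIR_LINE = re.compile(
    r"Indexing pushed dir 'md5: ([0-9a-f]{32})\.dir' with '(\d+)' nested files"
)


def summarize_push_log(log_text):
    """Return (number of .dir manifests, total nested files) found in the log."""
    matches = DIR_LINE.findall(log_text)
    return len(matches), sum(int(count) for _, count in matches)


dirs, nested = summarize_push_log(PUSH_LOG)
print(dirs, nested, dirs + nested)
```

For this log the helper finds 10 directory manifests and 9901 nested files, which together match the 9911 files reported as pushed. So, under the assumption above, the push log itself claims that everything it indexed was uploaded.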

Some files did not get there:

(dvc) proforg@proforg:/media/proforg/Data/lacmus/tmp/ladd-and-weights$ dvc status -c -r digital_ocean -v
2022-01-24 15:38:43,788 DEBUG: Adding '/media/proforg/Data/lacmus/tmp/ladd-and-weights/.dvc/config.local' to gitignore file.
2022-01-24 15:38:43,818 DEBUG: Adding '/media/proforg/Data/lacmus/tmp/ladd-and-weights/.dvc/tmp' to gitignore file.
2022-01-24 15:38:43,819 DEBUG: Adding '/media/proforg/Data/lacmus/tmp/ladd-and-weights/.dvc/cache' to gitignore file.
2022-01-24 15:38:43,888 DEBUG: Lockfile for 'dvc.yaml' not found      
2022-01-24 15:38:44,954 DEBUG: Preparing to collect status from 'dvc'
2022-01-24 15:38:44,987 DEBUG: Collecting status from 'dvc'
2022-01-24 15:38:45,227 DEBUG: Querying 10 hashes via object_exists
2022-01-24 15:38:45,467 DEBUG: Querying 0 hashes via object_exists                                                                         
2022-01-24 15:38:45,624 DEBUG: Estimated remote size: 4096 files                                                                           
2022-01-24 15:38:45,624 DEBUG: Querying '6' hashes via traverse                                                                            
2022-01-24 15:38:45,646 DEBUG: Preparing to collect status from '/media/proforg/Data/lacmus/tmp/ladd-and-weights/.dvc/cache'               
2022-01-24 15:38:45,655 DEBUG: Collecting status from '/media/proforg/Data/lacmus/tmp/ladd-and-weights/.dvc/cache'
    new:                weights/yolo5/yolo5_fullDS_native.pt
    new:                weights/yolo5/yolo5_fullDS_TF.pb
    new:                weights/torch/pretrain/resnet50_SDD.pth
    new:                weights/torch/experimental/resnet50_FRCNN_SDD_epoch_9.pth
    new:                weights/keras-retinanet/resnet50_liza_alert_prod.h5
    new:                weights/torch/experimental/resnet50_FRCNN_LADD_epoch_9.pth
2022-01-24 15:38:46,100 DEBUG: Analytics is enabled.

An attempt to fetch those files from another machine, or from a different folder on that machine, using

  1. git clone
  2. dvc remote modify digital_ocean credentialpath /media/proforg/Data/lacmus/.aws/credentials (same one was used for push)
  3. dvc pull

gives

WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:                                          
name: weights/yolo5/yolo5_fullDS_TF.pb, md5: 049171ca83a35de90b250967a6da45f8                                                              
name: weights/torch/experimental/resnet50_FRCNN_LADD_epoch_9.pth, md5: dfec2c7f61ef2752b0270c36a5601ac5
name: weights/yolo5/yolo5_fullDS_native.pt, md5: 5400e4d2a34d59b0deb9d7a1030decc8
name: weights/torch/pretrain/resnet50_SDD.pth, md5: 62d8d0def79df800be81831325a62ea6
name: weights/keras-retinanet/resnet50_liza_alert_prod.h5, md5: 0e0d5fa91b6b8f14a500c5ffc6eab70a
name: weights/torch/experimental/resnet50_FRCNN_SDD_epoch_9.pth, md5: f5887fdc93f17e1f4c2b204896b280dc
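One way to check by hand whether these objects are in the bucket is to work out the key each md5 should have and look it up with any S3 client. The sketch below assumes the layout visible in the failing URL quoted later in this thread (`.../dvc/cd/508e4b837c3c968372b679fd49f1ee` for md5 `cd508e4b837c3c968372b679fd49f1ee`), i.e. `<prefix>/<first two hex chars>/<remaining chars>`. The `dvc` prefix is taken from that URL, and the function names are hypothetical helpers, not part of DVC.

```python
import re

# Matches lines such as:
#   name: weights/yolo5/yolo5_fullDS_TF.pb, md5: 049171ca83a35de90b250967a6da45f8
MISSING_LINE = re.compile(r"name: (?P<name>.+), md5: (?P<md5>[0-9a-f]{32})")


def remote_key(md5, prefix="dvc"):
    """Build the object key for an md5, assuming a <prefix>/<2 chars>/<rest> layout."""
    return f"{prefix}/{md5[:2]}/{md5[2:]}"


def expected_keys(warning_text, prefix="dvc"):
    """Map each missing file name in the warning text to its assumed remote key."""
    return {
        m["name"]: remote_key(m["md5"], prefix)
        for m in MISSING_LINE.finditer(warning_text)
    }


WARNING_TEXT = """\
name: weights/yolo5/yolo5_fullDS_TF.pb, md5: 049171ca83a35de90b250967a6da45f8
name: weights/yolo5/yolo5_fullDS_native.pt, md5: 5400e4d2a34d59b0deb9d7a1030decc8
"""

for name, key in expected_keys(WARNING_TEXT).items():
    print(f"{name} -> {key}")
```

Each printed key can then be checked in the bucket, for example in the DigitalOcean Spaces web UI or with an S3 client of your choice. If the key is present while DVC still reports the file as missing, that would point at the existence check rather than the upload.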

Expected

All files are pushed and available for pull

Environment information

dvc doctor
DVC version: 2.9.3 (conda)
---------------------------------
Platform: Python 3.9.9 on Linux-5.4.0-96-generic-x86_64-with-glibc2.31
Supports:
    webhdfs (fsspec = 2021.11.1),
    http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
    https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
    s3 (s3fs = 2021.11.1, boto3 = 1.19.8),
    ssh (sshfs = 2021.11.2),
    webdav (webdav4 = 0.9.3),
    webdavs (webdav4 = 0.9.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/my_data
Caches: local
Remotes: local, s3
Workspace directory: ext4 on /dev/mapper/my_data
Repo: dvc, git

ipovalyaev commented 2 years ago

I applied a hack: modified base.py to False. After this hack, pushing from the original directory and pulling from a new one works, and the issue seems to be gone.

I suspect DigitalOcean does not properly support traverse, but I can't confirm it 100%. If that is the case, it would be nice to add an option to avoid traverse on a particular storage, and to add a note in the documentation where it refers to DigitalOcean as AWS-compatible.

NB (not relevant to the issue, but maybe worth mentioning in the docs?): DigitalOcean seems not to be 100% stable, but setting dvc remote modify digital_ocean connect_timeout 600 helped to get rid of errors like

ERROR: failed to transfer 'md5: cd508e4b837c3c968372b679fd49f1ee' - Could not connect to the endpoint URL: "https://lacmus-dvc.fra1.digitaloceanspaces.com/dvc/cd/508e4b837c3c968372b679fd49f1ee"

karajan1001 commented 2 years ago

@ipovalyaev Thanks for your report as well as the analysis of DigitalOcean.

ipovalyaev commented 2 years ago

Update: without the aforementioned patch, even dvc pull fails on a new installation, so pull does not work for anyone; the issue affects not only those who pushed.
This could be a regression of S3 issue https://github.com/iterative/dvc/issues/6691 from a previous version.

2022-01-27 10:31:04,000 DEBUG: Querying 10 hashes via object_exists
2022-01-27 10:31:04,176 DEBUG: Querying 0 hashes via object_exists                                                                            
2022-01-27 10:31:04,283 DEBUG: Estimated remote size: 4096 files                                                                              
2022-01-27 10:31:04,284 DEBUG: Querying '6' hashes via traverse                                                                               
2022-01-27 10:31:04,331 WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:                     
name: weights/keras-retinanet/resnet50_liza_alert_prod.h5, md5: 0e0d5fa91b6b8f14a500c5ffc6eab70a
name: weights/yolo5/yolo5_fullDS_TF.pb, md5: 049171ca83a35de90b250967a6da45f8
name: weights/yolo5/yolo5_fullDS_native.pt, md5: 5400e4d2a34d59b0deb9d7a1030decc8
name: weights/torch/pretrain/resnet50_SDD.pth, md5: 62d8d0def79df800be81831325a62ea6
name: weights/torch/experimental/resnet50_FRCNN_LADD_epoch_9.pth, md5: dfec2c7f61ef2752b0270c36a5601ac5
name: weights/torch/experimental/resnet50_FRCNN_SDD_epoch_9.pth, md5: f5887fdc93f17e1f4c2b204896b280dc