iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

dvc exp run --temp: Collecting files and computing hashes takes a lot of time #9823

Open lefos99 opened 1 year ago

lefos99 commented 1 year ago

Bug Report

Description

I have the following DVC structure (output of dvc dag)

+------------------------------+         +---------------------+         +--------------------------+         
| data/raw/image_infos.csv.dvc |         | data/raw/images.dvc |         | data/raw/annotations.dvc |         
+------------------------------+         +---------------------+ ********+--------------------------+         
                        ***               ***           *********                                             
                           ***         ***     *********                                                      
                              **     **   *****                                                               
                          +---------------+         +------------------------------------------------------+  
                          | generate_data |         | data/models/pretrained_model_template.state_dict.dvc |  
                          +---------------+*        +------------------------------------------------------+  
                                            ****                   ****                                       
                                                ****           ****                                           
                                                    **       **                                               
                                                    +-------+                                                 
                                                    | train |                                                 
                                                    +-------+   

and my dvc.yaml (simplified version) looks like this:

vars:
  - dvc_model:
      top_folder_dir: data/models/dvc_model

stages:
  generate_data:
    desc: 'Data generation'
    cmd: python3 code/data_processing/generate_data.py
    deps:
      - ${data_processing.paths.image_infos_csv}
      - ${data_processing.paths.raw_data_dir}/images
      - ${data_processing.paths.raw_data_dir}/annotations
      - code/scenario/classes.py
      - ../base/data_processing/generate_data.py
      - ../base/data_processing/annotations.py
      - ../base/data_processing/dataset.py
      - ../base/scenario/classes.py
    params:
      - data_processing.patch_extraction
      - data_processing.calculate_stats_for_helper_classes
    outs:
      - ${data_processing.paths.generated_data_dir}/patches/train:
          cache: true
          push: false
      - ${data_processing.paths.generated_data_dir}/annotations:
          cache: true
          push: false
      - ${data_processing.paths.generated_data_dir}/project_descriptor.json:
          cache: true
          push: false
  train:
    desc: 'Trains a model for k epochs'
    cmd: python3 code/training/run_training.py 
      train.pl_environment.paths.dir_dvc_model_out=${dvc_model.top_folder_dir}
    deps:
      - ${data_processing.paths.generated_data_dir}/patches/train
      - ${data_processing.paths.generated_data_dir}/project_descriptor.json
      - ${data_processing.paths.generated_data_dir}/patches/train/stats_rgb.csv
      - ${train.pl_environment.paths.pretrained_model}
      - ../base/data_processing/dataset.py
      - ../base/training/run_training.py
      - code/training/run_training.py
      - code/scenario/classes.py
    params:
      - train.pl_hparams
      - train.pl_environment.best_model_scraper
      - train.pl_environment.seed
      - train.pl_environment.num_workers
    outs:
      - ${dvc_model.top_folder_dir}/model.pt:
          cache: true

The stage generate_data has a heavy output, ${data_processing.paths.generated_data_dir}/patches/train, which contains a large number of patches (577,331 files). When I start an isolated experiment (either a queued or a temp experiment), I see two problems:

  1. The operation Collecting files and computing hashes in data/generated_datasets/default/patches/train takes a long time (sometimes as much as 12 minutes). :turtle:
  2. The operation Collecting files and computing hashes ... is executed multiple times, and I don't understand why. :thinking:
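
One way to quantify the second point is to scan a saved verbose log and count how often the message appears per path. The following is a minimal sketch, assuming the log has been captured as text; the sample lines are copied from the output in this report, and `count_hash_passes` is a hypothetical helper, not part of DVC.

```python
import re
from collections import Counter

# Sample lines copied from the verbose output in this report.
SAMPLE_LOG = """\
2023-08-09 10:11:24,917 DEBUG: built tree 'object c3550914cbfc0de52b202987e2811746.dir'
Collecting files and computing hashes in data/generated_datasets/default/patches/train
Stage 'generate_data' didn't change, skipping
Collecting files and computing hashes in data/generated_datasets/default/patches/train
2023-08-09 10:24:58,172 DEBUG: stage: 'train' changed.
Collecting files and computing hashes in data/generated_datasets/default/patches/train
"""

# Matches the progress message and captures the path that follows it.
HASH_LINE = re.compile(r"^\s*Collecting files and computing hashes in (\S+)")


def count_hash_passes(log_text: str) -> Counter:
    """Count 'Collecting files and computing hashes' lines per path."""
    counts = Counter()
    for line in log_text.splitlines():
        match = HASH_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


passes = count_hash_passes(SAMPLE_LOG)
for path, n in passes.items():
    print(f"{n} pass(es) over {path}")
```

Running the same function over the full log of an experiment would give the total number of passes per directory.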

More precisely, running in verbose mode (-v) shows the following waiting times:

  1. It checks out the generated data to the temp directory (from the local cache). This takes some time, but that is understandable.
  2. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 1st time: (takes ~1 min with 14kfile/s)
    .
    .
    2023-08-09 10:11:24,121 DEBUG: Assuming 'train' to be a stage inside 'dvc.yaml'
    2023-08-09 10:11:24,225 DEBUG: Computed stage: 'data/raw/image_infos.csv.dvc' md5: 'None'
    'data/raw/image_infos.csv.dvc' didn't change, skipping                                                                                                                                                             
    2023-08-09 10:11:24,229 DEBUG: Computed stage: 'data/raw/images.dvc' md5: 'None'
    2023-08-09 10:11:24,268 DEBUG: built tree 'object 852322da91aa5beeeb2df0040845c397.dir'                                                                                                                            
    'data/raw/images.dvc' didn't change, skipping                                                                                                                                                                      
    2023-08-09 10:11:24,275 DEBUG: Computed stage: 'data/raw/annotations.dvc' md5: 'None'
    2023-08-09 10:11:24,644 DEBUG: built tree 'object c3550914cbfc0de52b202987e2811746.dir'                                                                                                                            
    'data/raw/annotations.dvc' didn't change, skipping                                                                                                                                                                 
    2023-08-09 10:11:24,680 DEBUG: built tree 'object 852322da91aa5beeeb2df0040845c397.dir'                                                                                                                            
    2023-08-09 10:11:24,917 DEBUG: built tree 'object c3550914cbfc0de52b202987e2811746.dir'                                                                                                                            
    Collecting files and computing hashes in data/generated_datasets/default/patches/train 
  3. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 2nd time: (takes 12 mins with 700file/s)
    .
    .
    Stage 'generate_data' didn't change, skipping                                                                                                                                                                      
    2023-08-09 10:12:31,275 DEBUG: Computed stage: 'data/models/pretrained_model_template.state_dict.dvc' md5: 'None'
    'data/models/pretrained_model_template.state_dict.dvc' didn't change, skipping                                                                                                                                     
    Collecting files and computing hashes in data/generated_datasets/default/patches/train
  4. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 3rd time: (takes 1 min with 14kfile/s)
    .
    .
    'data/models/pretrained_model_template.state_dict.dvc' didn't change, skipping                                                                                                                                     
    2023-08-09 10:24:58,062 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
    2023-08-09 10:24:58,170 DEBUG: Dependency 'data/generated_datasets/default/patches/train' of stage: 'train' changed because it is 'modified'.                                                                      
    2023-08-09 10:24:58,172 DEBUG: stage: 'train' changed.
    2023-08-09 10:24:58,176 DEBUG: Removing output 'data/models/dvc_model/model.ckpt' of stage: 'train'.
    2023-08-09 10:24:58,176 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/model.ckpt'
    2023-08-09 10:24:58,176 DEBUG: Removing output 'data/models/dvc_model/model.state_dict' of stage: 'train'.
    2023-08-09 10:24:58,176 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/model.state_dict'
    2023-08-09 10:24:58,176 DEBUG: Removing output 'data/models/dvc_model/model.pt' of stage: 'train'.
    2023-08-09 10:24:58,177 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/model.pt'
    2023-08-09 10:24:58,177 DEBUG: Removing output 'data/models/dvc_model/stats_rgb.csv' of stage: 'train'.
    2023-08-09 10:24:58,177 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/stats_rgb.csv'
    2023-08-09 10:24:58,177 DEBUG: Removing output 'data/models/dvc_model/hparams.yaml' of stage: 'train'.
    2023-08-09 10:24:58,177 DEBUG: Removing '/fastdata/users/lefos/code/tissue_segmenter/tissue_segmenter/pdl1_gastric/.dvc/tmp/exps/standalone/tmp2_zrngw3/tissue_segmenter/pdl1_gastric/data/models/dvc_model/hparams.yaml'
    Collecting files and computing hashes in data/generated_datasets/default/patches/train
  5. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 4th time: (takes ~1 min with 14kfile/s)
    .
    .
    2023-08-09 10:25:43,247 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
    2023-08-09 10:28:54,516 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
    Collecting files and computing hashes in data/generated_datasets/default/patches/train
  6. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 5th time: (takes ~1 min with 14kfile/s)
    .
    .                                                                                                                         
    2023-08-09 10:29:38,244 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
    2023-08-09 10:29:38,341 DEBUG: {'data/generated_datasets/default/patches/train': 'modified'}                                                                                                                       
    Collecting files and computing hashes in data/generated_datasets/default/patches/train
  7. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 6th time: (takes ~1 min with 14kfile/s)
    .
    .                                                                                                                    
    Collecting files and computing hashes in data/generated_datasets/default/patches/train                                                                                                   |301k [00:21, 15.5kfile/s]
    2023-08-09 10:30:21,875 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
    2023-08-09 10:30:30,896 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
    Collecting files and computing hashes in data/generated_datasets/default/patches/train
  8. Collecting files and computing hashes in data/generated_datasets/default/patches/train - 7th time: (takes ~1 min with 14kfile/s)
    2023-08-09 10:30:21,875 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
    2023-08-09 10:30:30,896 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
    2023-08-09 10:31:14,382 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
    2023-08-09 10:31:23,380 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
    Collecting files and computing hashes in data/generated_datasets/default/patches/train
  9. Finally, the train stage is invoked :heavy_check_mark:
    2023-08-09 10:32:16,223 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
    Running stage 'train':                                                                                                                                                                                             
    > python3 code/training/run_training.py train.pl_environment.paths.dir_dvc_model_out=data/models/dvc_model
  10. After train is done, once again Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min with 14kfile/s)
    [2023-08-09 10:39:15,126][segmenttools.data.copy][WARNING] - Folder data/models/dvc_model already exists. This folder will be replaced!
    [2023-08-09 10:39:15,451][segmenttools.inference_pipeline.gpu_availability][INFO] - Released lock for GPU: 1 by process with id: 1916752
    [2023-08-09 10:39:15,452][tissue_segmenter.base.training.run_training][INFO] - Training FINISHED with status success. ---
    Collecting files and computing hashes in data/generated_datasets/default/patches/train
  11. And again Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min with 14kfile/s)
    [2023-08-09 10:39:15,452][tissue_segmenter.base.training.run_training][INFO] - Training FINISHED with status success. ---
    2023-08-09 10:40:00,411 DEBUG: built tree 'object d911dc9bebfada16d63187ef66b2317e.dir'                                                                                                                            
    2023-08-09 10:40:09,511 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
    2023-08-09 10:40:10,658 DEBUG: Computed stage: 'train' md5: 'e4a099e883959330ae7111a6383d6006'                                                                                                                     
    Collecting files and computing hashes in data/generated_datasets/default/patches/train
  12. And again Collecting files and computing hashes in data/generated_datasets/default/patches/train (takes ~1 min with 14kfile/s)
    2023-08-09 10:41:03,037 DEBUG: built tree 'object e40e9aa27383d080b194c41cbd6cbcce.dir'                                                                                                                            
    Collecting files and computing hashes in data/generated_datasets/default/patches/train
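
As a rough sanity check on the durations above, the per-pass time can be estimated from the file count and the throughput shown in the progress bar. This is a back-of-the-envelope sketch: the rates are the rounded values quoted in this report, and the variable names are illustrative.

```python
NUM_FILES = 577_331   # file count reported for patches/train
FAST_RATE = 14_000    # files/s, the "14kfile/s" passes
SLOW_RATE = 700       # files/s, the slow second pass


def pass_seconds(num_files: int, files_per_second: float) -> float:
    """Estimated duration of one hashing pass over the directory."""
    return num_files / files_per_second


fast_s = pass_seconds(NUM_FILES, FAST_RATE)
slow_s = pass_seconds(NUM_FILES, SLOW_RATE)

# Seven passes are listed before training starts: six fast, one slow.
before_training_s = 6 * fast_s + slow_s

print(f"fast pass: {fast_s:.0f} s")
print(f"slow pass: {slow_s / 60:.1f} min")
print(f"seven passes before training: {before_training_s / 60:.1f} min")
```

The estimate for a fast pass (about 41 s) is in line with the "~1 min" figures above, and the slow pass comes out somewhat above the reported 12 minutes, which is plausible given that the rates are rounded readings.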

Expected

I would expect the operation Collecting files and computing hashes in data/generated_datasets/default/patches/train not to be invoked so many times.
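
To illustrate the expectation, here is a minimal sketch of the general idea of hashing each file once and reusing the result while its size and modification time are unchanged. It is not DVC's implementation; `HashCache` and `hash_tree` are hypothetical names, and whether such a cache would apply inside a temp experiment directory is not covered here.

```python
import hashlib
import os
import tempfile


def file_md5(path: str) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class HashCache:
    """Hypothetical helper: reuse a hash while size and mtime are unchanged."""

    def __init__(self) -> None:
        self._entries = {}   # path -> ((size, mtime_ns), hex digest)
        self.computed = 0    # number of real hash computations performed

    def get(self, path: str) -> str:
        stat = os.stat(path)
        key = (stat.st_size, stat.st_mtime_ns)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == key:
            return cached[1]
        self.computed += 1
        value = file_md5(path)
        self._entries[path] = (key, value)
        return value


def hash_tree(root: str, cache: HashCache) -> dict:
    """Return {relative path: hash} for every file under root."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            hashes[os.path.relpath(full, root)] = cache.get(full)
    return hashes


with tempfile.TemporaryDirectory() as root:
    # Create a few small stand-in files for the patches.
    for i in range(5):
        with open(os.path.join(root, f"patch_{i}.bin"), "wb") as handle:
            handle.write(bytes([i]) * 1024)

    cache = HashCache()
    first = hash_tree(root, cache)
    computed_after_first = cache.computed
    second = hash_tree(root, cache)   # served from the cache
    computed_after_second = cache.computed

print(f"hashes computed: {computed_after_first} on the first pass, "
      f"{computed_after_second - computed_after_first} on the second")
```

In this sketch the second pass still walks the directory and stats every file, but it does not read any file contents again.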

Environment information

Output of dvc doctor:

$ dvc doctor
-------------------------
Platform: Python 3.8.13 on Linux-5.4.0-153-generic-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 2.11.0
        dvc_objects = 0.24.1
        dvc_render = 0.5.3
        dvc_task = 0.3.0
        scmrepo = 1.0.4
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.6.0, boto3 = 1.28.17),
        ssh (sshfs = 2023.4.1)
Config:
        Global: /home/deep/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/fastdatagroup-fastdatavolume
Caches: local
Remotes: ssh, s3
Workspace directory: ext4 on /dev/mapper/fastdatagroup-fastdatavolume
Repo: dvc (subdir), git
Repo.site_cache_dir: /var/tmp/dvc/repo/840dbd56066b2803ce384b8868549044

daavoo commented 1 year ago

Hi @lefos99, it looks like this might be a duplicate of #9085

daavoo commented 1 year ago

Could you share the viztracer profile:

pip install viztracer
dvc exp run -v --temp --viztracer-depth=10

lefos99 commented 1 year ago

> Could you share the viztracer profile:
>
> pip install viztracer
> dvc exp run -v --temp --viztracer-depth=10

Hey @daavoo, thanks for your reply. :bow:
Here is a screenshot from the viewer: [image]

Please let me know if you need anything else.

dberenbaum commented 1 year ago

Hi @lefos99! Would you have time for a call to walk through your scenario so we can better understand your pain points and see how we can help either by improving performance or suggesting changes to your workflow?

daavoo commented 1 year ago

For the record, I think this might be a duplicate / affected by #9085

lefos99 commented 1 year ago

> Hi @lefos99! Would you have time for a call to walk through your scenario so we can better understand your pain points and see how we can help either by improving performance or suggesting changes to your workflow?

Hey @dberenbaum, I would be more than happy to have a call. Because of vacation, I will be available starting next week. Here are some meeting suggestions: https://calendar.app.google/8VMCwMBNEVBKohEWA.

dberenbaum commented 1 year ago

Booked a time. Looking forward to it!