neuralmagic / sparsify

ML model optimization product to accelerate inference.
Apache License 2.0

Fix tensorboard reloading #258

Closed rahul-tuli closed 1 year ago

rahul-tuli commented 1 year ago

TensorBoard was emitting log messages continuously when running a training-aware command for a yolov5-n (base) model trained on COCO. This made it difficult to follow training progress, or even to tell whether training was running at all. This PR fixes that by suppressing the TensorBoard logs.
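For illustration only, here is a minimal sketch of one way to silence this kind of output with Python's standard `logging` module. It is not necessarily what this PR does. The logger name `tensorboard` is taken from the `INFO:tensorboard:` prefix in the output below. The helper `quiet_logger` and the `WARNING` threshold are hypothetical choices made for this example.

```python
import io
import logging


def quiet_logger(name: str = "tensorboard", level: int = logging.WARNING) -> logging.Logger:
    """Raise the threshold of a named logger so lower-level records are dropped.

    Hypothetical helper for illustration; the name "tensorboard" comes from the
    "INFO:tensorboard:" prefix seen in the logs.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# Capture root-logger output in memory so the effect can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

tb_logger = logging.getLogger("tensorboard")
tb_logger.info("TensorBoard reload process beginning")  # emitted: inherits INFO from root

quiet_logger()  # from here on, INFO records from "tensorboard" are dropped
tb_logger.info("TensorBoard reload process beginning")  # suppressed
tb_logger.warning("still visible")  # warnings and above still get through

captured = stream.getvalue()
root.removeHandler(handler)  # clean up the demo handler
```

Raising the level on the named logger, rather than on the root logger, leaves messages from other libraries (for example the `torch.distributed` lines) untouched.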

Changes include:

Test Command:

sparsify.run training-aware --model "zoo:cv/detection/yolov5-n/pytorch/ultralytics/coco/base-none" --data VOC.yaml --use-case cv-detection --optim-level 0.5

Before this PR:

rahul at quad-mle-1 in ~/projects/sparsify (sparsify) 
$ sparsify.run training-aware --model "zoo:cv/detection/yolov5-n/pytorch/ultralytics/coco/base-none" --data VOC.yaml --use-case cv-detection --optim-level 0.5
Checking for GPU...
GPU check completed successfully
/home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server)

*************************SPARSIFY***************************
TensorBoard listening on http://localhost:6006/
************************************************************

INFO:auto_banner:TensorBoard listening on http://localhost:6006/
INFO:tensorboard:TensorBoard reload process beginning
TensorBoard reload process beginning
INFO:tensorboard:Starting AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
Starting AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
INFO:tensorboard:GetLogdirSubdirectories: Starting to list directories via walking.
GetLogdirSubdirectories: Starting to list directories via walking.
INFO:tensorboard:Done with AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
Done with AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
INFO:tensorboard:TensorBoard reload process: Reload the whole Multiplexer
TensorBoard reload process: Reload the whole Multiplexer
INFO:tensorboard:Beginning EventMultiplexer.Reload()
Beginning EventMultiplexer.Reload()
INFO:tensorboard:Reloading runs serially (one after another) on the main thread.
Reloading runs serially (one after another) on the main thread.
INFO:tensorboard:Finished with EventMultiplexer.Reload()
Finished with EventMultiplexer.Reload()
INFO:tensorboard:TensorBoard done reloading. Load took 0.001 secs
TensorBoard done reloading. Load took 0.001 secs
INFO:tensorboard:TensorBoard reload process beginning
TensorBoard reload process beginning
INFO:tensorboard:Starting AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
Starting AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
INFO:tensorboard:GetLogdirSubdirectories: Starting to list directories via walking.
GetLogdirSubdirectories: Starting to list directories via walking.
INFO:tensorboard:Done with AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
Done with AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
INFO:tensorboard:TensorBoard reload process: Reload the whole Multiplexer
TensorBoard reload process: Reload the whole Multiplexer
INFO:tensorboard:Beginning EventMultiplexer.Reload()
Beginning EventMultiplexer.Reload()
INFO:tensorboard:Reloading runs serially (one after another) on the main thread.
Reloading runs serially (one after another) on the main thread.
INFO:tensorboard:Finished with EventMultiplexer.Reload()
Finished with EventMultiplexer.Reload()
INFO:tensorboard:TensorBoard done reloading. Load took 0.001 secs
TensorBoard done reloading. Load took 0.001 secs
INFO:root:Using nproc_per_node=auto.
Using nproc_per_node=auto.
INFO:torch.distributed.elastic.rendezvous.static_tcp_rendezvous:Creating TCPStore as the c10d::Store implementation
Creating TCPStore as the c10d::Store implementation
INFO:tensorboard:TensorBoard reload process beginning
TensorBoard reload process beginning
INFO:tensorboard:Starting AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
Starting AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
INFO:tensorboard:GetLogdirSubdirectories: Starting to list directories via walking.
GetLogdirSubdirectories: Starting to list directories via walking.
INFO:tensorboard:Done with AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
Done with AddRunsFromDirectory: /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_17_42/logs
INFO:tensorboard:TensorBoard reload process: Reload the whole Multiplexer
TensorBoard reload process: Reload the whole Multiplexer
INFO:tensorboard:Beginning EventMultiplexer.Reload()
Beginning EventMultiplexer.Reload()
INFO:tensorboard:Reloading runs serially (one after another) on the main thread.
Reloading runs serially (one after another) on the main thread.
INFO:tensorboard:Finished with EventMultiplexer.Reload()
Finished with EventMultiplexer.Reload()
INFO:tensorboard:TensorBoard done reloading. Load took 0.001 secs
TensorBoard done reloading. Load took 0.001 secs

After this PR:

$ sparsify.run training-aware --model "zoo:cv/detection/yolov5-n/pytorch/ultralytics/coco/base-none" --data VOC.yaml --use-case cv-detection --optim-level 0.5
Checking for GPU...
GPU check completed successfully
/home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /home/rahul/venvs/sparsify/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
INFO:root:Using nproc_per_node=auto.
Using nproc_per_node=auto.
INFO:torch.distributed.elastic.rendezvous.static_tcp_rendezvous:Creating TCPStore as the c10d::Store implementation
Creating TCPStore as the c10d::Store implementation
train: weights=zoo:cv/detection/yolov5-n/pytorch/ultralytics/coco/base-none, cfg=, teacher_weights=, data=VOC.yaml, data_path=, hyp=hyp.scratch-low.yaml, epochs=300, batch_size=16, gradient_accum_steps=-1, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=/home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_33_38/training_artifacts, log_dir=/home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_33_38/logs, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=0, freeze=[0], save_period=-1, seed=0, local_rank=-1, recipe=zoo:cv/detection/yolov5-n/pytorch/ultralytics/coco/base-none, recipe_args=None, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 2023-7-6 Python-3.8.10 torch-2.0.0+cu117 CUDA:0 (NVIDIA RTX A4000, 16117MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir /home/rahul/projects/sparsify/training_aware_object_detection_2023_07_11_17_33_38', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=20

                 from  n    params  module                                  arguments                     
  0                -1  1      1760  yolov5.models.common.Conv               [3, 16, 6, 2, 2]              
  1                -1  1      4672  yolov5.models.common.Conv               [16, 32, 3, 2]                
  2                -1  1      4800  yolov5.models.common.C3                 [32, 32, 1]                   
  3                -1  1     18560  yolov5.models.common.Conv               [32, 64, 3, 2]                
  4                -1  2     29184  yolov5.models.common.C3                 [64, 64, 2]                   
  5                -1  1     73984  yolov5.models.common.Conv               [64, 128, 3, 2]               
  6                -1  3    156928  yolov5.models.common.C3                 [128, 128, 3]                 
  7                -1  1    295424  yolov5.models.common.Conv               [128, 256, 3, 2]              
  8                -1  1    296448  yolov5.models.common.C3                 [256, 256, 1]                 
  9                -1  1    164608  yolov5.models.common.SPPF               [256, 256, 5]                 
 10                -1  1     33024  yolov5.models.common.Conv               [256, 128, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  yolov5.models.common.Concat             [1]                           
 13                -1  1     90880  yolov5.models.common.C3                 [256, 128, 1, False]          
 14                -1  1      8320  yolov5.models.common.Conv               [128, 64, 1, 1]               
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  yolov5.models.common.Concat             [1]                           
 17                -1  1     22912  yolov5.models.common.C3                 [128, 64, 1, False]           
 18                -1  1     36992  yolov5.models.common.Conv               [64, 64, 3, 2]                
 19          [-1, 14]  1         0  yolov5.models.common.Concat             [1]                           
 20                -1  1     74496  yolov5.models.common.C3                 [128, 128, 1, False]          
 21                -1  1    147712  yolov5.models.common.Conv               [128, 128, 3, 2]              
 22          [-1, 10]  1         0  yolov5.models.common.Concat             [1]                           
 23                -1  1    296448  yolov5.models.common.C3                 [256, 256, 1, False]          
 24      [17, 20, 23]  1     33825  yolov5.models.yolo.Detect               [20, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256]]
Model summary: 214 layers, 1790977 parameters, 1790977 gradients, 4.3 GFLOPs

Transferred 343/349 items from /home/rahul/.cache/sparsezoo/neuralmagic/yolov5-n-coco-base/training/model.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 60 weight(decay=0.0005), 60 bias
train: Scanning '/network/datasets/VOC/labels/train2007.cache' images and labels... 16551 found, 0 missing, 0 empty, 0 corrupt: 100%|██████████| 16551/16551 [00:00<?, ?it/s]