ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Training never started after scanning images/labels! #4100

Closed thusinh1969 closed 3 years ago

thusinh1969 commented 3 years ago

I am training on a scaled-down subset of Open Images V6 containing about 250,000 images.

YAML file:

    # Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
    path: ../  # dataset root dir
    train: train  # train images (relative to 'path') 128 images
    val: test  # val images (relative to 'path') 128 images

    # Classes
    nc: 66  # number of classes
    names: ['Ladder', 'Sink', 'Home appliance', 'Tent', 'Lantern', 'Stairs', 'Chair', 'Cabinetry', 'Bidet', 'Desk',
            'Bronze sculpture', 'Fountain', 'Christmas tree', 'Studio couch', 'Wine rack', 'Couch', 'Door', 'Shower',
            'Wardrobe', 'Tree house', 'Nightstand', 'Window blind', 'Bathtub', 'Houseplant', 'House', 'Ceiling fan',
            'Sofa bed', 'Heater', 'Curtain', 'Bed', 'Fireplace', 'Bookcase', 'Refrigerator', 'Wood-burning stove',
            'Filing cabinet', 'Table', 'Tableware', 'Porch', 'Billiard table', 'Bathroom cabinet', 'Mirror',
            'Chest of drawers', 'Infant bed', 'Cupboard', 'Jacuzzi', 'Sculpture', 'Picture frame', 'Loveseat',
            'Coffee table', 'Toilet', 'Countertop', 'Waste container', 'Swimming pool', 'Furniture', 'Bench',
            'Window', 'Closet', 'Lamp', 'Flowerpot', 'Drawer', 'Stool', 'Shelf', 'Spice rack',
            'Kitchen & dining room table', 'Dog bed', 'Cat furniture']  # class names

    lr0: 0.01  # initial learning rate (SGD=1E-2, Adam=1E-3)
    lrf: 0.2  # final OneCycleLR learning rate (lr0 * lrf)
    momentum: 0.937  # SGD momentum/Adam beta1
    weight_decay: 0.0005  # optimizer weight decay 5e-4
    warmup_epochs: 3.0  # warmup epochs (fractions ok)
    warmup_momentum: 0.8  # warmup initial momentum
    warmup_bias_lr: 0.1  # warmup initial bias lr
    box: 0.05  # box loss gain
    cls: 0.5  # cls loss gain
    cls_pw: 1.0  # cls BCELoss positive_weight
    obj: 1.0  # obj loss gain (scale with pixels)
    obj_pw: 1.0  # obj BCELoss positive_weight
    iou_t: 0.20  # IoU training threshold
    anchor_t: 4.0  # anchor-multiple threshold
    # anchors: 3  # anchors per output layer (0 to ignore)
    fl_gamma: 0.0  # focal loss gamma (efficientDet default gamma=1.5)
    hsv_h: 0.015  # image HSV-Hue augmentation (fraction)
    hsv_s: 0.7  # image HSV-Saturation augmentation (fraction)
    hsv_v: 0.4  # image HSV-Value augmentation (fraction)
    degrees: 0.0  # image rotation (+/- deg)
    translate: 0.1  # image translation (+/- fraction)
    scale: 0.5  # image scale (+/- gain)
    shear: 0.0  # image shear (+/- deg)
    perspective: 0.0  # image perspective (+/- fraction), range 0-0.001
    flipud: 0.0  # image flip up-down (probability)
    fliplr: 0.5  # image flip left-right (probability)
    mosaic: 1.0  # image mosaic (probability)
    mixup: 0.0  # image mixup (probability)
    copy_paste: 0.0  # segment copy-paste (probability)

## 🐛 Errors:

Training starts, scans the dataset, and finds some corrupt images/labels. Then it hangs right there. The GPU is using about 2.5 GB, memory usage is not growing, and there is no activity in wandb.

    (Steve38_WIN) nguyen@hatto2:~/OpenImage/YoloV5/yolov5$ python train.py --batch 8 --img-size 640 --data ../steve_openimage.yaml --weights ../pretrained/yolov5x.pt --device 0
    train: weights=../pretrained/yolov5x.pt, cfg=, data=../steve_openimage.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=300, batch_size=8, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache_images=False, image_weights=False, device=0, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=-1
    github: ⚠️ WARNING: code is out of date by 1 commit. Use 'git pull' to update or 'git clone https://github.com/ultralytics/yolov5' to download latest.
    YOLOv5 🚀 v5.0-303-g3bef77f torch 1.7.1 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11019.0625MB)

    hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
    tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
    2021-07-21 23:13:27.035115: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
    wandb: Currently logged in as: hatto (use wandb login --relogin to force relogin)
    wandb: wandb version 0.11.0 is available! To upgrade, please run:
    wandb: $ pip install wandb --upgrade
    2021-07-21 23:13:31.765893: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
    wandb: Tracking run with wandb version 0.10.28
    wandb: Syncing run exp6
    wandb: ⭐⭐️ View project at https://wandb.ai/hatto/YOLOv5
    wandb: 🚀 View run at https://wandb.ai/hatto/YOLOv5/runs/26sf48h2
    wandb: Run data is saved locally in /home/nguyen/OpenImage/YoloV5/yolov5/wandb/run-20210721_231329-26sf48h2
    wandb: Run wandb offline to turn off syncing.

    Overriding model.yaml nc=80 with nc=66

                     from  n    params  module                                  arguments
      0                -1  1      8800  models.common.Focus                     [3, 80, 3]
      1                -1  1    115520  models.common.Conv                      [80, 160, 3, 2]
      2                -1  1    309120  models.common.C3                        [160, 160, 4]
      3                -1  1    461440  models.common.Conv                      [160, 320, 3, 2]
      4                -1  1   3285760  models.common.C3                        [320, 320, 12]
      5                -1  1   1844480  models.common.Conv                      [320, 640, 3, 2]
      6                -1  1  13125120  models.common.C3                        [640, 640, 12]
      7                -1  1   7375360  models.common.Conv                      [640, 1280, 3, 2]
      8                -1  1   4099840  models.common.SPP                       [1280, 1280, [5, 9, 13]]
      9                -1  1  19676160  models.common.C3                        [1280, 1280, 4, False]
     10                -1  1    820480  models.common.Conv                      [1280, 640, 1, 1]
     11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
     12           [-1, 6]  1         0  models.common.Concat                    [1]
     13                -1  1   5332480  models.common.C3                        [1280, 640, 4, False]
     14                -1  1    205440  models.common.Conv                      [640, 320, 1, 1]
     15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
     16           [-1, 4]  1         0  models.common.Concat                    [1]
     17                -1  1   1335040  models.common.C3                        [640, 320, 4, False]
     18                -1  1    922240  models.common.Conv                      [320, 320, 3, 2]
     19          [-1, 14]  1         0  models.common.Concat                    [1]
     20                -1  1   4922880  models.common.C3                        [640, 640, 4, False]
     21                -1  1   3687680  models.common.Conv                      [640, 640, 3, 2]
     22          [-1, 10]  1         0  models.common.Concat                    [1]
     23                -1  1  19676160  models.common.C3                        [1280, 1280, 4, False]
     24      [17, 20, 23]  1    477759  models.yolo.Detect                      [66, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [320, 640, 1280]]
    Model Summary: 607 layers, 87681759 parameters, 87681759 gradients, 218.7 GFLOPs

    Transferred 788/794 items from ../pretrained/yolov5x.pt
    Scaled weight_decay = 0.0005
    Optimizer groups: 134 .bias, 134 conv.weight, 131 other
    albumentations: Blur(always_apply=False, p=0.1, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.1, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01)
    train: Scanning '../train/labels' images and labels...254739 found, 12 missing, 7791 empty, 375 corrupted: 100%|███████████████████████████████████████████| 254751/254751 [00:34<00:00, 7311.04it/s]
    train: WARNING: Ignoring corrupted image and/or label ../train/images/0071689b11f8a240.jpg: duplicate labels
    train: WARNING: Ignoring corrupted image and/or label ../train/images/0071689b11f8a240.jpg: duplicate labels
    train: WARNING: Ignoring corrupted image and/or label ../train/images/00b4a9339181a90b.jpg: duplicate labels


It hung right there!

Any help is appreciated. Steve
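
Editor's note: the log in this post reports 375 files ignored as corrupted, and the warnings shown give the reason `duplicate labels`. These warnings are separate from the hang itself, but they can be inspected offline. The following is a minimal sketch, assuming the common YOLO label layout of one `class x_center y_center width height` row per object in each `.txt` file. The helper names `find_duplicate_rows` and `scan_label_dir` are hypothetical and are not part of YOLOv5, and the check is only an approximation of whatever the dataloader does internally.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, List, Tuple

Row = Tuple[float, ...]


def find_duplicate_rows(label_text: str) -> List[Row]:
    """Return the rows (parsed as floats) that occur more than once in one label file.

    Parsing to floats means '0.5' and '0.50' count as the same value.
    A malformed (non-numeric) row raises ValueError.
    """
    rows = [
        tuple(float(value) for value in line.split())
        for line in label_text.splitlines()
        if line.strip()  # skip blank lines
    ]
    counts = Counter(rows)
    return [row for row, n in counts.items() if n > 1]


def scan_label_dir(label_dir: str) -> Dict[str, List[Row]]:
    """Map each *.txt file in label_dir to its duplicated rows.

    Files without duplicates are left out of the result.
    """
    report = {}
    for path in sorted(Path(label_dir).glob("*.txt")):
        duplicates = find_duplicate_rows(path.read_text())
        if duplicates:
            report[path.name] = duplicates
    return report
```

Calling `scan_label_dir("../train/labels")` (the label directory named in the log) would list the files that contain repeated rows, which can then be de-duplicated or left for the loader to skip.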

github-actions[bot] commented 3 years ago

👋 Hello @thusinh1969, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 3 years ago

@thusinh1969 for a large dataset it may take a few minutes to initialize the dataloaders. Does COCO training work for you?

python train.py --data coco.yaml

thusinh1969 commented 3 years ago

Found the bug: wandb needed upgrading :( !!! It works now.

Thanks, Steve
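
Editor's note: the training log in the first post shows wandb 0.10.28 in use while advertising 0.11.0 as available, and upgrading wandb is what resolved the hang. The thread does not say which version fixed it, so the sketch below only compares the installed version against the one the log advertised. It assumes Python 3.8+ (for `importlib.metadata`); the helper names are hypothetical and the version parsing is deliberately simplistic.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.10.28' into (0, 10, 28); anything after the leading digits of a piece is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first piece with no leading digits, e.g. 'dev3'
        parts.append(int(digits))
    return tuple(parts)


def is_older(installed, reference):
    """True if the installed version string sorts before the reference version string."""
    return parse_version(installed) < parse_version(reference)


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_version("wandb")
    if current is None:
        print("wandb is not installed")
    elif is_older(current, "0.11.0"):  # 0.11.0 is the version advertised in the log
        print(f"wandb {current} is older than 0.11.0; try: pip install wandb --upgrade")
    else:
        print(f"wandb {current} is at least 0.11.0")
```

The upgrade command printed by the script is the same one wandb itself suggests in the log (`pip install wandb --upgrade`).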