How can I convert an MPS tensor to float32 for YOLOv5?

KronbergE commented 2 years ago

Search before asking

[X] I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

I'm trying to use my MacBook's GPU to train my YOLOv5 model. I have already installed everything needed to use the GPU, and platform.mac_ver() returns the expected "('12.5', ('', '', ''), 'arm64')".

But when I run my train command, I get this error message:

!python train.py --device mps --img 640 --cfg /Users/myname/Desktop/yolov5/models/modifiedYolov5s.yaml --hyp /Users/myname/Desktop/yolov5/data/hyps/hyp.scratch.yaml --batch 32 --epochs 10 --data /Users/myname/Desktop/yolov5/data/pavementDistressDetectionSwedishData2.yaml --weights /Users/myname/Desktop/yolov5/runs/train/modDistressDetectorImprovedV0SwedishFixed/weights/best.pt --workers 8 --name modDistressDetectorImprovedV2SwedishData2

Traceback (most recent call last):
  File "/Users/myname/Desktop/yolov5/train.py", line 666, in <module>
    main(opt)
  File "/Users/myname/Desktop/yolov5/train.py", line 561, in main
    train(opt.hyp, opt, device, callbacks)
  File "/Users/myname/Desktop/yolov5/train.py", line 285, in train
    model.class_weights = labels_to_class_weights(dataset.labels, nc).to(device) * nc  # attach class weights
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

I've searched around but cannot figure out how to make PyTorch use float32 instead of float64.

Does anyone have any tips on what I might try?
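
One common way to avoid this error is to cast to float32 on the CPU before moving the tensor to the MPS device. The sketch below assumes the float64 data originates on the CPU, for example from a NumPy array, whose default float dtype is float64. The helper name `to_device_float32` is hypothetical and is not part of YOLOv5.

```python
import numpy as np
import torch


def to_device_float32(array, device):
    """Cast float64 data to float32 on the CPU, then move it to `device`.

    Hypothetical helper for illustration; not part of YOLOv5.
    """
    tensor = torch.as_tensor(array)  # stays on the CPU, keeps the input dtype
    if tensor.dtype == torch.float64:
        tensor = tensor.float()  # float64 -> float32 before touching MPS
    return tensor.to(device)


# Fall back to the CPU so the sketch also runs on machines without MPS.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

weights = np.array([0.2, 0.3, 0.5])  # NumPy creates float64 by default
print(to_device_float32(weights, device).dtype)  # torch.float32
```

The cast happens before `.to(device)` on purpose: the error above is raised when a float64 tensor is moved to MPS, so the dtype has to be changed while the data is still on the CPU.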

Additional

No response

glenn-jocher commented 2 years ago

@KronbergE I think this was resolved in master a few weeks ago. Can you verify that you still see the problem with the current master code?

The only remaining reference to float64 that I can find in the repo is this line: https://github.com/ultralytics/yolov5/blob/628c05ca6ff1d7f79d1fc63c298008a1341ba99c/utils/dataloaders.py#L481

glenn-jocher commented 2 years ago

@KronbergE good news 😃! Your original issue may now be fixed ✅ in PR #8865. To receive this update:

Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
PyTorch Hub – Force-reload model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Notebooks – View updated notebooks
Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

KronbergE commented 2 years ago

UPDATE

Thank you @glenn-jocher for your response as you must have a pretty packed schedule, it's much appreciated :D

I pulled the latest version of the repository and created a new environment with the latest requirements. Now, when I try to train on MPS, I get this error message instead:

!python train.py --device mps --img 640 --cfg /Users/myname/Desktop/yolov5/models/modifiedYolov5s.yaml --hyp /Users/myname/Desktop/yolov5/data/hyps/hyp.scratch-low.yaml --batch 16 --epochs 10 --data /Users/myname/Desktop/yolov5/data/pavementDistressDetectionSwedishData2.yaml --weights /Users/myname/Desktop/yolov5/runs/train/modDistressDetectorImprovedV0SwedishFixed/weights/best.pt --workers 8 --name modDistressDetectorImprovedV2SwedishData2

train: weights=/Users/myname/Desktop/yolov5/runs/train/modDistressDetectorImprovedV0SwedishFixed/weights/best.pt, cfg=/Users/myname/Desktop/yolov5/models/modifiedYolov5s.yaml, data=/Users/myname/Desktop/yolov5/data/pavementDistressDetectionSwedishData2.yaml, hyp=/Users/myname/Desktop/yolov5/data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=mps, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=modDistressDetectorImprovedV2SwedishData2, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v6.1-362-g731a2f8 Python-3.10.4 torch-1.13.0.dev20220804 MPS

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
YOLOv5 temporarily requires wandb version 0.12.10 or below. Some features may not work as expected.
Overriding model.yaml nc=80 with nc=8

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    115712  models.common.C3                        [128, 128, 2]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]                 
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1     35061  models.yolo.Detect                      [8, [[19, 10, 51, 12, 31, 29], [134, 13, 79, 30, 60, 64], [291, 28, 130, 81, 197, 140]], [128, 256, 512]]
modifiedYolov5s summary: 270 layers, 7041205 parameters, 7041205 gradients, 16.0 GFLOPs

Transferred 348/349 items from /Users/myname/Desktop/yolov5/runs/train/modDistressDetectorImprovedV0SwedishFixed/weights/best.pt
/Users/myname/Desktop/yolov5/utils/general.py:833: UserWarning: The operator 'aten::nonzero' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:11.)
  x = x[xc[xi]]  # confidence
AMP: checks failed ❌, disabling Automatic Mixed Precision. See https://github.com/ultralytics/yolov5/issues/7908
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
train: Scanning '/Users/myname/Desktop/yolov5/swedishData2/labels/train.ca
val: Scanning '/Users/myname/Desktop/yolov5/swedishData2/labels/val.cache'
Plotting labels to runs/train/modDistressDetectorImprovedV2SwedishData2/labels.jpg... 

AutoAnchor: 4.23 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/modDistressDetectorImprovedV2SwedishData2
Starting training for 10 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/59 [00:00<?, ?it/s]
wandb: Currently logged in as: erikyolo. Use `wandb login --relogin` to force relogin
  0%|          | 0/59 [00:02<?, ?it/s]                                          
Traceback (most recent call last):
  File "/Users/myname/Desktop/yolov5/train.py", line 634, in <module>
    main(opt)
  File "/Users/myname/Desktop/yolov5/train.py", line 529, in main
    train(opt.hyp, opt, device, callbacks)
  File "/Users/myname/Desktop/yolov5/train.py", line 310, in train
    loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size
  File "/Users/myname/Desktop/yolov5/utils/loss.py", line 125, in __call__
    tcls, tbox, indices, anchors = self.build_targets(p, targets)  # targets
  File "/Users/myname/Desktop/yolov5/utils/loss.py", line 208, in build_targets
    t = t[j]  # filter
NotImplementedError: The operator 'aten::index.Tensor_out' is not current implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

When I add PYTORCH_ENABLE_MPS_FALLBACK=1 to the command, I get this error message instead:

!PYTORCH_ENABLE_MPS_FALLBACK=1 python train.py --device mps --img 640 --cfg /Users/myname/Desktop/yolov5/models/modifiedYolov5s.yaml --hyp /Users/myname/Desktop/yolov5/data/hyps/hyp.scratch-low.yaml --batch 16 --epochs 10 --data /Users/myname/Desktop/yolov5/data/pavementDistressDetectionSwedishData2.yaml --weights /Users/myname/Desktop/yolov5/runs/train/modDistressDetectorImprovedV0SwedishFixed/weights/best.pt --workers 8 --name modDistressDetectorImprovedV2SwedishData2

Traceback (most recent call last):
  File "/Users/myname/Desktop/yolov5/train.py", line 634, in <module>
    main(opt)
  File "/Users/myname/Desktop/yolov5/train.py", line 529, in main
    train(opt.hyp, opt, device, callbacks)
  File "/Users/myname/Desktop/yolov5/train.py", line 310, in train
    loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size
  File "/Users/myname/Desktop/yolov5/utils/loss.py", line 125, in __call__
    tcls, tbox, indices, anchors = self.build_targets(p, targets)  # targets
  File "/Users/myname/Desktop/yolov5/utils/loss.py", line 208, in build_targets
    t = t[j]  # filter
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
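
The `PYTORCH_ENABLE_MPS_FALLBACK=1` prefix in the command above sets the variable for that one process only. A minimal sketch of doing the same from Python follows, assuming `train.py` is launched as a subprocess from inside the yolov5/ directory. The helper names `fallback_env`, `train_command`, and `run_training` are hypothetical. Setting `os.environ` directly in a notebook is another option, but it is assumed here to be reliable only if done before `torch` is first imported.

```python
import os
import subprocess
import sys


def fallback_env(base_env=None):
    """Return a copy of the environment with the MPS CPU fallback enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    return env


def train_command(extra_args):
    """Build the argument list for a YOLOv5 training run on MPS."""
    return [sys.executable, "train.py", "--device", "mps", *extra_args]


def run_training(extra_args):
    """Launch train.py with the fallback enabled (run from the yolov5/ directory)."""
    return subprocess.run(train_command(extra_args), env=fallback_env(), check=True)
```

A call such as `run_training(["--img", "640", "--batch", "16", "--epochs", "10"])` would then start training with the variable set for the child process only, leaving the parent environment untouched.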

glenn-jocher commented 2 years ago

@KronbergE yes, this is expected. I would head over to the PyTorch issue mentioned in the error and add your vote for the op we need prioritized for conversion, aten::index.Tensor_out.
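
The RuntimeError in the second traceback indicates that the index tensor `j` and the indexed tensor `t` ended up on different devices once the CPU fallback was active. The sketch below shows the general pattern of moving a boolean mask to the indexed tensor's device before indexing. It is an illustration only, with a hypothetical helper name, and is not the change made in YOLOv5 or PyTorch. Whether it helps inside `build_targets` depends on which ops the installed PyTorch build supports on MPS.

```python
import torch


def filter_rows(t, mask):
    """Select rows of `t` with a boolean mask that may live on another device."""
    return t[mask.to(t.device)]


t = torch.arange(6.0).reshape(3, 2)       # rows: [0, 1], [2, 3], [4, 5]
mask = torch.tensor([True, False, True])  # keep the first and last rows
print(filter_rows(t, mask).tolist())      # [[0.0, 1.0], [4.0, 5.0]]
```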

maverick-ai commented 2 years ago

A float64 reference is still present in autoanchor.

AutoAnchor: 2.80 anchors/target, 0.947 Best Possible Recall (BPR). Anchors are a poor fit to dataset ⚠️, attempting to improve...
AutoAnchor: WARNING: Extremely small objects found: 410 of 8227 labels are < 3 pixels in size
AutoAnchor: Running kmeans for 9 anchors on 8224 points...
AutoAnchor: Evolving anchors with Genetic Algorithm: fitness = 0.8163: 100%|██████████| 1000/1000 [00:00<00:00, 1252.70it/s]
AutoAnchor: thr=0.25: 0.9993 best possible recall, 7.51 anchors past thr
AutoAnchor: n=9, img_size=64, metric_all=0.477/0.816-mean/best, past_thr=0.536-mean: 4,3, 6,6, 10,7, 7,10, 12,11, 9,19, 15,17, 13,27, 17,25
Traceback (most recent call last):
  File "/Users/sarthakbansal/Desktop/ObjectDetection/yolov5/train.py", line 633, in <module>
    main(opt)
  File "/Users/sarthakbansal/Desktop/ObjectDetection/yolov5/train.py", line 529, in main
    train(opt.hyp, opt, device, callbacks)
  File "/Users/sarthakbansal/Desktop/ObjectDetection/yolov5/train.py", line 225, in train
    check_anchors(dataset, model=model, thr=hyp['anchor_t'], imgsz=imgsz)
  File "/Users/sarthakbansal/Desktop/ObjectDetection/yolov5/utils/autoanchor.py", line 58, in check_anchors
    anchors = torch.tensor(anchors, device=m.anchors.device).type_as(m.anchors)
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.

glenn-jocher commented 2 years ago

@maverick-ai is this reproducible in current master? We used to have some float64 variables in YOLOv5 but they've all since been removed completely from the repo.

maverick-ai commented 2 years ago

Yes, I am using the current master branch

maverick-ai commented 2 years ago

@glenn-jocher The issue is still in yolov5/utils/autoanchor.py. I am pasting the screenshot of my terminal for your reference

glenn-jocher commented 2 years ago

@maverick-ai can you debug and see exactly which variable is float64?
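
One way to answer this is to scan the local variables just before the failing line for anything whose dtype is float64. The helper below is a hypothetical sketch, not part of YOLOv5. It relies only on the assumption that NumPy arrays and PyTorch tensors expose a `dtype` attribute whose string form ends in `float64`.

```python
from types import SimpleNamespace


def find_float64(namespace):
    """Return the sorted names in `namespace` whose value has a float64 dtype."""
    hits = []
    for name, value in namespace.items():
        dtype = getattr(value, "dtype", None)
        if dtype is not None and str(dtype).endswith("float64"):
            hits.append(name)
    return sorted(hits)


# Stand-ins for real arrays/tensors so the sketch runs without extra packages.
fake_locals = {
    "anchors": SimpleNamespace(dtype="float64"),
    "stride": SimpleNamespace(dtype="torch.float32"),
    "thr": 4.0,
}
print(find_float64(fake_locals))  # ['anchors']
```

In practice, a call such as `print(find_float64(locals()))` could be placed just above line 58 of utils/autoanchor.py, the line named in the traceback, or run from a debugger stopped at that frame.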

glenn-jocher commented 2 years ago

@KronbergE @maverick-ai good news 😃! Your original issue may now be fixed ✅ in PR #9188. To receive this update:

Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
PyTorch Hub – Force-reload model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
Notebooks – View updated notebooks
Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
