mikel-brostrom / Yolov7_StrongSORT_OSNet

Real-time multi-camera multi-object tracker using YOLOv7 and StrongSORT with OSNet
GNU General Public License v3.0

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first. #2

Closed. xugaoxiang closed this issue 2 years ago

xugaoxiang commented 2 years ago

Search before asking

Yolov7_StrongSORT_OSNet Component

Tracking

Bug

(pytorch1.7) PS D:\Github\Yolov7_StrongSORT_OSNet> python track.py --source .\test.mp4 --strong-sort-weights osnet_x0_25_market1501.pt
D:\Github\Yolov7_StrongSORT_OSNet\strong_sort/deep/reid\torchreid\metrics\rank.py:11: UserWarning: Cython evaluation (very fast so highly recommended) is unavailable, now use python evaluation.
  warnings.warn(
Fusing layers...
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
Model: osnet_x0_25

Environment

v1.0
osnet_x0_25_market1501
Windows 10 64-bit
Python 3.8
PyTorch 1.7.1 + cu101

Minimal Reproducible Example

python track.py --source .\test.mp4 --strong-sort-weights osnet_x0_25_market1501.pt

Zhengzhiyang0000 commented 2 years ago

Have you solved this problem?

yagelgen commented 2 years ago

Same here when running on CUDA on Linux.

mikel-brostrom commented 2 years ago

Sorry, cannot reproduce this error on Linux

yagelgen commented 2 years ago

I'm working on AWS EC2 type g4dn.xlarge.

I ran:

python track.py --source v.mp4 --yolo-weights yolov7-e6e.pt --img 1280

And I got:

Downloading https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-e6e.pt to yolov7-e6e.pt...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 290M/290M [00:19<00:00, 15.4MB/s]

Fusing layers... 
Downloading...
From: https://drive.google.com/uc?id=1Kkx2zW89jq_NETu4u42CFZTMVD5Hwm6e
To: /home/ec2-user/Yolov7_StrongSORT_OSNet/weights/osnet_x0_25_msmt17.pt
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.34M/9.34M [00:00<00:00, 17.9MB/s]
Model: osnet_x0_25
- params: 203,568
- flops: 82,316,000
Successfully loaded pretrained weights from "/home/ec2-user/Yolov7_StrongSORT_OSNet/weights/osnet_x0_25_msmt17.pt"
** The following layers are discarded due to unmatched keys or layer size: ['classifier.weight', 'classifier.bias']
(1, 256, 128, 3)
video 1/1 (1/1100) /home/ec2-user/Yolov7_StrongSORT_OSNet/v.mp4: Traceback (most recent call last):
  File "/home/ec2-user/Yolov7_StrongSORT_OSNet/track.py", line 332, in <module>
    main(opt)
  File "/home/ec2-user/Yolov7_StrongSORT_OSNet/track.py", line 327, in main
    run(**vars(opt))
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/Yolov7_StrongSORT_OSNet/track.py", line 149, in run
    for frame_idx, (path, im, im0s, vid_cap) in enumerate(dataset):
  File "/home/ec2-user/Yolov7_StrongSORT_OSNet/yolov7/utils/datasets.py", line 191, in __next__
    img = letterbox(img0, self.img_size, stride=self.stride)[0]
  File "/home/ec2-user/Yolov7_StrongSORT_OSNet/yolov7/utils/datasets.py", line 1000, in letterbox
    dw, dh = np.mod(dw, stride), np.mod(dh, stride)  # wh padding
  File "/home/ec2-user/.local/lib/python3.9/site-packages/torch/_tensor.py", line 732, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
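
For reference, the failure can be reproduced in isolation on any CUDA machine with a small sketch (the variable name below is illustrative; track.py hits the same path through np.mod(dw, stride) inside letterbox):

import numpy as np
import torch

stride = torch.tensor(32.0, device='cuda')  # a tensor living on the GPU
# np.mod(100, stride)                       # raises the TypeError above: NumPy calls
#                                           # Tensor.__array__ -> self.numpy() on a CUDA tensor
print(np.mod(100, stride.cpu().numpy()))    # copy to host memory first; prints 4.0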

P.S. It works if I run on CPU, and it also works on this VM with YOLOv5-StrongSORT.

THX (:

mikel-brostrom commented 2 years ago

git pull and try again, please @yagelgen. Let's see if I managed to fix it now. I still can't reproduce this behavior on a newly cloned repo with:

python track.py --source v.mp4 --yolo-weights yolov7-e6e.pt --img 1280 --device 0

yagelgen commented 2 years ago

@mikel-brostrom Same error. Did you check it on AWS EC2 g4dn?

(If you want, we can schedule like half hour zoom to try to fix it.)

mikel-brostrom commented 2 years ago

I have not tried deploying this on any cloud platform. I am available 11-12 AM CET tomorrow; otherwise, Wednesday 8-12.

Zhengzhiyang0000 commented 2 years ago

I solved the problem. You can try this: in the file /home/ec2-user/.local/lib/python3.9/site-packages/torch/_tensor.py, at line 732 in __array__ (return self.numpy()), modify self.numpy() to self.cpu().numpy().

Zhengzhiyang0000 commented 2 years ago

Modify self.numpy() to self.cpu().numpy(). After I revised it, no error was reported.

Zhengzhiyang0000 commented 2 years ago

you can try it
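
For clarity, the line being edited is inside Tensor.__array__ in the installed torch package (line 732 in the traceback above; the exact line number varies across torch versions):

# torch/_tensor.py, inside Tensor.__array__

# before:
return self.numpy()

# after: copy the tensor to host memory before handing it to NumPy
return self.cpu().numpy()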

yagelgen commented 2 years ago

@Zhengzhiyang0000 yeah! now it works.

@mikel-brostrom do you know how to fix it in the code?

(If you want I'm available tomorrow - you can set half hour in google calendar - yagelgen@gmail.com)

mikel-brostrom commented 2 years ago

I solved the problem. In the file /home/ec2-user/.local/lib/python3.9/site-packages/torch/_tensor.py, at line 732 in __array__ (return self.numpy()), modify self.numpy() to self.cpu().numpy().

Your fix is within torch @Zhengzhiyang0000? That is weird.

xugaoxiang commented 2 years ago

@mikel-brostrom

dataset = LoadImages(source, img_size=imgsz, stride=stride.cpu().numpy())

instead of

dataset = LoadImages(source, img_size=imgsz, stride=stride)

But it's too slow.

Jimmeimetis commented 2 years ago

I fixed it by changing

stride = model.stride.max()

to

stride = int(model.stride.max())

in track.py line 105 and also removing the .cpu().numpy() in the same file
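
A minimal sketch of what that change looks like, assuming the call sites quoted in this thread (the actual track.py context may differ):

# Before (track.py around line 105): model.stride.max() is still a torch tensor,
# which later breaks np.mod() inside letterbox()
# stride = model.stride.max()

# After: cast once to a plain Python int on the host
stride = int(model.stride.max())
dataset = LoadImages(source, img_size=imgsz, stride=stride)  # no .cpu().numpy() needed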

StrongSORT is still very slow in itself, so I see no application for it in real-time scenarios (~0.1 seconds per frame for StrongSORT alone on a 1660 Ti mobile, while my custom-trained YOLOv7-tiny needs an order of magnitude less than that).

mikel-brostrom commented 2 years ago

I achieve the following inference times on my webcam with a modest Quadro P2000, which is way below a 1660 Ti in terms of specs, @Jimmeimetis.

Yolov5s.pt + mobilenetv2_x1_0_msmt17.pt

0: 480x640 1 person, 3 cars, Done. YOLO:(0.024s), StrongSORT:(0.047s)
0: 480x640 1 person, 5 cars, Done. YOLO:(0.019s), StrongSORT:(0.031s)
0: 480x640 1 person, 5 cars, Done. YOLO:(0.018s), StrongSORT:(0.032s)
0: 480x640 1 person, 5 cars, Done. YOLO:(0.019s), StrongSORT:(0.030s)
0: 480x640 1 person, 4 cars, Done. YOLO:(0.018s), StrongSORT:(0.027s)
0: 480x640 1 person, 4 cars, Done. YOLO:(0.018s), StrongSORT:(0.027s)
0: 480x640 1 person, 4 cars, Done. YOLO:(0.019s), StrongSORT:(0.025s)

~20FPS

Yolov5s.engine + mobilenetv2_x1_0_msmt17.engine

0: 640x640 1 class0, 2 class2s, Done. YOLO:(0.018s), StrongSORT:(0.018s)
0: 640x640 1 class0, 3 class2s, Done. YOLO:(0.019s), StrongSORT:(0.020s)
0: 640x640 1 class0, 3 class2s, Done. YOLO:(0.017s), StrongSORT:(0.020s)
0: 640x640 1 class0, 3 class2s, Done. YOLO:(0.019s), StrongSORT:(0.020s)
0: 640x640 1 class0, 2 class2s, Done. YOLO:(0.018s), StrongSORT:(0.017s)
0: 640x640 1 class0, 2 class2s, Done. YOLO:(0.018s), StrongSORT:(0.016s)
0: 640x640 1 class0, 2 class2s, Done. YOLO:(0.018s), StrongSORT:(0.017s)
0: 640x640 1 class0, 2 class2s, Done. YOLO:(0.017s), StrongSORT:(0.017s)

~27FPS

Notice that my main work is in my Yolov5StrongSORT repo, which is currently ahead of Yolov7StrongSORT.

Jimmeimetis commented 2 years ago

These look much more reasonable given the GFLOPS of the models used in StrongSORT. There is a lot of weird behavior on my Turing GPU (1660 Ti) compared to my Pascal one (1070): CUDA 11 makes my 1660 Ti detect nothing on YOLOv7, and on CUDA 10.2, which I'm running as a workaround, FP16 is significantly slower than FP32.

Also thanks for letting me know about your work on the yolov5 repo. Will test it later!

Jimmeimetis commented 2 years ago

OK, tested it: StrongSORT runtime is proper on the YOLOv5 repo, so I will use that implementation or port it to v7. Lastly, just disabling half precision in my CUDA 11 environment with the 1660 Ti seems to do the trick inference-wise (it now detects).
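
For reference, disabling half precision in this kind of YOLO pipeline usually amounts to something like the sketch below (the names are stand-ins, not the repo's exact code):

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Conv2d(3, 16, 3).to(device)  # stand-in for the detection model
img = torch.zeros(1, 3, 640, 640, device=device)

half = device.type != 'cpu'  # typical default: FP16 whenever running on a GPU
half = False                 # workaround: force full precision (FP32)
model = model.half() if half else model.float()
img = img.half() if half else img.float()     # input dtype must match the model's
out = model(img)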

Will test it on a 3090 soon in an attempt to find the culprit. Thanks!

mikel-brostrom commented 2 years ago

Notice that the more detections you have, the longer it will take for StrongSORT to finish the association process. Btw, I don't think the 1660 Ti supports half-precision inference...

Jimmeimetis commented 2 years ago

Btw, I don't think the 1660 Ti supports half-precision inference...

It does, and the issue is likely some poor interaction between PyTorch and CUDA. Even if it didn't support accelerated FP16 at 2x the rate of FP32, the performance should have been roughly the same, not degraded ~10x like it is on my side. I will get to the bottom of this eventually, but it's not a priority right now.

https://www.nvidia.com/en-us/geforce/news/geforce-gtx-1660-ti-advanced-shaders-streaming-multiprocessor/

Thanks and have a good night

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs. Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

NQHuy1905 commented 1 year ago

@Jimmeimetis Have you found the culprit of this issue? I am using a 1660 too, and StrongSORT processes 0.2 s per frame, which is pretty slow.

Jimmeimetis commented 1 year ago

@NQHuy1905 I ported the StrongSORT tracker from the v5 repo to the v7 one, and the execution times lined up with the v5 ones. That said, while it was able to run in real time using a very fast inference model, I did not consider it worth using over DeepSORT due to its higher execution time as is (I even used significantly smaller models for StrongSORT and it still wasn't good enough for my standards).

The porting and testing actually took place the day after my last post here. I did it as fast as possible to get the results I needed, so the changes are somewhat rough.

Either way, if you want to try it, I can try uploading the project somewhere this weekend.

NQHuy1905 commented 1 year ago

@Jimmeimetis So you mean the reason for the high execution time is StrongSORT. I haven't tried DeepSORT with YOLOv7; have you tried it, and was the execution time lower? I tried the trackers from the v5 and v7 repos with smaller YOLO and StrongSORT models, and it wasn't good enough for my standards either.

Jimmeimetis commented 1 year ago

@NQHuy1905 Yes, I have been running YOLOv7 and v8 with DeepSORT. It has its own problems, but at this point I don't have the time to dive into other trackers. There are public repos out there that have paired v7 with DeepSORT if you want to try them.