hellojialee / Improved-Body-Parts

Simple Pose: Rethinking and Improving a Bottom-up Approach for Multi-Person Pose Estimation
https://arxiv.org/abs/1911.10529
258 stars 42 forks

Inference is very slow, 6 seconds per frame. #20

Closed antithing closed 4 years ago

antithing commented 4 years ago

Hi, and thank you for making this code available.

I am running it on Windows, on a GTX 1080, using the demo_image.py file with the model from Google Drive, and it takes more than 6 seconds to detect keypoints in a single frame.

What am I doing wrong? How can I get close to the 38 fps that you mention in the README?

Thank you again!


>python demo_image.py --image input.jpg
0 neck->nose
1 neck->Reye
2 neck->Leye
3 neck->Rear
4 neck->Lear
5 nose->Reye
6 nose->Leye
7 Reye->Rear
8 Leye->Lear
9 neck->Rsho
10 Rsho->Relb
11 Relb->Rwri
12 neck->Lsho
13 Lsho->Lelb
14 Lelb->Lwri
15 neck->Rhip
16 Rhip->Rkne
17 Rkne->Rank
18 neck->Lhip
19 Lhip->Lkne
20 Lkne->Lank
21 nose->Rsho
22 nose->Lsho
23 Rsho->Rhip
24 Rhip->Lkne
25 Lsho->Lhip
26 Lhip->Rkne
27 Rear->Rsho
28 Lear->Lsho
29 Rhip->Lhip
{0: 'neck->nose',
 1: 'neck->Reye',
 2: 'neck->Leye',
 3: 'neck->Rear',
 4: 'neck->Lear',
 5: 'nose->Reye',
 6: 'nose->Leye',
 7: 'Reye->Rear',
 8: 'Leye->Lear',
 9: 'neck->Rsho',
 10: 'Rsho->Relb',
 11: 'Relb->Rwri',
 12: 'neck->Lsho',
 13: 'Lsho->Lelb',
 14: 'Lelb->Lwri',
 15: 'neck->Rhip',
 16: 'Rhip->Rkne',
 17: 'Rkne->Rank',
 18: 'neck->Lhip',
 19: 'Lhip->Lkne',
 20: 'Lkne->Lank',
 21: 'nose->Rsho',
 22: 'nose->Lsho',
 23: 'Rsho->Rhip',
 24: 'Rhip->Lkne',
 25: 'Lsho->Lhip',
 26: 'Lhip->Rkne',
 27: 'Rear->Rsho',
 28: 'Lear->Lsho',
 29: 'Rhip->Lhip',
 30: 'nose',
 31: 'neck',
 32: 'Rsho',
 33: 'Relb',
 34: 'Rwri',
 35: 'Lsho',
 36: 'Lelb',
 37: 'Lwri',
 38: 'Rhip',
 39: 'Rkne',
 40: 'Rank',
 41: 'Lhip',
 42: 'Lkne',
 43: 'Lank',
 44: 'Reye',
 45: 'Leye',
 46: 'Rear',
 47: 'Lear',
 48: 'background',
 49: 'reverseKeypoint'}
Resuming from checkpoint ......
Network weights have been resumed from checkpoint...
cuda
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
start processing...
the 0th keypoint detection result is :  ([(384.98810766687865, 156.99848021452428), (392.0089789786089, 140.00016588448665), (372.00392927155144, 141.9994244210869), (396.997404715929, 137.00354114471122), (339.00678492184926, 140.0066329927729), (424.0065017794617, 191.99842561943024), (304.9960763460449, 220.00916854059585), (443.0001489242592, 272.0109579295975), (292.00050351624543, 310.9984260760411), (465.0083100132065, 350.99493035095674), (293.00562399904305, 404.00513994760007), (420.99916662586236, 393.0031377139439), (349.9987046664099, 401.00452761418853), (413.99545615615057, 536.0021693790678), (351.0002542695355, 541.9933765298466), (376.0021593526506, 644.988972815169), (352.00185668667876, 677.9945526718805)], 0.9674948892626798)
processing time is 6.45740
hellojialee commented 4 years ago

Hi, thank you for your question, and sorry for the confusion. Speed is not one of the motivations of our paper; the 38 fps refers only to the network forward inference and is an upper bound. The same is true of the keypoint assignment part. This repo is only a research prototype, and I didn't do the code acceleration and rebuilding that OpenPose did. Our rough code, using single-scale inference without flipping, runs at about 2 fps.
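For reference, the 38 fps figure times only the network forward pass, while the time printed by demo_image.py covers the whole pipeline, including pre- and post-processing. A minimal timing sketch that isolates the forward pass is shown below; `model` and `img_tensor` are placeholder names, not code from this repo.

```python
import time
import torch

def forward_fps(model, img_tensor, n_runs=20):
    """Frames per second of the network forward pass alone (placeholder names)."""
    model.eval()
    with torch.no_grad():
        for _ in range(3):            # warm-up iterations so cuDNN autotuning is excluded
            model(img_tensor)
        torch.cuda.synchronize()      # make sure the GPU is idle before starting the clock
        start = time.time()
        for _ in range(n_runs):
            model(img_tensor)
        torch.cuda.synchronize()      # wait for all queued kernels before stopping the clock
    return n_runs / (time.time() - start)
```

The gap between this number and the end-to-end time per image is mostly the Python pre/post-processing.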

antithing commented 4 years ago

Ah I see. Thank you for explaining it. :) Stay healthy in these crazy times.

hellojialee commented 4 years ago

It's really nice of you :) Wishing you health and happiness ALL the time!

sokunmin commented 4 years ago

@hellojialee Thanks for this great work!

I have refactored the post-processing in a more intuitive way and added C++ acceleration; it can now run at up to 7~8 fps using single-scale inference with flipping.

Besides, I changed the per-person score calculation in evaluation.py from 1 - 1.0/score to score / count, which increases AP by 0.3% on the COCO 2017 minival set.
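A hedged sketch of the change just described (`score` and `count` stand for the accumulated connection score and the number of keypoints assigned to one person; the surrounding code in evaluation.py is omitted):

```python
def person_score_old(score):
    # original formulation mentioned above
    return 1 - 1.0 / score

def person_score_new(score, count):
    # modified formulation: average the accumulated score over the person's keypoints
    return score / count
```

Averaging over count presumably keeps the score comparable across people with different numbers of detected keypoints.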

You can check the results in my forked repo.

FYI.

hellojialee commented 4 years ago

@sokunmin Awesome work! Could I recommend your Repo in the README? My respect.

sokunmin commented 4 years ago

@hellojialee Yes, you're welcome to, if you don't mind. :) I've learnt a lot from your great work. Hope it also helps those who are interested in it.

hellojialee commented 4 years ago

@sokunmin Many thanks! I feel excited that the prototype Repo may help others. Best wishes to you.

nicolasugrinovic commented 4 years ago

@sokunmin I have tried the code you posted, but it takes around 10 secs per frame, way slower than stated. Also, when I set the parameter --run_cpp I get the following error:

UnboundLocalError: local variable 'person_to_joint_assoc' referenced before assignment

Also, related to apex, I get the following warning:

Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")

sokunmin commented 4 years ago

@nicolasugrinovic

Hi

  1. This error happens from time to time. It is likely a lifecycle issue: the variable person_to_joint_assoc is not cleaned up (or inference is still in flight) before the processed data from the next image is assigned to it. I don't know how to fix it properly yet (see the note at the end of this comment).

What I did was clean the built folder and rebuild the cpp files.

BTW, the IMHN backbone is huge. If you want a faster model, you could find other ways to downsize it (e.g. following the ShuffleNetV2 or MobileNet design guidelines). That will help speed up inference.

  2. It seems you didn't install apex with CUDA extension support. Checking the official GitHub instructions may help.
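For point 2, amp_C only exists when apex is built with its C++/CUDA extensions, so a Python-only install will produce exactly that fallback warning. The NVIDIA/apex README describes an install along these lines (check the current README for the exact command for your CUDA/PyTorch combination):

```
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

For point 1, the UnboundLocalError is the usual Python symptom of a local variable that only gets assigned in one branch. The self-contained sketch below is purely illustrative (none of these names are the actual code in my fork); the point is just to give the variable a default so a branch that fails to assign it degrades gracefully instead of crashing:

```python
def group_people(joints, use_cpp=False):
    """Hypothetical grouping step illustrating the scoping pattern behind the error."""
    person_to_joint_assoc = []                    # default so the name always exists
    if use_cpp:
        pass                                      # imagine the C++ path failing to assign here
    else:
        person_to_joint_assoc = [(j, 1.0) for j in joints]  # stand-in for the Python path
    # Without the default above, taking the `if` branch would raise:
    # UnboundLocalError: local variable 'person_to_joint_assoc' referenced before assignment
    return person_to_joint_assoc

print(group_people(["nose", "neck"]))             # [('nose', 1.0), ('neck', 1.0)]
```

This only avoids the crash; it does not fix the underlying lifecycle issue described above.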