paul007pl / MVP_Benchmark

MVP Benchmark for Multi-View Partial Point Cloud Completion and Registration
https://mvp-dataset.github.io/
Apache License 2.0

About the GPU usage and training time of several methods in registration. #1

Closed: MaxChanger closed this issue 2 years ago

MaxChanger commented 2 years ago

Hello, I tried to test the point cloud registration code in this benchmark, but I found that the GPU utilization for these methods is very low (jumping between 0% and 20%) and training is also very slow. I used multiple GPUs; do you use a single one for training, or several? Could you share your hardware environment and the approximate time required for a full training run, or for one epoch? I'm not sure whether the problem is my environment or the code, so I'd appreciate your help.

Two test environments: a) CUDA 10.2, 2080Ti, PyTorch 1.5.0; b) CUDA 11.1, 3090, PyTorch 1.7.0.
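For reference, GPU utilization can be sampled alongside training by polling nvidia-smi in a background thread (a generic sketch, not specific to this repo):

```python
# Generic sketch: sample GPU utilization while training runs.
# Assumes nvidia-smi is on PATH; not specific to this repo.
import subprocess
import threading
import time

def log_gpu_utilization(stop_event, interval_s=2.0):
    """Print per-GPU utilization (%) every interval_s seconds."""
    while not stop_event.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        lines = out.stdout.strip().splitlines()
        print("[gpu-util] " + " | ".join(lines))
        time.sleep(interval_s)

stop = threading.Event()
threading.Thread(target=log_gpu_utilization, args=(stop,), daemon=True).start()
# ... launch training here ...
# stop.set()  # when training is done
```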

paul007pl commented 2 years ago

Hi, your observation is probably correct. I think it is not a bug or a problem, just because of the following reasons:

  1. Those methods include many non-trainable operations (e.g. kNN, gathering, grouping, surface normal estimation, etc.).
  2. They use complicated backbones (e.g. DGCNN with a large kNN); see the sketch below.
  3. The input is two point clouds, each of size (BatchSize, 2048, 3).

I use 2 V100s to train DCP and 1 V100 for the rest. Hope my answer helps you.
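To make points 1 and 2 concrete, here is a rough sketch of the kNN/grouping step used in DGCNN-style backbones (generic code, not the exact implementation in this repo). None of these tensor operations carry trainable parameters, yet for clouds of shape (BatchSize, 2048, 3) they take a large share of each iteration:

```python
import torch

def knn(x, k):
    # x: (B, C, N). Pairwise squared distances via the expansion
    # ||xi - xj||^2 = ||xi||^2 - 2*xi.xj + ||xj||^2 (negated), then take the k nearest.
    inner = -2 * torch.matmul(x.transpose(2, 1), x)        # (B, N, N)
    xx = torch.sum(x ** 2, dim=1, keepdim=True)            # (B, 1, N)
    neg_dist = -xx - inner - xx.transpose(2, 1)            # negative squared distance
    return neg_dist.topk(k=k, dim=-1)[1]                   # (B, N, k) neighbour indices

def get_graph_feature(x, k=20):
    # Gather neighbour features and build edge features (DGCNN-style).
    # Pure indexing/gathering: no trainable parameters involved.
    B, C, N = x.size()
    idx = knn(x, k)                                        # (B, N, k)
    idx = idx + torch.arange(B, device=x.device).view(-1, 1, 1) * N
    pts = x.transpose(2, 1).contiguous()                   # (B, N, C)
    neighbours = pts.view(B * N, C)[idx.view(-1)].view(B, N, k, C)
    centre = pts.unsqueeze(2).expand(-1, -1, k, -1)        # (B, N, k, C)
    return torch.cat((neighbours - centre, centre), dim=3).permute(0, 3, 1, 2)  # (B, 2C, N, k)

# For registration the input is two clouds of shape (B, 2048, 3), so the
# N x N distance matrix above is already B x 2048 x 2048 per cloud.
```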
MaxChanger commented 2 years ago

Thank you for your reply; it confirms for me that you trained and tested the code on multiple GPUs.

  1. As far as I remember, the official DCP code trains faster and its GPU utilization is well above 20% (I have forgotten the exact value, since it was a long time ago). That said, I agree with you that calling it a bug or a problem is not quite appropriate.
  2. If possible, could you provide the approximate time it takes to fully train DCP under your experimental conditions?
  3. My GPU server is currently running heavy I/O workloads; I will test it in my environment later and share the results with you.
MaxChanger commented 2 years ago

Hi, @paul007pl. I tested the official DCP code on 4*2080Ti with batch_size=32 and PyTorch 1.5.0; it takes about 9.5 h to run 100 epochs.

(screenshot attached)

As a comparison, I tested the DCP code in the benchmark repo on 4*2080Ti with batch_size=16 (reduced due to memory overflow), and it takes about 1 h to run one epoch. I think the difference between the two is huge, and the difference in datasets shouldn't cause such a big gap. I was eager to share the results with you, so I only ran it for a short while, but I think that is enough to extrapolate the total training time. (The time required for each iteration is not stable; the shortest was about 1 h per epoch.)

(screenshot attached)
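For reference, when I say per-iteration time I mean wall-clock time measured roughly as below; this is a generic sketch with placeholder names (`model`, `loader`, the batch layout), not code from this repo:

```python
import time
import torch

def time_one_epoch(model, loader, optimizer, device="cuda"):
    # Generic sketch with placeholder names; measures wall-clock time per
    # iteration and per epoch, synchronizing so GPU work is fully counted.
    model.train()
    iter_times = []
    epoch_start = time.time()
    for batch in loader:
        torch.cuda.synchronize()
        t0 = time.time()
        src, tgt = batch[0].to(device), batch[1].to(device)  # placeholder batch layout
        loss = model(src, tgt)                               # placeholder: model returns the loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        iter_times.append(time.time() - t0)
    total_h = (time.time() - epoch_start) / 3600
    print(f"epoch: {total_h:.2f} h | iter mean/min/max: "
          f"{sum(iter_times) / len(iter_times):.2f}/"
          f"{min(iter_times):.2f}/{max(iter_times):.2f} s")
```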

paul007pl commented 2 years ago

Thanks for your report; I have improved its training efficiency now. Please try again~
For DeepGMR, you can further improve its efficiency by using this file: https://github.com/wentaoyuan/deepgmr/blob/master/rri.cu.
Once I have time, I will improve it further.

MaxChanger commented 2 years ago

Wow, your improvements are very effective; the DCP training process now reaches roughly the same speed as the official code in my environment. Thanks a lot!

But after comparing the dcp.py files, it looks like the change is mostly in the code structure? Could you briefly describe what you changed, to help me understand where the efficiency improvement comes from, or what the key to the better performance is?

Thanks again.

paul007pl commented 2 years ago

Unfortunately, I do not know the exact reason... I assume it may be caused by the "clone" operation... You are encouraged to investigate further, and maybe you can tell us later :)
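If anyone wants to check this hypothesis, profiling one training step shows which operations dominate. A generic sketch for the PyTorch versions mentioned above, where `model`, `src`, and `tgt` are placeholders rather than names from this repo:

```python
import torch

# Generic sketch: profile one forward/backward pass to see which operations
# dominate. `model`, `src`, `tgt` are placeholders, not names from this repo;
# if an extra clone (or anything else) is the culprit, it shows up in the table.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    loss = model(src, tgt)
    loss.backward()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```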

MaxChanger commented 2 years ago

🤔 Philosophy of life, hahaha. Anyway, thanks again! 🍻