Hi @thodan and @MartinSmeyer,
In the CPU implementation, we iterate over num_estimates x num_gt pairs to calculate the errors for MSSD and MSPD. In the GPU implementation, we instead iterate over num_object x num_gt and batch all estimates of each object, which makes the run-time faster, particularly when num_estimates is large (e.g., with 50K estimates, it is 3x faster). I didn't modify eval_calc_scores.py.
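To illustrate the batching idea, here is a minimal NumPy sketch of computing MSSD errors for all estimate/GT pairs of one object in a single vectorized call (the PR itself runs this on the GPU; the function name and argument shapes below are illustrative, not the actual toolkit API):

```python
import numpy as np

def mssd_batched(R_est, t_est, R_gt, t_gt, pts, syms):
    """Hypothetical batched MSSD for one object.

    R_est: (E, 3, 3), t_est: (E, 3)  -- estimated poses
    R_gt:  (G, 3, 3), t_gt:  (G, 3)  -- ground-truth poses
    pts:   (V, 3)                    -- model vertices
    syms:  (S, 3, 3)                 -- symmetry rotations (include identity)
    Returns an (E, G) array of MSSD errors.
    """
    # Vertices transformed by every estimated pose: (E, V, 3).
    p_est = pts @ R_est.transpose(0, 2, 1) + t_est[:, None, :]
    # Vertices under every symmetry: (S, V, 3).
    pts_sym = pts @ syms.transpose(0, 2, 1)
    # Then by every GT pose: (G, S, V, 3).
    p_gt = np.einsum('gij,svj->gsvi', R_gt, pts_sym) + t_gt[:, None, None, :]
    # Pairwise vertex distances: (E, G, S, V).
    d = np.linalg.norm(p_est[:, None, None] - p_gt[None], axis=-1)
    # MSSD: max over vertices, then min over symmetries.
    return d.max(axis=-1).min(axis=-1)
```

The CPU path does the equivalent work with a Python loop over individual estimates, so the batched form wins exactly when num_estimates is large.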
The current implementation limits GPU memory usage to 1.0 GB and can output results for 50K detections within 6 minutes. Note that the GPU implementation always runs with 1 worker, as batching on the GPU serves the same purpose as multiprocessing for improving run-time.
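One way to respect a fixed memory cap is to derive the estimate batch size from the size of the largest intermediate tensor. A hedged sketch (the helper name and the assumed tensor shape are mine, not the PR's code; the dominant buffer is assumed to be the (batch, num_gt, num_syms, num_pts, 3) distance tensor in float32):

```python
def batch_size_for_budget(budget_bytes, num_gt, num_syms, num_pts, dtype_bytes=4):
    """Hypothetical helper: largest estimate batch whose pairwise
    point tensor of shape (batch, num_gt, num_syms, num_pts, 3)
    fits within budget_bytes of GPU memory."""
    per_estimate = num_gt * num_syms * num_pts * 3 * dtype_bytes
    return max(1, budget_bytes // per_estimate)

# e.g. a 1.0 GB budget with 10 GT poses, 1 symmetry, 1000 vertices:
# batch_size_for_budget(1_000_000_000, 10, 1, 1000)
```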
As usual, I reproduced the MegaPose scores on the 6D localization tasks to make sure the scores do not change.
Here is the run-time benchmark for the 6D detection task:
Thanks @MedericFourmy for finding the bugs!