Closed Physu closed 3 years ago
I trained 3DSSD following the config in configs/3dssd/3dssd_kitti-3d-car.py on the train+val data, modified the batch size from 4 to 8 and the lr from 0.002 to 0.004, and kept the rest as-is. The test result (under AP40):
Benchmark | Easy | Moderate | Hard |
---|---|---|---|
Car (Detection) | 94.91 % | 91.35 % | 87.47 % |
Car (Orientation) | 0.01 % | 0.47 % | 0.63 % |
Car (3D Detection) | 86.06 % | 76.48 % | 69.71 % |
Car (Bird's Eye View) | 91.65 % | 86.69 % | 81.05 % |
There exists a large margin between my result and the official 3DSSD (76.48 vs 79.55 on moderate). I am confused about this: did I set something wrong? Or what can I do to close this performance gap? Thanks
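For reference, the batch-size/lr change described above follows the common linear scaling rule (double the total batch size, double the base lr); a minimal sketch, where the `scale_lr` helper is hypothetical and not part of mmdet3d:

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Scale the learning rate linearly with the total batch size."""
    return base_lr * new_batch / base_batch

# Batch size 4 -> 8, so lr 0.002 -> 0.004, matching the change above.
print(scale_lr(0.002, 4, 8))
```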
The reason for the performance difference has been explained on the README page. Among the differences, the two most important ones are: different evaluation code and a different train/val split. The first can yield about a 2 mAP difference, as stated in the README, while the second will at least remove the influence of false-positive predictions in samples without ground truths.
In addition, we also cross-checked the benchmark by evaluating our results with their evaluation code and their results with our evaluation code. The results are almost the same. (Actually, we only reproduced 79.26 mAP with the official code, according to the record of @encore-zhou.)
As for the difference on the test set, there exists some uncertainty and there are tricks involved. Have you ever tried to train a model with the official code and submit the result to the benchmark?
Thanks for your feedback! The official code is implemented in TensorFlow; I will try to train a model, submit the result to the test server, and evaluate the performance. I will update new results here as soon as I get them.
By the way, is 79.26 evaluated on the val data or the test data? If it was evaluated on the test data, the margin between 79.26 and 79.55 (official, on test data) is acceptable. My result on the test data shows a 3 mAP margin, which is unacceptable.
> Actually, we only reproduce the 79.26 mAP with the official code according to the record of @encore-zhou
It's evaluated on their val split with their evaluation code (compared with the reported 83.3). So I guess there is a large range of performance fluctuation on the validation set. You can have a try first, and let's have a closer look into whether there is a gap between our implementation and the official one.
Got it, I will try to reproduce the result by following the official code.
I used the official implementation and configs to train models in a Docker container. The Python packages are listed below:
- tensorflow 1.4.0
- tensorflow-tensorboard 0.4.0
- python 3.5
- cuda 9.0
- numpy 1.14.5
Total train iterations: 80700. Final checkpoint file: model-79893 (not model-80700, despite 80700 total iterations).
The results for model-79893, model-79086, model-78279, and model-77472:
Benchmark | iterations | Easy | Moderate | Hard |
---|---|---|---|---|
Car (Detection) | 77472 | 89.70 % | 82.84 % | 79.97 % |
Car (Detection) | 78279 | 89.29 % | 82.69 % | 80.06 % |
Car (Detection) | 79086 | 91.14 % | 82.79 % | 80.02 % |
Car (Detection) | 79893 | 89.39 % | 82.54 % | 79.83 % |
It seems the official model's evaluation results are better than MMDetection3D's, but further study is needed to find out the reason.
It's a little strange, because when we reproduced 3DSSD, @encore-zhou only got the following performance with the official code:
Maybe there is some fluctuation in performance?
Maybe the author improved the code implementation? Something is causing the performance gap. I will check the 3DSSD head and hope to find something that explains this situation.
And here are new results obtained a few minutes ago.
By the way, these results were trained with more epochs; you can see that the performance further improves (reaching 82.9%).
Yes, it is really strange, because we reproduced the above results in Aug. 2020 (as shown in the screenshot) and there have been no updates after April 2020. We will look into this issue soon. In the meantime, if you have any progress, please feel free to share it here.
Thanks for reopening this issue! New findings will be updated.
Environment:
- pytorch 1.5
- mmdet 1.3.9
- mmdet3d 0.14.0
- mmcv-full 1.3.9
- Ubuntu 18.04
I used the official config configs/3DSSD/3dssd_4x4_kitti-3d-car.py and modified the single-GPU batch size from 4 to 8 (because I use 2 GPUs, while the official config assumes 4 GPUs); the learning rate and the number of epochs were kept as-is.
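The change can be sketched as the following mmdet3d-style config fragment (a sketch only; `workers_per_gpu` is illustrative, and only the per-GPU batch size was actually changed):

```python
# 2 GPUs x 8 samples each keeps the total batch size at 16,
# matching the official 4 GPUs x 4 samples setup.
data = dict(
    samples_per_gpu=8,  # was 4 (per-GPU batch size)
    workers_per_gpu=4,  # illustrative value
)
print(2 * data['samples_per_gpu'])  # total batch size with 2 GPUs
```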
I trained a model with two 2080Ti GPUs on the full train data (7481 samples). Finally I got the following validation results on the val split (3,769 samples):
Then I generated the test submission file and submitted it to the test server:
The performance is not as good as I expected, and I don't know why. Could you please give some opinions on this performance?
I find it is hard to reproduce the results on the KITTI test set, even though you may already have gotten a good result on val.
If we set the confidence threshold to a value greater than 0.0 (the default, which outputs all plausible predictions), e.g. 0.2, to filter the final predictions in predictions_in_test.txt, we will get improved results. Note that you can define your threshold in the config:
```python
test_cfg=dict(
    nms_cfg=dict(type='nms', iou_thr=0.1),
    sample_mod='spec',
    score_thr=0.0,  # Attention!!!
    per_class_proposal=True,
    max_output_num=100))
```
Though there is some improvement, it is still far from 79.57 on moderate (3DSSD on the leaderboard). I guess good post-processing is needed, but which other tricks can improve performance is still an open question.
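The score-threshold filtering described above can be sketched as a small post-processing step over KITTI-format result lines, where the confidence score is the last field of each line (the helper name and the example lines are illustrative, not part of mmdet3d):

```python
def filter_kitti_predictions(lines, thr=0.2):
    """Keep only KITTI result lines whose score (last field) >= thr."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and float(fields[-1]) >= thr:  # score is the last field
            kept.append(line)
    return kept

# Illustrative KITTI-format predictions: the first has score 0.95, the second 0.05.
preds = [
    "Car 0.0 0 -1.58 587.0 173.3 614.1 200.1 1.65 1.67 3.64 -0.65 1.71 46.7 -1.59 0.95",
    "Car 0.0 0 -1.58 100.0 150.0 120.0 180.0 1.60 1.60 3.50 -5.00 1.70 60.0 -1.50 0.05",
]
print(len(filter_kitti_predictions(preds)))  # only the high-score box survives
```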
@Physu Have you ever tried generating submission using the official code and submit it to the test server to see the test set result? Also, it seems to me that, changing mmdet3d's training batch and GPUs from 4x4
to 8x2
improves val set results a lot?
Please kindly provide more observations and I will try to look into this issue.
@Wuziyi616 Thanks for your attention! Does official code mean dvlab-research/3DSSD or some other repo? Besides, in order to learn more about the evaluation procedure, I used traveller59/kitti-object-eval-python to test results on the val set (i.e., save every LiDAR .bin file's results into a txt file, finally getting 3769 txt files). I find that, with no other post-processing involved, the results are slightly better than the mmdet3d evaluation result (maybe it is unfair to compare this way, since the hyperparameters may differ). If I use a confidence threshold of 0.2 to filter out false positives, the result further improves:
I will also reproduce on 4x4, and then we can look into the difference further.
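The per-frame txt dumping step described above can be sketched as follows (the directory name, the `results` structure, and the helper name are all illustrative; KITTI expects one zero-padded six-digit file name per sample):

```python
import os

def dump_kitti_results(results, out_dir="results_val"):
    """results: dict mapping frame id (int) -> list of KITTI result lines."""
    os.makedirs(out_dir, exist_ok=True)
    for frame_id, lines in results.items():
        # KITTI uses zero-padded six-digit file names, e.g. 000123.txt
        path = os.path.join(out_dir, f"{frame_id:06d}.txt")
        with open(path, "w") as f:
            f.write("\n".join(lines))

# Usage sketch: one frame with a single (illustrative) prediction line.
dump_kitti_results({7: ["Car 0.0 0 -1.58 587.0 173.3 614.1 200.1 "
                        "1.65 1.67 3.64 -0.65 1.71 46.7 -1.59 0.95"]})
```

Running this over all 3,769 val frames yields the folder layout that traveller59/kitti-object-eval-python consumes.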
> Does official code mean dvlab-research/3DSSD or some other repo?
Exactly, the official code I mentioned is dvlab's code. I think that's the official code release for 3DSSD, isn't it? As you mentioned in this reply, you said you would submit test results using that code; have you done that?
Thanks for your attention! My submission opportunities are running out, but the results will be updated here soon.
@Physu Have you tried to reproduce the multi-class version of 3DSSD (that is, predicting car, pedestrian and cyclist at the same time)?
@Physu Hi, have you ever tried generating a submission using the official code and submitting it to the test server to see the test set result?
Thanks for the developers' extraordinary work! I have a question about the 3DSSD evaluation result difference between the author's implementation and MMDet3D's. The author's released result:
In MMDet3D, the result:
I noticed "Experiment details on KITTI datasets", which lists the differences from the official implementation.
1. The official implementation is based on TensorFlow 1.4, but I guess PyTorch is not the reason for the poor performance, or is there a performance gap between TensorFlow and PyTorch?
2. There is about a two-percent margin (81.0 vs 83.3) between the two implementations; can we come up with some methods to fix it?
I also used a single 2080Ti to train a train+val model with configs/3DSSD/3dssd_kitti-3d-car.py. I modified
`ann_file=data_root + 'kitti_infos_train.pkl',`
to
`ann_file=data_root + 'kitti_infos_trainval.pkl',`
and the rest of the code was kept as-is. When training finishes, I will evaluate on the test set and post the result here for discussion. Thanks again!
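The annotation-file change above amounts to the following config fragment (a sketch; `data/kitti/` is the usual `data_root` in mmdet3d KITTI configs and is assumed here, and the surrounding config keys are abbreviated):

```python
data_root = 'data/kitti/'  # assumed default mmdet3d KITTI data root
data = dict(
    train=dict(
        # was: ann_file=data_root + 'kitti_infos_train.pkl'
        ann_file=data_root + 'kitti_infos_trainval.pkl',  # train + val, 7481 samples
    ),
)
print(data['train']['ann_file'])
```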