hht1996ok / EA-LSS

EA-LSS: Edge-aware Lift-splat-shot Framework for 3D BEV Object Detection
Apache License 2.0

Low mAP and NDS when evaluating a fully trained camera-stream checkpoint with a smaller input size #28

Open lufanma opened 3 months ago

lufanma commented 3 months ago

@hht1996ok Thanks for the very SOTA work and for publishing the code!!!

In the original camera-stream setting, the input image size (i.e., img_scale) is (1600, 896).

I conducted experiments on 4x V100 GPUs. The original input setting causes CUDA out of memory, so I reduced the input size to half (i.e., 800x448) and also updated the related code that generates the multi-view depth maps from the LiDAR point-cloud projection in the extract_feat functions of ealss.py and ealss_cam.py. The depth maps are still generated at the original RGB image size of (1600, 896).

After 20 epochs of full training, the evaluation metrics are very low (only mAP 11.55, NDS 17.43), which seems abnormal. I guess this is because the features used are effectively at the C4 level relative to the original image size, instead of the C2 level as in the original setting.

@hht1996ok Could you please give me some guidance on the question above? Thanks very much!!! Looking forward to your reply.

lufanma commented 3 months ago

Also, with the above setting, a V100 supports at most a batch size of 1.

hht1996ok commented 3 months ago

@lufanma Thank you for your attention! This metric looks strange. After reducing the image input, you have the option of modifying final_dim, or of removing one downsampling layer from self.dtransform in cam_stream_lss.py, to ensure that the feature sizes are aligned with the depth map sizes.
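For illustration, a minimal sketch of the second option, assuming self.dtransform is a small stack of stride-2 convolutions over the 1-channel depth input; the exact layers, channel widths, and downsample factor in cam_stream_lss.py may differ, so treat this as a hypothetical example only:

```python
import torch.nn as nn

# Hypothetical sketch, not the repo's actual dtransform: if the original module
# reaches its total downsample with three stride-2 convs, changing one of them
# to stride 1 removes one downsampling step. Per the author's reply, the goal is
# simply that the depth features leaving dtransform have the same H x W as the
# image features they are fused with.
dtransform = nn.Sequential(
    nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.BatchNorm2d(8), nn.ReLU(True),
    nn.Conv2d(8, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(True),
    nn.Conv2d(32, 64, 3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(True),  # was stride=2
)
```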

lufanma commented 3 months ago

@lufanma Thank you for your attention! This metric looks strange. After reducing the image input, you have the option of modifying final_dim, or of removing one downsampling layer from self.dtransform in cam_stream_lss.py, to ensure that the feature sizes are aligned with the depth map sizes.

Wow, thanks for the quick reply!!! Since the camera intrinsics used in the img2lidar transform correspond exactly to the original image coordinates, I think the depth map size should be the same as the original RGB size; otherwise, the camera intrinsics would not match the image coordinates.

So I keep final_dim (1600, 896) and self.dtransform unchanged, since FPNC fuses the different feature scales to 1/8 of final_dim. In this way, the feature sizes are aligned with the depth map sizes after self.dtransform (8x downsampling).
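For concreteness, a small check of the alignment described above, using hypothetical names; the 8x factors are the ones mentioned in this thread, not values read from the repo:

```python
def aligned(final_dim, img_down=8, depth_down=8):
    """True if the image features and the depth features after dtransform share the same spatial size."""
    d0, d1 = final_dim
    return (d0 // img_down, d1 // img_down) == (d0 // depth_down, d1 // depth_down)

# Keeping final_dim at the original (1600, 896) with both paths downsampling 8x
# gives 200 x 112 on each side, so the two stay aligned.
print(aligned((1600, 896)))  # True
```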

lufanma commented 3 months ago

@hht1996ok Is my understanding above correct?

lufanma commented 3 months ago

I changed the related code in ealss.py and ealss_cam.py like this:

```python
# original
# depth = torch.zeros(batch_size, img_size[1], 1, img_size[3], img_size[4]).cuda()  # allocate depth map  # (B, 6, 1, H, W)

# generate the depth map at the original image size
ogfH, ogfW = self.final_dim
depth_H, depth_W = ogfH - 4, ogfW  # H=896, W=1600
depth = torch.zeros(batch_size, img_size[1], 1, depth_H, depth_W).cuda()  # allocate depth map  # (B, 6, 1, 896, 1600)
```

and the check for whether a point projects inside the image:

```python
# original
# on_img = (
#     (cur_coords[..., 0] < img_size[3])
#     & (cur_coords[..., 0] >= 0)
#     & (cur_coords[..., 1] < img_size[4])
#     & (cur_coords[..., 1] >= 0)
# )  # (6, N) 0/1 matrix

# cur_coords shape is (6, N, 2), in (y, x) order
on_img = (
        (cur_coords[..., 0] < depth_H)
        & (cur_coords[..., 0] >= 0)
        & (cur_coords[..., 1] < depth_W)
        & (cur_coords[..., 1] >= 0)
)  # (6, N) 0/1 matrix
```
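For context, a hedged sketch of the step that typically follows this mask: writing the depths of the on-image points into the allocated depth map. The helper name and the dist argument are hypothetical; the actual fill code in ealss.py / ealss_cam.py may differ.

```python
import torch

def fill_depth_map(depth, cur_coords, dist, on_img, b):
    """Hypothetical helper (not from the repo): write depths of on-image LiDAR
    points into the per-camera depth map of shape (B, N_cam, 1, H, W).
    cur_coords: (N_cam, N_pts, 2) pixel coords in (y, x) order; dist: (N_cam, N_pts)."""
    for c in range(on_img.shape[0]):
        coords = cur_coords[c, on_img[c]].long()              # (M, 2) integer pixel coords
        depth[b, c, 0, coords[:, 0], coords[:, 1]] = dist[c, on_img[c]]
    return depth
```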
lufanma commented 3 months ago

@hht1996ok Hi Haotian, could you please answer my question? Thanks.

hht1996ok commented 3 months ago

@lufanma This looks right; you should keep the depth map dimensions aligned with the feature dimensions.

Da1symeeting1 commented 1 month ago

@hht1996ok I have two problems.

  1. I used the weights to train the camera branch, and in BEVFusion (Peking) the mAP is 34.79, which seems right. Then I set load_img_from to the weights I trained from BEVFusion (cam), but the results are bad: the loss stays around 9. When I use the mask~.pth weights instead, the loss is about 4 and it does not seem to decrease.
  2. I have 4x A6000 48G and cannot run the code as-is because of memory, so I reduced img_scale to (800, 448); I saw that it is also 800x448 in BEVFusion, so I just modified self.dtransform. Maybe that is the problem. I tried the code above and the depth map is still 1600x896, so should I make both final_dim and img_scale 800x448? (See the sketch below.)
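As a hedged sketch of the change being asked about (hypothetical values, not taken from the repo's actual config), making img_scale, final_dim, and the allocated depth map all use the halved resolution; note that the projected pixel coordinates would then also have to match that halved resolution:

```python
# Hypothetical, consistent half-resolution setup (not the repo's actual config);
# check whether the repo expects (W, H) or (H, W) ordering for each key.
img_scale = (800, 448)   # halved input size, as discussed in this thread
final_dim = (800, 448)   # kept consistent with img_scale for the LSS view transformer
# The depth map allocated in ealss.py would then use the same halved size, and the
# LiDAR-to-image projection has to land in those halved coordinates: if the camera
# intrinsics still describe the original 1600 x 896 image, the projected pixel
# coordinates would need scaling by 0.5 before the on_img check and the depth fill,
# e.g. cur_coords = cur_coords * 0.5
```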
Da1symeeting1 commented 1 month ago

I have tried making final_dim = 800x448 and img_scale = 448x800, keeping the downsample and self.dtransform unchanged, but the loss is still very high. (Screenshot attached.)

rubbish001 commented 1 month ago

I haven't run his experiments, but even I can see your problem: the heatmap loss in the third epoch is far too large; even a camera-only model would not have a loss this high.

rubbish001 commented 1 month ago

Are you using BEV Pool v1 or v2? If it's v2, isn't the speed very slow?

Da1symeeting1 commented 1 month ago

Are you using BEV Pool v1 or v2? If it's v2, isn't the speed very slow?

I believe the BEV Pool this author uses is v1. Following the BEVFusion config with img_scale 448x800, I changed the depth map size as well as the upsample and dtransform; the current situation is shown in the screenshot PixPin_2024-07-09_18-20-06.

Da1symeeting1 commented 1 month ago

@lufanma What mAP did you get? I changed img_scale to 800x448 and also changed the corresponding depth map to 800x448, but after training the mAP is only 0.3199.

rubbish001 commented 1 month ago

Bro, do you have GPUs? If you do, I'll take you along to push for CVPR25 with a temporal paper. My current single-frame image + point cloud fusion reaches 73.9% mAP on the test set, and I'm preparing to submit it to AAAI24. The plan is to build a temporal version on top of that paper; provide the GPUs and you can be a co-author. How about it? This score is only 0.5% mAP behind SparseLIF, which uses 21 image frames while mine uses only one, so there is a good chance of beating it.

rubbish001 commented 1 month ago

Ah, the temporal version probably needs large GPU memory to be workable. I've already worked out the ideas; I just don't have the GPUs to run the experiments.

Da1symeeting1 commented 1 month ago

Yeah, I can't even reproduce this EA-LSS; 48G isn't enough.

rubbish001 commented 1 month ago

48G is enough. I got my AAAI24 results on 4090s; for the next paper I plan to use a SlowFast-style approach.

rubbish001 commented 1 month ago

Yeah, I can't even reproduce this EA-LSS; 48G isn't enough.

Once the paper is accepted I'll release the code; then you can see my results.

https://github.com/rubbish001/Co-Fix3d