junjie18 / CMT

[ICCV 2023] Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

The reported FPS is inconsistent with previous work #13

Open Pai-Shu opened 1 year ago

Pai-Shu commented 1 year ago

Hi, thank you for the open-source code.

In the DeepInteraction paper (https://arxiv.org/pdf/2208.11112.pdf), they also report inference speed on an A100 GPU, where the FPS of TransFusion is 6.2 and the FPS of DeepInteraction is 4.9. I also tested it myself and got similar results.

However, in your paper, also on an A100 GPU, the FPS of TransFusion is about 3.2 and the FPS of DeepInteraction is about 1.6. I understand that there may be differences between machines, but this gap is too large.

Can you provide more details about how you measure the FPS and which parts of the code are taken into account?

Thank you very much for your answer.

junjie18 commented 1 year ago

Hi Pai-Shu,

Thanks for raising this. I am sorry, I do not know how their FPS was calculated. I have released my speed-test script, and all the FPS statistics in the paper were measured on the open-source repos with this script.

Pai-Shu commented 1 year ago

Thank you for the kind reply. I have checked the script and noticed that you include the voxelization time. In contrast, previous papers (DeepInteraction and BEVFusion) do not include voxelization (https://github.com/mit-han-lab/bevfusion/issues/14).

I tested the inference time (seconds) of CMT and TransFusion on my GPUs:
w/ voxelization: CMT 0.32 vs. TransFusion 0.39 (CMT faster)
w/o voxelization: CMT 0.29 vs. TransFusion 0.20 (TransFusion faster)

Is it the case that CMT's advantage mainly comes from the voxelization step, or am I misunderstanding something?

junjie18 commented 1 year ago

Thanks for pointing this out.

I do not know which GPU you used; in any case, if you test on an A100, I think you will obtain the same results as mine. I believe this is because modern GPUs put much more emphasis on Transformer acceleration.

By the way, CMT only provides a very naive demo that demonstrates a DETR-style head works on fusion tasks while still maintaining high speed. If you are interested in model speed, there is a lot of work on transformer acceleration; for example, most tokens in CMT are redundant, especially the point cloud tokens.
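For example, token pruning could look roughly like the sketch below (this is not something CMT implements; prune_tokens and the feature-norm saliency heuristic are only illustrative):

import torch

def prune_tokens(tokens, keep_ratio=0.5):
    """Keep the top-k tokens by L2 feature norm (a simple saliency heuristic).

    tokens: (B, N, C) fused image / point-cloud tokens fed to a DETR-style head.
    """
    _, n, c = tokens.shape
    k = max(1, int(n * keep_ratio))
    scores = tokens.norm(dim=-1)              # (B, N) per-token saliency
    idx = scores.topk(k, dim=1).indices       # (B, k) indices of kept tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, c)
    return tokens.gather(1, idx)              # (B, k, C) pruned token set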

AlmoonYsl commented 1 year ago

Thank you for the kind reply. I have checked the script and noticed that you include the voxelization time. In contrast, previous papers (DeepInteraction and BEVFusion) do not include voxelization (mit-han-lab/bevfusion#14).

I tested the inference time (seconds) of CMT and TransFusion on my GPUs:
w/ voxelization: CMT 0.32 vs. TransFusion 0.39 (CMT faster)
w/o voxelization: CMT 0.29 vs. TransFusion 0.20 (TransFusion faster)

Is it the case that CMT's advantage mainly comes from the voxelization step, or am I misunderstanding something?

The voxelization in TransFusion uses the CUDA ops from mmdet3d (v0.11.0). I found that CMT uses spconv-cu111 to voxelize point clouds (which is well optimized), and this may be the reason for the difference in voxelization speed between CMT and TransFusion.
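For anyone who wants to isolate this, here is a minimal sketch of timing only the voxelization step with spconv 2.x's PointToVoxel (the voxel size, range, and dummy point cloud are illustrative, not CMT's exact configuration):

import time

import torch
from spconv.pytorch.utils import PointToVoxel

# Illustrative nuScenes-like voxelization settings (not CMT's exact config).
voxelizer = PointToVoxel(
    vsize_xyz=[0.075, 0.075, 0.2],
    coors_range_xyz=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0],
    num_point_features=5,
    max_num_voxels=120000,
    max_num_points_per_voxel=10,
    device=torch.device('cuda'))

# Dummy point cloud; out-of-range points are simply dropped by the voxelizer.
points = torch.rand(300000, 5, device='cuda') * 100.0 - 50.0

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    voxels, coords, num_points = voxelizer(points)
torch.cuda.synchronize()
print(f'voxelization: {(time.perf_counter() - start) * 10:.2f} ms per call')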

Pai-Shu commented 1 year ago

Thanks for the replies. My GPU is not an A100. There may be some differences in absolute inference speed, but the relative ordering should be unchanged since the gap is significant. Perhaps you could report your inference speed without voxelization on the A100.

I also noticed the difference in voxelization. It is of course valuable to adopt the newest techniques. However, from my perspective, this voxelization optimization can be combined with ANY detector, including most previous work. I do not think it is fair to compare the FPS of different methods directly if voxelization is included in the measurement.

HuangJunJie2017 commented 9 months ago

@Pai-Shu I also found that the comparison with previous work is unfair to a large extent. BEVFusion(MIT)-sttiny-voxel0075 with spconv-cu111 can run as fast as 10 FPS on a 3090, close to that of cmt-r50-voxel01 on a 3090 but with better performance.

HuangJunJie2017 commented 9 months ago

@junjie18 The point is not which GPU it is; the conclusion is the same on the A100 and the 3090. Our small lab does not have many A100s, and the 3090 is what we have at hand. And we are already using this for real work, otherwise I would not bother. We are writing a paper right now, and this kind of unfair comparison makes it very awkward when we cite and compare against you. One case is CMT, and SparseFusion has the same issue; both made it into ICCV 2023. Should we keep making do on the wrong path, or should we uniformly use spconv-cu111 and re-benchmark all the previous methods? A new method can hardly avoid spconv-cu111 either; ours is built on top of BEVFusion (MIT), and without it how could it compete with CMT and SparseFusion? Whether voxelization uses CUDA acceleration has far too large an effect on the overall speed. And if we do use it, how do we explain to reviewers and readers why our speed-accuracy comparison figure differs so much from yours? Do we state outright that the figures in CMT and SparseFusion are unfair comparisons and only ours is fair? That would be terribly awkward to say. So what do you suggest we write?

HuangJunJie2017 commented 9 months ago

@junjie18 Also, for production we need to work on multi-modal large models too. We picked CMT based on the speed-accuracy comparison figure in the paper; after training it we found little difference from BEVFusion, and on closer inspection the speed advantage comes almost entirely from spconv-cu111. The GPU hours we burned went down the drain, which is quite frustrating.

junjie18 commented 9 months ago

@HuangJunJie2017 Could you share your speed-test script?

HuangJunJie2017 commented 9 months ago

@junjie18 Here is my speed-test script. It differs slightly from the mmdet3d one, mainly in workers_per_gpu=0, which prevents the data-loading workers from competing for resources.

# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import time

import torch
from mmcv import Config
from mmcv.parallel import MMDataParallel
from mmcv.runner import load_checkpoint

from mmdet3d.datasets import build_dataloader, build_dataset
from mmdet3d.models import build_detector

def parse_args():
    parser = argparse.ArgumentParser(description='MMDet benchmark a model')
    parser.add_argument('config', help='test config file path')
    parser.add_argument('--checkpoint', default=None, help='checkpoint file')
    parser.add_argument(
        '--samples', type=int, default=200, help='samples to benchmark')
    parser.add_argument(
        '--log-interval', type=int, default=50, help='interval of logging')
    parser.add_argument(
        '--fuse-conv-bn',
        action='store_true',
        help='Whether to fuse conv and bn, this will slightly increase'
        'the inference speed')
    parser.add_argument(
        '--no-acceleration',
        action='store_true',
        help='Omit the pre-computation acceleration')
    args = parser.parse_args()
    return args

def main():
    args = parse_args()

    cfg = Config.fromfile(args.config)
    # set cudnn_benchmark
    if cfg.get('cudnn_benchmark', False):
        torch.backends.cudnn.benchmark = True
    cfg.model.pretrained = None
    cfg.data.test.test_mode = True

    # build the dataloader
    # TODO: support multiple images per gpu (only minor changes are needed)
    dataset = build_dataset(cfg.data.test)
    data_loader = build_dataloader(
        dataset,
        samples_per_gpu=1,
        workers_per_gpu=0,
        dist=False,
        shuffle=False)

    # build the model and load checkpoint
    if not args.no_acceleration:
        cfg.model.img_view_transformer.accelerate = True  # use bevpoolv2 and precompute the indexes
    cfg.model.train_cfg = None
    model = build_detector(cfg.model, test_cfg=cfg.get('test_cfg'))

    if args.checkpoint is not None:
        load_checkpoint(model, args.checkpoint, map_location='cpu')

    model = MMDataParallel(model, device_ids=[0])

    model.eval()

    # the first several iterations may be very slow so skip them
    num_warmup = 5
    pure_inf_time = 0

    # benchmark with several samples and take the average
    for i, data in enumerate(data_loader):

        torch.cuda.synchronize()
        start_time = time.perf_counter()

        with torch.no_grad():
            model(return_loss=False, rescale=True, **data)

        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start_time

        if i >= num_warmup:
            pure_inf_time += elapsed
            if (i + 1) % args.log_interval == 0:
                fps = (i + 1 - num_warmup) / pure_inf_time
                print(f'Done image [{i + 1:<3}/ {args.samples}], '
                      f'fps: {fps:.1f} img / s')

        if (i + 1) == args.samples:
            # 'elapsed' for this sample was already accumulated above
            fps = (i + 1 - num_warmup) / pure_inf_time
            print(f'Overall \nfps: {fps:.2f} img / s '
                  f'\ninference time: {1000 / fps:.2f} ms')
            break

if __name__ == '__main__':
    main()

junjie18 commented 9 months ago

@HuangJunJie2017 I used the original repos from GitHub, changed the voxelization in each of them to the same one CMT uses, and then ran 200 samples on an A100 with your script. The results are below:

model        FPS (sample/s)   AP
CMT-VOV      6.4              70.3
CMT-R50      14.2             67.9
TransFusion  6.5              67.5
BEVFusion    6.2              68.5

junjie18 commented 9 months ago

@HuangJunJie2017 I think this roughly matches expectations. We noticed the optimizations you made to BEVFusion, such as BEVPool v2, which may be somewhat faster than the original BEVFusion we measured. If you think there is a problem with these results, please point it out. Once we have confirmed them, we will update the arXiv paper.

HuangJunJie2017 commented 9 months ago

@HuangJunJie2017 I used the original repos from GitHub, changed the voxelization in each of them to the same one CMT uses, and then ran 200 samples on an A100 with your script. The results are below:

model        FPS (sample/s)   AP
CMT-VOV      6.4              70.3
CMT-R50      14.2             67.9
TransFusion  6.5              67.5
BEVFusion    6.2              68.5

@junjie18 The BEVFusion speed does not meet expectations. The original paper reports a latency of only 119 ms on a 3090, and it should not be slower than that on an A100. If you benchmark their repo directly with the script I provided, the precomputation described in their paper is not performed, and that should be the main reason your measured inference speed differs so much from the original paper. At a resolution like 256x704, the speed difference between the original bevpool and our bevpoolv2 is very small, within 1 ms; see our BEVPoolV2 paper for details.

HuangJunJie2017 commented 9 months ago

@junjie18 Think about it: BEVFusion is just an image backbone and a points backbone, with the features fed directly into a TransFusion head. The TransFusion head in BEVFusion is much simpler than CMT's head, and the view transform takes almost no time, so if it measures much slower than CMT, something is definitely wrong.

junjie18 commented 9 months ago

@HuangJunJie2017 Does "precompute" just refer to voxelization? If so, doesn't that roughly line up with the official results? Also, the BEVFusion experiment uses Swin-T; maybe that is what slows it down? For the CMT-R50 experiment I cut the whole input image down and used a voxel size of 0.1, so the smaller backbone plus the smaller input are what push the overall speed up. The main thing is that I used the official codebases, and I do not have the bandwidth to dig into the exact cause, so I would appreciate your help.

HuangJunJie2017 commented 9 months ago

@junjie18 Precomputation means that inside the view transform, the two variables geom_feats and kept depend only on the camera intrinsics and extrinsics, not on the current frame's image or point cloud, so they can be computed ahead of time and treated as fixed parameters or inputs at inference: https://github.com/mit-han-lab/bevfusion/blob/601961a903c46d14ece4606b8acfe86e604499df/mmdet3d/models/vtransforms/base.py#L169C9-L169C20
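To illustrate the idea, here is a minimal sketch of the caching pattern (not BEVFusion's actual code; compute_geometry stands in for the calibration-only part of the view transform):

import torch

class CachedViewTransform(torch.nn.Module):
    """Toy LSS-style view transform that caches calibration-only tensors."""

    def __init__(self, compute_geometry, intrinsics, extrinsics):
        super().__init__()
        # geom_feats (BEV scatter indices) and kept (in-range mask) depend only
        # on the camera calibration, so they are computed once here.
        geom_feats, kept = compute_geometry(intrinsics, extrinsics)
        self.register_buffer('geom_feats', geom_feats)  # (K, 2) long indices
        self.register_buffer('kept', kept)              # (P,) bool mask

    def forward(self, frustum_feats, bev):
        # Per-frame work reduces to masking and scattering with cached indices.
        feats = frustum_feats[self.kept]                # (K, C)
        idx = self.geom_feats
        bev.index_put_((idx[:, 0], idx[:, 1]), feats, accumulate=True)
        return bev

Without the cache, the index computation, sorting, and filtering are redone for every frame, which is exactly the overhead being discussed here.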

junjie18 commented 9 months ago

@HuangJunJie2017 Could you provide an ablation for the precomputation?

junjie18 commented 9 months ago

@HuangJunJie2017 Also, I feel CMT is in the same situation: at inference there is no BDA (BEV data augmentation), so the positional encodings of the coordinates could all be precomputed. I have not done this either; everything is computed online.

HuangJunJie2017 commented 9 months ago

@HuangJunJie2017 Also, I feel CMT is in the same situation: at inference there is no BDA, so the positional encodings of the coordinates could all be precomputed. I have not done this either.

There should still be a difference. The main issue is that Lift-Splat-Shoot has to process 256/8 × 704/8 × 6 cameras × 120 depth bins ≈ 2M points; transforming that many points from the augmented image space into the 3D space of the ego frame, plus the sorting and filtering, is what makes it so slow. I have not looked closely at the number of points and the computational complexity inside the CMT head.
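A quick back-of-the-envelope check of that point count (the stride-8 feature map, 6 cameras, and 120 depth bins are taken from the sentence above):

h, w, stride, cams, depth_bins = 256, 704, 8, 6, 120
frustum_points = (h // stride) * (w // stride) * cams * depth_bins
print(frustum_points)  # 2,027,520 -> roughly 2M frustum points per frame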

junjie18 commented 9 months ago

@HuangJunJie2017 https://github.com/junjie18/CMT/blob/master/projects/mmdet3d_plugin/models/dense_heads/cmt_head.py#L417 Here is one example. I have several embedding operations like this, where the coordinates go through a coordinate transform and then an MLP. Since the embeddings and the features are decoupled, these could all be computed offline.

HuangJunJie2017 commented 9 months ago

@HuangJunJie2017 https://github.com/junjie18/CMT/blob/master/projects/mmdet3d_plugin/models/dense_heads/cmt_head.py#L417 Here is one example. I have several embedding operations like this, where the coordinates go through a coordinate transform and then an MLP. These could all be computed offline.

This does look like there is room for optimization.
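For illustration, here is a minimal sketch of caching such a decoupled positional embedding (CachedPosEmbed and pos_mlp are hypothetical names, not CMT's actual modules; it assumes the reference coordinates are fixed at inference):

import torch
import torch.nn as nn

class CachedPosEmbed(nn.Module):
    """Toy decoupled positional embedding: coords -> MLP, cached after loading."""

    def __init__(self, coords, embed_dims=256):
        super().__init__()
        self.register_buffer('coords', coords)   # fixed at inference (no BDA)
        self.cached_pe = None
        self.pos_mlp = nn.Sequential(             # stand-in for the real PE MLP
            nn.Linear(coords.shape[-1], embed_dims),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dims, embed_dims))

    def forward(self, token_feats):
        if self.cached_pe is None:
            # Build the cache on the first forward (after checkpoint loading),
            # then reuse it so the per-frame cost is just an element-wise add.
            with torch.no_grad():
                self.cached_pe = self.pos_mlp(self.coords)
        return token_feats + self.cached_pe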

HuangJunJie2017 commented 9 months ago

@junjie18 One more point I had also overlooked: if CMT uses flash-attn, then BEVFusion should also use flash-attn in the comparison; both Swin-T and the head contain attention computations.

junjie18 commented 9 months ago

@HuangJunJie2017 Flash-attn gets its speedup by reducing large memory accesses, and it does not support attention masks. So it is ineffective for Swin's small windows and cannot be applied to TransFusion's multi-modal head. I benchmarked BEVFusion's TransFusion head separately: with flash-attn the FPS is 6.3, basically unchanged.

Also, CMT currently still uses FlashAttention 1; FlashAttention 2 should give another noticeable speedup. And we have not applied the precomputation either; we are not very keen to spend a lot of effort on this.