linghu8812 / tensorrt_inference


TensorRT inference engine #33

Open chenxyyy opened 3 years ago

chenxyyy commented 3 years ago

Hi, I've got your Scaled YOLOv4 project running end to end. There's something I'd like to ask about.

After running

./ScaledYOLOv4_trt ../config-p5.yaml ../samples/

the yolov4-p5.trt engine file was generated.

Now I want to run inference with yolov4-p5.trt, so I wrote the following code:

import cv2
import numpy as np

# get_engine, allocate_buffers and do_inference are the usual TensorRT Python
# sample helpers; load_class_names comes from the YOLO tools.
def detect_video(engine_path, file_path, image_size, view):
    with get_engine(engine_path) as engine, engine.create_execution_context() as context:
        buffers = allocate_buffers(engine, 1)
        IN_IMAGE_H, IN_IMAGE_W = image_size
        context.set_binding_shape(0, (1, 3, IN_IMAGE_H, IN_IMAGE_W))
        num_classes = 80
        namesfile = 'tools/coco.names'
        class_names = load_class_names(namesfile)
        cap = cv2.VideoCapture(file_path)  # was hard-coded to "2_Channel1.avi"

        while True:
            ret, img = cap.read()
            if not ret:
                break
            # ============================= preprocess ===================================
            resized = cv2.resize(img, (IN_IMAGE_W, IN_IMAGE_H), interpolation=cv2.INTER_LINEAR)
            img_in = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)            # BGR -> RGB
            img_in = np.transpose(img_in, (2, 0, 1)).astype(np.float32)  # HWC -> CHW
            img_in = np.expand_dims(img_in, axis=0)                      # add batch dim
            img_in /= 255.0                                              # scale to [0, 1]
            img_in = np.ascontiguousarray(img_in)
            # ============================= TensorRT =====================================
            inputs, outputs, bindings, stream = buffers
            inputs[0].host = img_in
            trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

What I want to ask is: after I get the inference result from the .trt file, it's a flat list of length 5597760. How do I extract the [score, class, box] information I need from it?

linghu8812 commented 3 years ago

https://github.com/linghu8812/tensorrt_inference/blob/887cca1487395cc46a23537213201d224600a976/ScaledYOLOv4/ScaledYOLOv4.cpp#L243-L257

The code that decodes the bboxes is here.
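For the Python side, a minimal numpy sketch of the same decoding, assuming the flat output is row-major with 5 + num_classes floats per candidate box (the layout the linked loop walks; that fits here, since 5597760 = 65856 × 85 for the 80 COCO classes). The helper name is illustrative; the boxes come out in network-input scale and still need rescaling to the original frame, plus NMS:

import numpy as np

def decode_trt_output(trt_output, num_classes=80, conf_thres=0.5):
    # One row per candidate box: [cx, cy, w, h, obj, cls0 ... clsN] (assumed layout)
    preds = np.asarray(trt_output).reshape(-1, 5 + num_classes)
    scores = preds[:, 4] * preds[:, 5:].max(axis=1)  # objectness * best class prob
    classes = preds[:, 5:].argmax(axis=1)
    keep = scores > conf_thres
    boxes, scores, classes = preds[keep, :4], scores[keep], classes[keep]
    # (cx, cy, w, h) -> (x1, y1, x2, y2) corners for NMS / drawing
    xyxy = np.empty_like(boxes)
    xyxy[:, 0] = boxes[:, 0] - boxes[:, 2] / 2
    xyxy[:, 1] = boxes[:, 1] - boxes[:, 3] / 2
    xyxy[:, 2] = boxes[:, 0] + boxes[:, 2] / 2
    xyxy[:, 3] = boxes[:, 1] + boxes[:, 3] / 2
    return xyxy, scores, classes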

chenxyyy commented 3 years ago

Sorry, I hadn't read your code carefully.

I'm now using your C++ program directly, slightly modified to test it on a video file:

void ScaledYOLOv4::EngineInferenceVideo(const std::string &video_name, bool view, const int &outSize, void **buffers,
                                        const std::vector<int64_t> &bufferSize, cudaStream_t stream) {
    std::vector<cv::Mat> vec_Mat(BATCH_SIZE);
    VideoCapture cap(video_name);
    if (!cap.isOpened())
        return;
    Mat frame;
    while (true)
    {
        cap >> frame;
        if (frame.empty()) break;
        vec_Mat[0] = frame.clone();  // single-frame batch

        // NOTE: this timer spans prepareImage + inference + postProcess,
        // not the engine's execution time alone.
        auto t_start_pre = std::chrono::high_resolution_clock::now();
        std::vector<float> curInput = prepareImage(vec_Mat);
        cudaMemcpyAsync(buffers[0], curInput.data(), bufferSize[0], cudaMemcpyHostToDevice, stream);
        // do inference; execute() is synchronous -- enqueue(BATCH_SIZE, buffers, stream, nullptr)
        // would keep the async copies and the inference on the same stream
        context->execute(BATCH_SIZE, buffers);
        // host-side output buffer; could be allocated once outside the loop
        auto *out = new float[outSize * BATCH_SIZE];
        cudaMemcpyAsync(out, buffers[1], bufferSize[1], cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        auto boxes = postProcess(vec_Mat, out, outSize);
        auto t_end_pre = std::chrono::high_resolution_clock::now();
        float total_pre = std::chrono::duration<float, std::milli>(t_end_pre - t_start_pre).count();
        std::cout << "Inference take: " << total_pre << " ms." << std::endl;

        auto org_img = vec_Mat[0];
        if (!org_img.data)
            continue;
        auto rects = boxes[0];
        for (const auto &rect : rects)
        {
            char t[256];
            sprintf(t, "%.2f", rect.prob);
            std::string name = detect_labels[rect.classes] + "-" + t;
            // rect stores the box center, so shift by half the size to get the corner
            cv::putText(org_img, name, cv::Point(rect.x - rect.w / 2, rect.y - rect.h / 2 - 5), cv::FONT_HERSHEY_COMPLEX, 0.7, class_colors[rect.classes], 2);
            cv::Rect rst(rect.x - rect.w / 2, rect.y - rect.h / 2, rect.w, rect.h);
            cv::rectangle(org_img, rst, class_colors[rect.classes], 2, cv::LINE_8, 0);
        }
        if (view) {
            imshow("1", org_img);
        }
        delete[] out;
        if (waitKey(1) >= 0)
            break;
    }
}

But the speed hasn't improved noticeably; it's around 40 ms. I'm running on a 2080 Ti, and the original Scaled YOLOv4 takes less than 40 ms. Do you know what the reason might be?

linghu8812 commented 3 years ago

Is the 40 ms the inference engine's time alone, or the total time? The PyTorch Scaled YOLOv4 is fp16 precision, and the TensorRT engine is fp16 as well; you could try int8 precision.
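For reference, a rough sketch of an int8 build with the TensorRT 7/8 Python API. The function name and workspace size are illustrative, this is not code from this repo, and you still have to supply a calibrator (for example an implementation of trt.IInt8EntropyCalibrator2 fed with representative images):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_int8_engine(onnx_path, calibrator):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30    # 1 GB scratch space for tactic selection
    config.set_flag(trt.BuilderFlag.INT8)  # quantize weights/activations to int8
    config.int8_calibrator = calibrator    # drives the calibration passes
    return builder.build_engine(network, config)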

chenxyyy commented 3 years ago

The 40 ms includes prepareImage, inference, and postProcess.

The original Scaled YOLOv4 P5 (without TensorRT acceleration) takes about 30 ms.

Why would TensorRT actually be slower? I can't figure it out, haha.

Have you tried int8? Usually int8 costs a lot of accuracy, and calibration only helps to a limited extent.

linghu8812 commented 3 years ago

It's probably that postProcess takes too long. Scaled YOLOv4 outputs a very large number of rows at the end; postProcess and NMS could still be optimized.
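One common way to trim that cost (a sketch under the same row layout assumed earlier, not the repo's code): drop low-objectness rows with a single vectorized filter before any per-box work, so that NMS only sees a few hundred candidates instead of tens of thousands:

import cv2
import numpy as np

def filtered_nms(preds, conf_thres=0.4, iou_thres=0.5):
    # preds: (rows, 5 + num_classes); most rows have near-zero objectness,
    # so this one vectorized filter removes the bulk of the work.
    preds = preds[preds[:, 4] > conf_thres]
    scores = preds[:, 4] * preds[:, 5:].max(axis=1)
    classes = preds[:, 5:].argmax(axis=1)
    boxes = preds[:, :4].copy()
    boxes[:, 0] -= boxes[:, 2] / 2  # (cx, cy, w, h) -> (x, y, w, h)
    boxes[:, 1] -= boxes[:, 3] / 2
    idx = cv2.dnn.NMSBoxes(boxes.tolist(), scores.tolist(),
                           conf_thres, iou_thres)
    idx = np.asarray(idx, dtype=int).reshape(-1)
    return boxes[idx], scores[idx], classes[idx]

Note that cv2.dnn.NMSBoxes as used here is class-agnostic; a per-class NMS would loop over the surviving classes.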

chenxyyy commented 3 years ago

I also tried quantizing YOLOv4; the overall time is around 18 ms, which is a decent improvement.

You might take a look at https://github.com/jkjung-avt/tensorrt_demos

Using the approach there, YOLOv4 runs at 6.8 ms per frame in my tests.

AderonHuang commented 3 years ago

> https://github.com/linghu8812/tensorrt_inference/blob/887cca1487395cc46a23537213201d224600a976/ScaledYOLOv4/ScaledYOLOv4.cpp#L243-L257
>
> The code that decodes the bboxes is here.

Is there a Python version of this?

AderonHuang commented 3 years ago

> [quotes chenxyyy's original question and Python code from above]

OP, I'm running into the same problem as you. Did you manage to extract the results with Python in the end?

chenxyyy commented 3 years ago

@AderonHuang I didn't do it in Python. The author's C++ program is well written; following it, it should be straightforward to implement.

AderonHuang commented 3 years ago

> @AderonHuang I didn't do it in Python. The author's C++ program is well written; following it, it should be straightforward to implement.

My TensorRT output in Python is a flat list of length 1843200. How do I reshape it into feature maps, or get the boxes from it? I'm not familiar with C++; could you explain?

AderonHuang commented 3 years ago

> It's probably that postProcess takes too long. Scaled YOLOv4 outputs a very large number of rows at the end; postProcess and NMS could still be optimized.

After visualizing the results, some of the 2D box lines faintly show up in other parts of the image. Is this an OpenCV problem?

henbucuoshanghai commented 3 years ago

What should I do when there are multiple outputs? Help please. The line below only copies a single output:

cudaMemcpyAsync(out, buffers[1], bufferSize[1], cudaMemcpyDeviceToHost, stream);