tjuskyzhang / Scaled-YOLOv4-TensorRT

Got 100 FPS on TX2 and 500 FPS on a GeForce GTX 1660 Ti. If the project is useful to you, please star it.
178 stars, 41 forks

Custom implementation with modified network (detections not as they should be) #17

Closed casper-hansen closed 3 years ago

casper-hansen commented 3 years ago

@tjuskyzhang the detections are not working with the tiny 3l config.

All I have done is modify the CSP version in your repository to match my config file and then converted the weights successfully. I can serialize and deserialize, but the network's predictions are totally off and do not look like they should.

Can you possibly help with the 3-layer YOLO config? I can provide the config and weights if needed.

Your modified repository output:

[image]

Scaled YOLOv4 repository output

[image]

tjuskyzhang commented 3 years ago

Can you show me your code for "createEngine" and "yololayer.h"?

casper-hansen commented 3 years ago

@tjuskyzhang It would be great if this could work -- I'm running 24 FPS on a Jetson Nano with this config.

Here is createEngine. I believe it should be correct.

ICudaEngine *createEngine(unsigned int maxBatchSize, IBuilder *builder, IBuilderConfig *config, DataType dt)
{
    INetworkDefinition *network = builder->createNetworkV2(0U);

    // Create input tensor of shape {3, INPUT_H, INPUT_W} with name INPUT_BLOB_NAME
    ITensor *data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{3, INPUT_H, INPUT_W});
    assert(data);

    std::map<std::string, Weights> weightMap = loadWeights("../yolo_tiny_best.wts");

    // define each layer.
    auto l0 = convBnMish(network, weightMap, *data, 32, 3, 2, 1, 0);
    auto l1 = convBnMish(network, weightMap, *l0->getOutput(0), 64, 3, 2, 1, 1);
    auto l2 = convBnMish(network, weightMap, *l1->getOutput(0), 64, 3, 1, 1, 2);

    // route_lhalf
    ISliceLayer* l3 = network->addSlice(*l2->getOutput(0), Dims3{0, 0, 0}, Dims3{32, INPUT_H / 4, INPUT_W / 4}, Dims3{1, 1, 1}); // dims are {C, H, W}

    auto l4 = convBnMish(network, weightMap, *l3->getOutput(0), 32, 3, 1, 1, 4);
    auto l5 = convBnMish(network, weightMap, *l4->getOutput(0), 32, 3, 1, 1, 5);
    ITensor* inputTensors6[] = {l5->getOutput(0), l4->getOutput(0)};
    auto cat6 = network->addConcatenation(inputTensors6, 2);
    auto l7 = convBnMish(network, weightMap, *cat6->getOutput(0), 64, 1, 1, 0, 7);

    // route all
    ITensor* inputTensors8[] = {l2->getOutput(0), l7->getOutput(0)};
    auto cat8 = network->addConcatenation(inputTensors8, 2);

    auto pool9 = network->addPoolingNd(*cat8->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});
    pool9->setStrideNd(DimsHW{2, 2});
    auto l10 = convBnMish(network, weightMap, *pool9->getOutput(0), 128, 3, 1, 1, 10);

    // route_lhalf
    ISliceLayer* l11 = network->addSlice(*l10->getOutput(0), Dims3{0, 0, 0}, Dims3{64, INPUT_H / 8, INPUT_W / 8}, Dims3{1, 1, 1});

    auto l12 = convBnMish(network, weightMap, *l11->getOutput(0), 64, 3, 1, 1, 12);
    auto l13 = convBnMish(network, weightMap, *l12->getOutput(0), 64, 3, 1, 1, 13);
    ITensor* inputTensors14[] = {l13->getOutput(0), l12->getOutput(0)};
    auto cat14 = network->addConcatenation(inputTensors14, 2);
    auto l15 = convBnMish(network, weightMap, *cat14->getOutput(0), 128, 1, 1, 0, 15);
    // route all
    ITensor* inputTensors16[] = {l10->getOutput(0), l15->getOutput(0)};
    auto cat16 = network->addConcatenation(inputTensors16, 2);

    auto pool17 = network->addPoolingNd(*cat16->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});
    pool17->setStrideNd(DimsHW{2, 2});
    auto l18 = convBnMish(network, weightMap, *pool17->getOutput(0), 256, 3, 1, 1, 18);

    // route_lhalf
    ISliceLayer* l19 = network->addSlice(*l18->getOutput(0), Dims3{0, 0, 0}, Dims3{128, INPUT_H / 16, INPUT_W / 16}, Dims3{1, 1, 1});

    auto l20 = convBnMish(network, weightMap, *l19->getOutput(0), 128, 3, 1, 1, 20);
    auto l21 = convBnMish(network, weightMap, *l20->getOutput(0), 128, 3, 1, 1, 21);
    ITensor* inputTensors22[] = {l21->getOutput(0), l20->getOutput(0)};
    auto cat22 = network->addConcatenation(inputTensors22, 2);
    auto l23 = convBnMish(network, weightMap, *cat22->getOutput(0), 256, 1, 1, 0, 23);

    // route all
    ITensor* inputTensors24[] = {l18->getOutput(0), l23->getOutput(0)};
    auto cat24 = network->addConcatenation(inputTensors24, 2);

    auto pool25 = network->addPoolingNd(*cat24->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});
    pool25->setStrideNd(DimsHW{2, 2});
    auto l26 = convBnMish(network, weightMap, *pool25->getOutput(0), 512, 3, 1, 1, 26);

    // ---------
    auto l27 = convBnMish(network, weightMap, *l26->getOutput(0), 256, 1, 1, 0, 27);
    auto l28 = convBnMish(network, weightMap, *l27->getOutput(0), 512, 3, 1, 1, 28);
    IConvolutionLayer* conv29 = network->addConvolutionNd(*l28->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap["module_list.29.Conv2d.weight"], weightMap["module_list.29.Conv2d.bias"]);
    assert(conv29);
    // 30 is a yolo layer

    auto l31 = l27;
    auto l32 = convBnMish(network, weightMap, *l31->getOutput(0), 128, 1, 1, 0, 32);
    auto deconv33 = upSample(network, weightMap, *l32->getOutput(0), 128);

    ITensor* inputTensors34[] = {deconv33->getOutput(0), l23->getOutput(0)};
    auto cat34 = network->addConcatenation(inputTensors34, 2);

    auto l35 = convBnMish(network, weightMap, *cat34->getOutput(0), 256, 3, 1, 1, 35);
    IConvolutionLayer* conv36 = network->addConvolutionNd(*l35->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap["module_list.36.Conv2d.weight"], weightMap["module_list.36.Conv2d.bias"]);
    assert(conv36);
    // 37 is a yolo layer

    auto l38 = l35;
    auto l39 = convBnMish(network, weightMap, *l38->getOutput(0), 64, 1, 1, 0, 39);
    auto deconv40 = upSample(network, weightMap, *l39->getOutput(0), 64);

    ITensor* inputTensors41[] = {deconv40->getOutput(0), l15->getOutput(0)};
    auto cat41 = network->addConcatenation(inputTensors41, 2);

    auto l42 = convBnMish(network, weightMap, *cat41->getOutput(0), 128, 3, 1, 1, 42);

    IConvolutionLayer* conv43 = network->addConvolutionNd(*l42->getOutput(0), 3 * (Yolo::CLASS_NUM + 5), DimsHW{1, 1}, weightMap["module_list.43.Conv2d.weight"], weightMap["module_list.43.Conv2d.bias"]);
    assert(conv43);
    // 44 is a yolo layer

    auto creator = getPluginRegistry()->getPluginCreator("YoloLayer_TRT", "1");
    const PluginFieldCollection *pluginData = creator->getFieldNames();
    IPluginV2 *pluginObj = creator->createPlugin("yololayer", pluginData);
    ITensor *inputTensors_yolo[] = {conv29->getOutput(0), conv36->getOutput(0), conv43->getOutput(0)};
    auto yolo = network->addPluginV2(inputTensors_yolo, 3, *pluginObj);

    yolo->getOutput(0)->setName(OUTPUT_BLOB_NAME);
    std::cout << "set name out" << std::endl;
    network->markOutput(*yolo->getOutput(0));

    // Build engine
    builder->setMaxBatchSize(maxBatchSize);
    config->setMaxWorkspaceSize(16 * (1 << 20)); // 16MB
#ifdef USE_FP16
    config->setFlag(BuilderFlag::kFP16);
#endif
    ICudaEngine *engine = builder->buildEngineWithConfig(*network, *config);
    std::cout << "build out" << std::endl;

    // Don't need the network any more
    network->destroy();

    // Release host memory
    for (auto &mem : weightMap)
    {
        free((void *)(mem.second.values));
    }

    return engine;
}

Secondly, here is yololayer.h:

#ifndef _YOLO_LAYER_H
#define _YOLO_LAYER_H

#include <assert.h>
#include <cmath>
#include <string.h>
#include <cublas_v2.h>
#include "NvInfer.h"
#include "Utils.h"
#include <iostream>

namespace Yolo
{
    static constexpr int CHECK_COUNT = 3;
    static constexpr float IGNORE_THRESH = 0.1f;
    static constexpr int MAX_OUTPUT_BBOX_COUNT = 1000;
    static constexpr int CLASS_NUM = 3;
    static constexpr int INPUT_H = 416;
    static constexpr int INPUT_W = 416;

    struct YoloKernel
    {
        int width;
        int height;
        float anchors[CHECK_COUNT*2];
    };

    static constexpr YoloKernel yolo1 = {
        INPUT_W / 8,
        INPUT_H / 8,
        {12,16, 19,36, 40,28}
    };
    static constexpr YoloKernel yolo2 = {
        INPUT_W / 16,
        INPUT_H / 16,
        {36,75, 76,55, 72,146}
    };
    static constexpr YoloKernel yolo3 = {
        INPUT_W / 32,
        INPUT_H / 32,
        {142,110, 192,243, 459,401}
    };

    static constexpr int LOCATIONS = 4;
    struct alignas(float) Detection{
        //x y w h
        float bbox[LOCATIONS];
        float det_confidence;
        float class_id;
        float class_confidence;
    };
}

namespace nvinfer1
{
    class YoloLayerPlugin: public IPluginV2IOExt
    {
        public:
            explicit YoloLayerPlugin();
            YoloLayerPlugin(const void* data, size_t length);

            ~YoloLayerPlugin();

            int getNbOutputs() const override
            {
                return 1;
            }

            Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override;

            int initialize() override;

            virtual void terminate() override {};

            virtual size_t getWorkspaceSize(int maxBatchSize) const override { return 0;}

            virtual int enqueue(int batchSize, const void*const * inputs, void** outputs, void* workspace, cudaStream_t stream) override;

            virtual size_t getSerializationSize() const override;

            virtual void serialize(void* buffer) const override;

            bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const override {
                return inOut[pos].format == TensorFormat::kLINEAR && inOut[pos].type == DataType::kFLOAT;
            }

            const char* getPluginType() const override;

            const char* getPluginVersion() const override;

            void destroy() override;

            IPluginV2IOExt* clone() const override;

            void setPluginNamespace(const char* pluginNamespace) override;

            const char* getPluginNamespace() const override;

            DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const override;

            bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const override;

            bool canBroadcastInputAcrossBatch(int inputIndex) const override;

            void attachToContext(
                    cudnnContext* cudnnContext, cublasContext* cublasContext, IGpuAllocator* gpuAllocator) override;

            void configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput) override;

            void detachFromContext() override;

        private:
            void forwardGpu(const float *const * inputs,float * output, cudaStream_t stream,int batchSize = 1);
            int mClassCount;
            int mKernelCount;
            std::vector<Yolo::YoloKernel> mYoloKernel;
            int mThreadCount = 256;
            void** mAnchor;
            const char* mPluginNamespace;
    };

    class YoloPluginCreator : public IPluginCreator
    {
        public:
            YoloPluginCreator();

            ~YoloPluginCreator() override = default;

            const char* getPluginName() const override;

            const char* getPluginVersion() const override;

            const PluginFieldCollection* getFieldNames() override;

            IPluginV2IOExt* createPlugin(const char* name, const PluginFieldCollection* fc) override;

            IPluginV2IOExt* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override;

            void setPluginNamespace(const char* libNamespace) override
            {
                mNamespace = libNamespace;
            }

            const char* getPluginNamespace() const override
            {
                return mNamespace.c_str();
            }

        private:
            std::string mNamespace;
            static PluginFieldCollection mFC;
            static std::vector<PluginField> mPluginAttributes;
    };

};

#endif 
tjuskyzhang commented 3 years ago

try:

static constexpr YoloKernel yolo1 = {
    INPUT_W / 8,
    INPUT_H / 8,
    {142, 110, 192, 243, 459, 401}
};
static constexpr YoloKernel yolo2 = {
    INPUT_W / 16,
    INPUT_H / 16,
    {36, 75, 76, 55, 72, 146}
};
static constexpr YoloKernel yolo3 = {
    INPUT_W / 32,
    INPUT_H / 32,
    {12, 16, 19, 36, 40, 28}
};

casper-hansen commented 3 years ago

@tjuskyzhang I have now tried your update. It was a step in the right direction. The results are looking better with bigger bounding boxes.

Do you have any more suggestions? Please find my full code attached here. Let me know if you need anything else.

yolov4-tiny-3l-tensorrt.zip

Here is the detection result:

[image]

tjuskyzhang commented 3 years ago

try:

static constexpr YoloKernel yolo1 = {
    INPUT_W / 32,
    INPUT_H / 32,
    {142, 110, 192, 243, 459, 401}
};
static constexpr YoloKernel yolo2 = {
    INPUT_W / 16,
    INPUT_H / 16,
    {36, 75, 76, 55, 72, 146}
};
static constexpr YoloKernel yolo3 = {
    INPUT_W / 8,
    INPUT_H / 8,
    {12, 16, 19, 36, 40, 28}
};

If it doesn't work, please provide me with your weights and a test picture.

casper-hansen commented 3 years ago

@tjuskyzhang incredible how such small changes can yield such a big effect!

The results are now perfect -- exactly the same as the Scaled YOLOv4 repository.

Thank you so much for the help. I can do a pull request on this config if you want.

[image]