pjreddie / darknet

Convolutional Neural Networks
http://pjreddie.com/darknet/

Error in network layer extraction in GPU mode #384

Open zhengthomastang opened 6 years ago

zhengthomastang commented 6 years ago

I followed the example of test_detector() in examples/detector.c to make a C++ interface for YOLO. The code is compiled using g++ and C++11 (the darknet.h file and libdarknet.so file must be in the same folder): g++ main.cpp -L. -ldarknet -o main -std=c++11

It works perfectly in CPU mode (GPU=0, CUDNN=0), however, it fails in GPU mode (GPU=1, CUDNN=1). The problem occurs in network layer extraction which is at line 36: layer l = net->layers[net->n-1];

In both modes, net->n-1 always returns 31. In CPU mode the extracted network layer looks normal, with l.w==19, l.h==19 and l.n==5 for data/dog.jpg. In GPU mode, however, the extracted layer is empty, with l.w==0, l.h==0 and l.n==0. This error later leads to a segmentation fault at line 60.

My complete code is attached as follows.

extern "C"
{
#include "darknet.h"
}
#include <stdlib.h>
#include <stdio.h>
#include <vector>
#include <algorithm>
#include <iostream>
#include <sstream>
#include <fstream>

using namespace std;

int main(){
    float thresh = .5, hier_thresh = .5;
    metadata meta=get_metadata("cfg/coco.data");
    network *net = load_network("cfg/yolo.cfg", "yolo.weights", 0);
    set_batch_network(net, 1);
    srand(2222222);
    double time;
    char buff[256];
    char *input = buff;
    int j;
    float nms=.3;
    while(1){
        printf("Enter Image Path: ");
        fflush(stdout);
        input = fgets(input, 256, stdin);
        if(!input) return 0;
        strtok(input, "\n");
        //detect(net, meta, input);
        image im = load_image_color(input,0,0);
        image sized = letterbox_image(im, net->w, net->h);

        layer l = net->layers[net->n-1];
        int num=l.w * l.h * l.n;
        box *boxes = static_cast<box *>(calloc(num, sizeof(box)));
        float **probs = static_cast<float **>(calloc(num, sizeof(float *)));
        for(j = 0; j < num; ++j) probs[j] = static_cast<float *>(calloc(l.classes + 1, sizeof(float)));  // element type is float, not float *
        float **masks = 0;
        if (l.coords > 4){
            masks = static_cast<float**>(calloc(num, sizeof(float*)));
            for(j = 0; j < num; ++j) masks[j] = static_cast<float*>(calloc(l.coords-4, sizeof(float)));  // element type is float, not float *
        }

        float *X = sized.data;
        time=what_time_is_it_now();
        network_predict(net, X);
        printf("%s: Predicted in %f seconds.\n", input, what_time_is_it_now()-time);
        get_region_boxes(l, im.w, im.h, net->w, net->h, thresh, probs, boxes, masks, 0, 0, hier_thresh, 1);
        if (nms) do_nms_sort(boxes, probs, num, l.classes, nms);

        int i;  // j is already declared above; redeclaring it here fails to compile with g++

        for(i = 0; i < num; ++i){
            for(j = 0; j < meta.classes; ++j){
                if (probs[i][j] > 0){
                    printf("%s: %.3f, %.3f, %.3f, %.3f, %.0f%%\n", meta.names[j], boxes[i].h, boxes[i].w, boxes[i].x, boxes[i].y, probs[i][j]*100);
                }
            }
        }

        free_image(im);
        free_image(sized);
        free(boxes);
        free_ptrs((void **)probs, num);
        if (masks) free_ptrs((void **)masks, num);  // also release mask buffers when allocated
    }

    return 0;
}
hsinyahsinya commented 6 years ago

Hi, I am also a new user and ran into this problem in my program. Has anyone solved this error? Regards

ghost commented 6 years ago

I faced this problem too a few days ago. I "solved" it by compiling Darknet's code together with my application. I suspect it has something to do with the mixing of nvcc and gcc/clang, specifically with code and/or data address space attribution.

christiandreher commented 6 years ago

I have a similar problem. It can be reproduced with basically this main.c:

#include <stdio.h>
#include <stdlib.h>
#include <darknet.h>

// Paste test_detector from examples/detector.c here

int main(int argc, char* argv[])
{
    cuda_set_device(0);

    char* datacfg = "cfg/coco.data";
    char* cfg = "cfg/yolov3.cfg";
    char* weights = "yolov3.weights";
    char* filename = "data/dog.jpg";

    test_detector(datacfg, cfg, weights, filename, .5, .5, 0, 0);

    return 0;
}

This uses the test_detector function from examples/detector.c. It will not work out of the box, because the line layer l = net->layers[net->n-1]; returns an uninitialised layer struct (i.e. l.classes is 0 instead of 80, the value for the default YOLOv3 config). The result is that nothing gets detected.

Replacing the occurrences of l.classes with a hard-coded 80 makes it work, though.

Is there anything I can do to make it work while still using Darknet as a shared object? And obviously without hard-coding such values.

arjun-kava commented 6 years ago

hey @0xf3rn4nd0, can you share your CMake file? Thank you in advance.

HodenX commented 6 years ago

I also came across this issue, but I don't think it's a problem with nvcc or gcc. Adding -DGPU to your Makefile can be a solution.

christiandreher commented 6 years ago

My Makefile starts with

GPU=1

...

Still, I have this problem. Like OP said, the error is gone in CPU mode.

ghost commented 6 years ago

@christiandreher does the problem persist if you compile Darknet's library with -fPIC?

@arjun-kava Sorry for taking so long to reply. I didn't use CMake, just make like Darknet itself. Here's an example where I was also using OpenCV. There are 3 targets: darknet-cuda, darknet-framework, and main. With the Makefile below, I then run:

$ make darknet-cuda
$ make darknet-framework
$ make main

OPENCV_DIR = $(shell pwd)/ext/opencv
DARKNET_DIR = $(shell pwd)/ext/darknet
BIN_DIR = $(shell pwd)/bin

NVCC = nvcc
NV_CFLAGS = -DGPU -DCUDNN -I$(DARKNET_DIR)/include \
            -gencode arch=compute_60,code=sm_60 -c

NV_CFLAGS_EXTRA = -Wall -fPIC
NV_LFLAGS = -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -lcurand -lcudnn

DARKNET_CFLAGS = -DGPU -DCUDNN -Wall -O3 -march=native \
                 -I/usr/local/cuda/include/ -I$(DARKNET_DIR)/include -c

CC = clang
CFLAGS = -DGPU -DCUDNN -std=c++11 -Wall -O3 -g -I$(OPENCV_DIR)/include \
         -I/usr/local/cuda/include/ -I$(DARKNET_DIR)/include -fPIC

LFLAGS = -L$(OPENCV_DIR)/lib \
     -Wl,-rpath,$(OPENCV_DIR)/lib \
         -L/usr/local/cuda/lib64 \
         -lopencv_core -lopencv_videoio -lopencv_imgproc \
         -lopencv_highgui -lstdc++ -lpthread -lm \
         -lcuda -lcudart -lcublas -lcurand -lcudnn

darknet-cuda: $(DARKNET_DIR)/src/*.cu
    mkdir -p $(BIN_DIR)
    pushd $(BIN_DIR) > /dev/null                                    && \
    $(NVCC) $? $(NV_CFLAGS) --compiler-options "$(NV_CFLAGS_EXTRA)" && \
    popd > /dev/null

darknet-framework: $(DARKNET_DIR)/src/*.c
    mkdir -p $(BIN_DIR)
    pushd $(BIN_DIR) > /dev/null && \
    $(CC) $? $(DARKNET_CFLAGS)   && \
    popd > /dev/null

main: main.cpp $(BIN_DIR)/*.o
    mkdir -p $(BIN_DIR)
    $(CC) $? $(CFLAGS) $(LFLAGS) -o $(BIN_DIR)/$@

clean:
    rm -rf $(BIN_DIR)

By the way, Darknet's examples also compile everything together. Maybe it is easier to start from there, since you already know that setup works.

christiandreher commented 6 years ago

@0xf3rn4nd0 Thank you for your feedback.

I am quite certain that I did not change anything in the Makefile since then, and looking at it now, there is this line:

CFLAGS=-Wall -Wno-unused-result -Wno-unknown-pragmas -Wfatal-errors -fPIC
#                                                                   ^^^^^

So it seems it was compiled with -fPIC in the first place. I also just recompiled Darknet (just to be sure) and tried to reproduce the error: it still fails for me. I do not get an error, but l.classes is zero.

Please take note that my code changed a lot since then and this is not really an issue for me anymore. However, if you're trying to debug this, feel free to ask me anything.

bvnp44 commented 5 years ago

Why is there srand(2222222);? Can anybody explain?

christiandreher commented 5 years ago

@bvnp43 It is just an arbitrary number used to seed the PRNG

ou525 commented 5 years ago

@christiandreher Were you able to solve this problem? If so, how?

christiandreher commented 5 years ago

@ou525 No, like I already mentioned: I worked around it, since I read the names file anyway. That way I already have the total number of classes, so I just pass the size of the names vector (I am using C++) instead of l.classes.

However, if you know the total number of classes in advance, you can simply hard-code the number to work around it. It's not pretty, but it should work. Otherwise you can also read the names file and count the classes.

I am certain there's a better way, or even a fix, but for me it was not worth the effort to investigate further. Hope this helps...

bvnp44 commented 5 years ago

@christiandreher thanks. @ou525 There is another fork of this repo which contains some CPU/GPU optimizations and bug fixes: https://github.com/AlexeyAB/darknet. Maybe this bug is not present there. I'm working with that fork.

ou525 commented 5 years ago

@christiandreher thank you, I also worked around it this way. It doesn't affect my use case, so I may not need to investigate further.

ou525 commented 5 years ago

@bvnp43 thanks, let me try

acidtonic commented 5 years ago

I am running into this and was trying to avoid having to build darknet within my application.

Are there any other things to try? I am building with the Intel compiler (icc), and I assume nvcc and icc may not be getting along.

toddwong commented 5 years ago

I solved this by adding the GPU/CUDNN macro definitions before including darknet.h. The problem is that when you change GPU and/or CUDNN to 1 in the Makefile, the struct layer changes its fields and layout accordingly.

PS: I found that many functions in the source code take a parameter or return type of struct layer (not a const pointer), which, as I understand it, means the whole struct is copied when passing layers into or out of these functions. That is actually more than 1.5 KB of memory per copy.