jolibrain / deepdetect

Deep Learning API and Server in C++14, support for PyTorch, TensorRT, Dlib, NCNN, Tensorflow, XGBoost and TSNE
https://www.deepdetect.com/

coredump/segfault when training depending on db status #748

Open dgtlmoon opened 4 years ago

dgtlmoon commented 4 years ago

Configuration

16 GB GPU, 32 GB RAM; nvidia-smi seems to work fine

Your question / the problem you're facing:

DeepDetect [ commit 88e93254ead67a8032166e98af3e46837fbba039 ]

I'm able to segfault the server when training, depending on whether db is absent or set to false.

My end goal is to finetune a Vgg16 and use those weights in simsearch to see if it performs better in my project. I found this bug while trying to solve a "solver creation exception" error (still haven't solved that one, though!).

Error message (if any) / steps to reproduce the problem:

rm models/vgg16/*txt models/vgg16/*json models/vgg16/*model models/vgg16/*proto
rm -rf models/vgg16/lmdb
sleep 1

curl -X PUT "http://localhost:8080/services/tag_detect_vgg16" -d '{ "mllib": "caffe", "description": "tag detector vgg16", "type": "supervised", "parameters": { "input": { "connector": "image", "width":224, "height":224 }, "mllib": { "finetuning": true, "nclasses":3, "weights": "/tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel", "template": "vgg_16" } }, "model": { "templates": "../templates/caffe/", "repository": "/tags_dataset/models/vgg16", "create_repository": true } }'

sleep 2

curl -X POST "http://localhost:8080/train" -d '{ "service": "tag_detect_vgg16", "async": true, "parameters": { "input": { "connector": "image", "test_split":0.1, "shuffle":true, "width":224, "height":224 }, "mllib": { "gpu": true, "resume": false, "net": { "batch_size": 2 }, "solver": { "iterations": 80000, "test_interval": 500, "snapshot": 1000, "solver_type": "RMSPROP", "base_lr": 0.001 }, "noise":{"all_effects":true, "prob":0.001}, "distort":{"all_effects":true, "prob":0.01}, "bbox": true }, "output": { "measure": [ "map", "map_1", "map_2", "map_3" ] } }, "data": [ "/tags_dataset/train.txt", "/tags_dataset/test.txt" ] }'

Here's the segfault; it would be really nice to get a better error message here:

[2020-06-29 19:43:52.394] [caffe] [info] Read 492.000000 images with 0.000000 labels
[2020-06-29 19:43:52.395] [api] [info] 172.17.0.1 "POST /train" 201 1
[2020-06-29 19:43:52.395] [caffe] [info] Opened lmdb /tags_dataset/models/vgg16/test.lmdb
Segmentation fault

If I include `db: true`, for example:

curl -X POST "http://localhost:8080/train" -d '{ "service": "tag_detect_vgg16", "async": true, "parameters": { "input": { "shuffle": true, "db": true,


then I get a different error:

[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] Net total flops=15466180608 / total params=134260416
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] detected network type is classification
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] user batch_size=2 / inputc batch_size=
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] batch_size=2 / test_batch_size=2 / test_iter=223
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] input db = true
[2020-06-29 19:49:06.031] [caffe] [info] Initializing solver from parameters:
[2020-06-29 19:49:06.032] [caffe] [info] Creating training net specified in net_param.
[2020-06-29 19:49:06.032] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer vgg16
[2020-06-29 19:49:06.032] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer probt
[2020-06-29 19:49:06.032] [caffe] [info] Initializing net from parameters:
[2020-06-29 19:49:06.032] [caffe] [info] Creating layer / name=data / type=ImageData
[2020-06-29 19:49:06.033] [caffe] [info] Creating Layer data
[2020-06-29 19:49:06.033] [caffe] [info] data -> data
[2020-06-29 19:49:06.033] [caffe] [info] data -> label
[2020-06-29 19:49:06.033] [caffe] [info] Opening file

(end of dd logs)

"body": {
    "Error": {
        "code": 500,
        "msg": "InternalError",
        "dd_code": 500,
        "dd_msg": "solver creation exception"
    }
}



(It would be nice to get a better message here too, but this is at the Caffe layer, right? Any clues?)

Server log output:

[2020-06-29 19:43:31.612] [api] [info] Running DeepDetect HTTP server on 0.0.0.0:8080
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] instantiating model template vgg_16
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] source=../templates/caffe//vgg_16/
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] dest=/tags_dataset/models/vgg16/vgg_16.prototxt
[2020-06-29 19:43:40.209] [api] [info] 172.17.0.1 "PUT /services/tag_detect_vgg16" 201 758
[2020-06-29 19:43:52.394] [caffe] [info] Opening file /tags_dataset/test.txt
[2020-06-29 19:43:52.394] [caffe] [info] Read 492.000000 images with 0.000000 labels
[2020-06-29 19:43:52.395] [api] [info] 172.17.0.1 "POST /train" 201 1
[2020-06-29 19:43:52.395] [caffe] [info] Opened lmdb /tags_dataset/models/vgg16/test.lmdb
Segmentation fault
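If it helps, I can also grab a backtrace by running the server under gdb inside the container - a sketch, assuming the dede binary in /opt/deepdetect/build/main and the default host/port:

    # run the server under gdb; print a backtrace once it segfaults
    cd /opt/deepdetect/build/main
    gdb -ex run -ex bt --args ./dede -host 0.0.0.0 -port 8080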

dgtlmoon commented 4 years ago

I believe this is caused by

      "db_width": 224,
      "db_height": 224

missing from the `"parameters": { "input": { ... } }` section of the /train call
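i.e. I'd expect the /train body to need something like this (the dimensions just mirror my 224x224 service definition; not verified as the actual fix, the rest of the call unchanged):

    "parameters": {
      "input": {
        "connector": "image",
        "width": 224,
        "height": 224,
        "db_width": 224,
        "db_height": 224
      }
    }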

dgtlmoon commented 4 years ago

Note: I still get `"dd_msg": "solver creation exception"` when I create the service with

  "parameters": {
    "input": {
      "connector": "image",
       "width":224,
       "height":224

and I have in the /train call...

    "parameters": {
        "input": {
          "db_width": 224,
           "db_height": 224,
           "db": true

And the following will cause a segfault

    "parameters": {
        "input": {
          "db_width": 224,
           "db_height": 224,
           "db": false
dgtlmoon commented 4 years ago

#742 maybe related?

dgtlmoon commented 4 years ago

Note: https://www.deepdetect.com/server/docs/train-image-classifier/ is missing the `db: true` part as well. The "solver creation exception" was first mentioned at https://gitter.im/beniz/deepdetect?at=5ec942692c49c45f5a994000

dgtlmoon commented 4 years ago

Same segfault on the docker images jolibrain/deepdetect_cpu and jolibrain/deepdetect_gpu.
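For reference, I start the containers roughly like this (the host volume path is just illustrative of my local setup; the GPU image assumes the NVIDIA container toolkit is installed):

    # CPU image
    docker run -d -p 8080:8080 -v /my/tags_dataset:/tags_dataset jolibrain/deepdetect_cpu
    # GPU image
    docker run -d --gpus all -p 8080:8080 -v /my/tags_dataset:/tags_dataset jolibrain/deepdetect_gpu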

beniz commented 4 years ago

Hi, thanks for the thorough report, as we try to track down every possible crash. We'll investigate the crash; however, I believe your issue with training the model lies elsewhere, see below:

Let us know how this goes.

dgtlmoon commented 4 years ago

@beniz

1) I'm training (finetuning) an object detector, not a classifier, but the important thing here is to have a fine-tuned model for simsearch (per your recommendation to use Vgg16 for simsearch, but I found the results could be a lot better, I assume by fine-tuning; I also need to extract image objects from the scene).
2) Training an image classifier requires db to be set to true - that's what this bug is about, I know that :)
3) I believe my measure field is correct because I'm training for object detection (finetuning using object detection, is that right?), but I also tried with the example values and there is no change, still getting the exception (both measure sets are shown after the listing below).
4) https://www.deepdetect.com/server/docs/train-image-classifier/ is still missing db: true.
5) I have the vgg16 caffemodel in the right place:

dd@44b35786ec21:/opt/deepdetect/build/main$ ls -al /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
-rwxrwxrwx 1 dd dd 553432081 Mar 22  2019 /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
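(For point 3 above, these are the two measure sets I've tried - the detection-style one from my original call and the classification-style example values - both end in the same exception:)

    "output": { "measure": [ "map", "map_1", "map_2", "map_3" ] }
    "output": { "measure": [ "acc", "mcll", "f1" ] }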

Using the following still does not fix my solver creation exception

curl -X PUT "http://localhost:8080/services/tag_detect_vgg16" -d '{
  "mllib": "caffe",
  "description": "tag detector vgg16",
  "type": "supervised",
  "parameters": {
    "input": {
      "connector": "image",
      "width": 224,
      "height": 224
    },
    "mllib": {
      "finetuning": true,
       "nclasses": 3,
      "template": "vgg_16",
      "weights" : "VGG_ILSVRC_16_layers.caffemodel"
    }
  },
  "model": {
    "templates": "../templates/caffe/",
    "repository": "/tags_dataset/models/vgg16"
  }
}'
docker logs dd_tags

sleep 3

curl -X POST "http://localhost:8080/train" -d '{
    "service": "tag_detect_vgg16",
    "async": true,
    "parameters": {
    "input": {
      "db": true,
           "connector": "image",
      "db_width": 224,
      "db_height": 224
    },
    "mllib": {
        "gpu": true,
        "mirror":true,
        "net": {
        "batch_size": 2
        },
        "solver": {
        "iterations": 80000,
        "test_interval": 500,
        "snapshot": 1000,
        "solver_type": "RMSPROP",
        "base_lr": 0.0001
        },
        "noise":{"all_effects":true, "prob":0.001},
        "distort":{"all_effects":true, "prob":0.01},
        "bbox": true
    },
         "output": {
             "measure":["acc","mcll","f1"]
         }
    },
    "data": [
    "/tags_dataset/train.txt",
    "/tags_dataset/test.txt"
    ]
}'
[2020-06-30 08:48:03.860] [tag_detect_vgg16] [info] Using pre-trained weights from /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 553432081
[2020-06-30 08:48:04.493] [caffe] [info] Attempting to upgrade input file specified using deprecated V1LayerParameter: /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
[2020-06-30 08:48:05.335] [caffe] [info] Successfully upgraded file specified using deprecated V1LayerParameter
[2020-06-30 08:48:05.470] [caffe] [info] Ignoring source layer fc8
[2020-06-30 08:48:05.499] [tag_detect_vgg16] [info] Net total flops=15466180608 / total params=134260416
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] detected network type is classification
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] user batch_size=2 / inputc batch_size=
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] batch_size=2 / test_batch_size=2 / test_iter=2229
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] input db = true
[2020-06-30 08:48:05.500] [caffe] [info] Initializing solver from parameters: 
[2020-06-30 08:48:05.500] [caffe] [info] Creating training net specified in net_param.
[2020-06-30 08:48:05.500] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer vgg16
[2020-06-30 08:48:05.500] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer probt
[2020-06-30 08:48:05.500] [caffe] [info] Initializing net from parameters: 
[2020-06-30 08:48:05.500] [caffe] [info] Creating layer / name=data / type=ImageData
[2020-06-30 08:48:05.500] [caffe] [info] Creating Layer data
[2020-06-30 08:48:05.501] [caffe] [info] data -> data
[2020-06-30 08:48:05.501] [caffe] [info] data -> label
[2020-06-30 08:48:05.501] [caffe] [info] Opening file 

(end)

{
    "status": {
        "code": 200,
        "msg": "OK"
    },
    "head": {
        "method": "/train",
        "job": 1,
        "status": "error"
    },
    "body": {
        "Error": {
            "code": 500,
            "msg": "InternalError",
            "dd_code": 500,
            "dd_msg": "solver creation exception"
        }
    }
}

beniz commented 4 years ago

You can't train a detector with vgg16. Simsearch by default uses a classification model. Look at https://www.deepdetect.com/applications/img_simsearch/
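Roughly, the classification-style /train call would drop the bbox / map parts and keep db: true with classification measures - an untested sketch reusing the values from your call above:

    curl -X POST "http://localhost:8080/train" -d '{
      "service": "tag_detect_vgg16",
      "async": true,
      "parameters": {
        "input": { "connector": "image", "db": true, "width": 224, "height": 224, "test_split": 0.1, "shuffle": true },
        "mllib": {
          "gpu": true,
          "mirror": true,
          "net": { "batch_size": 2 },
          "solver": { "iterations": 80000, "test_interval": 500, "snapshot": 1000, "solver_type": "RMSPROP", "base_lr": 0.0001 }
        },
        "output": { "measure": [ "acc", "mcll", "f1" ] }
      },
      "data": [ "/tags_dataset/train.txt", "/tags_dataset/test.txt" ]
    }'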

dgtlmoon commented 4 years ago

@beniz ok great, so it sounds like I was confused by your recommendation to use vgg16 with my object detector :) In that case, I have about 8 classes of images to train on, with ~100,000 images in each class.

So then I'll use the finetuned output of the trained vgg16 image classifier as my simsearch model? Does that sound right? Thanks!

dgtlmoon commented 4 years ago

And use squeezenet as the object detector, chained with my finetuned vgg16 model for simsearch.