Open dgtlmoon opened 4 years ago
I believe this is caused by
"db_width": 224,
"db_height": 224
missing from the "parameters": { "input": { ... } } section of the /train call.
Note: I still get "dd_msg": "solver creation exception"
when I create the service with
"parameters": {
"input": {
"connector": "image",
"width":224,
"height":224
and I have the following in the /train call:
"parameters": {
"input": {
"db_width": 224,
"db_height": 224,
"db": true
And the following causes a segfault:
"parameters": {
"input": {
"db_width": 224,
"db_height": 224,
"db": false
Note: https://www.deepdetect.com/server/docs/train-image-classifier/ is missing the "db": true part as well.
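Since the crash only appears with particular combinations of "db", "db_width", and "db_height" in the /train payload, a client-side guard can reject those combinations before they ever reach the server. This is a hypothetical helper, not part of DeepDetect; it simply encodes the observations above:

```python
# Hypothetical pre-flight check (NOT part of DeepDetect) that flags the
# /train "input" combinations reported to crash or fail in this issue.

def check_train_input(input_params: dict) -> list:
    """Return a list of problems found in a /train "parameters.input" dict."""
    problems = []
    has_db_dims = "db_width" in input_params or "db_height" in input_params
    db = input_params.get("db")
    if has_db_dims and db is not True:
        # db absent, or db set to false, alongside db_width/db_height is
        # the combination observed to segfault the server.
        problems.append('"db": true is required when db_width/db_height are set')
    if db is True and not ("db_width" in input_params and "db_height" in input_params):
        problems.append('"db_width" and "db_height" should be set alongside "db": true')
    return problems
```

Running such a check before the curl/POST step turns a server crash into an actionable client-side error message.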
solver creation exception
first mentioned at https://gitter.im/beniz/deepdetect?at=5ec942692c49c45f5a994000
same segfault on docker jolibrain/deepdetect_cpu
and jolibrain/deepdetect_gpu
Hi, thanks for the thorough report; we try to eliminate any possible crash. We'll investigate the crash, but I believe your issue with training the model lies elsewhere, see below:
"weights": "/tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel"
, you should set "weights":"VGG_ILSVRC_16_layers.caffemodel
and copy the file VGG_ILSVRC_16_layers.caffemodel
into the model director beforehand
Training an image classifier requires one directory per class, with images in each directory, and you want to pass the directory to the data
field:
"data": [
"/path/to/directories/"
]
see the full documentation here: https://www.deepdetect.com/server/docs/train-image-classifier/
Training an image classifier requires db
to be set to true
Your measure
field is wrong, see the doc pointer above, it should have everything you need.
Let us know how this goes.
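The weights fix described above (use only the file name in "weights", after copying the .caffemodel into the model repository) can be sketched as a small staging helper. The function name and paths here are illustrative, not part of any DeepDetect tooling:

```python
# Sketch of the setup step described above: copy the pre-trained weights
# into the model repository, then refer to them by bare file name in the
# service's "weights" field. Paths are hypothetical examples.

import os
import shutil

def stage_weights(caffemodel_path: str, model_repo: str) -> str:
    """Copy the weights file into model_repo and return the bare file
    name to put in the service creation call's "weights" field."""
    os.makedirs(model_repo, exist_ok=True)
    name = os.path.basename(caffemodel_path)
    dest = os.path.join(model_repo, name)
    if os.path.abspath(caffemodel_path) != os.path.abspath(dest):
        shutil.copy2(caffemodel_path, dest)
    return name
```

For example, `stage_weights("/tags_dataset/VGG_ILSVRC_16_layers.caffemodel", "/tags_dataset/models/vgg16")` would return `"VGG_ILSVRC_16_layers.caffemodel"`, the value to use for "weights".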
@beniz
1) I'm training (finetuning) an object detector, not a classifier, but the important thing here is to have a fine-tuned model for simsearch (per your recommendation to use VGG16 for simsearch; I found the results could be a lot better, I assume by fine-tuning. I also need to extract image objects from the scene).
2) Training an image classifier requires db to be set to true
- that's what this bug is about, I know that :)
3) I believe my measure field is correct because I'm training for object detection (finetuning using object detection, is that right?), but I also tried with the example values and there is no change; I still get the exception.
4) https://www.deepdetect.com/server/docs/train-image-classifier/ is still missing db: true
5) I have the vgg16 caffemodel in the right place
dd@44b35786ec21:/opt/deepdetect/build/main$ ls -al /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
-rwxrwxrwx 1 dd dd 553432081 Mar 22 2019 /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
Using the following still does not fix my solver creation exception
curl -X PUT "http://localhost:8080/services/tag_detect_vgg16" -d '{
"mllib": "caffe",
"description": "tag detector vgg16",
"type": "supervised",
"parameters": {
"input": {
"connector": "image",
"width": 224,
"height": 224
},
"mllib": {
"finetuning": true,
"nclasses": 3,
"template": "vgg_16",
"weights" : "VGG_ILSVRC_16_layers.caffemodel"
}
},
"model": {
"templates": "../templates/caffe/",
"repository": "/tags_dataset/models/vgg16"
}
}'
docker logs dd_tags
sleep 3
curl -X POST "http://localhost:8080/train" -d '{
"service": "tag_detect_vgg16",
"async": true,
"parameters": {
"input": {
"db": true,
"connector": "image",
"db_width": 224,
"db_height": 224
},
"mllib": {
"gpu": true,
"mirror":true,
"net": {
"batch_size": 2
},
"solver": {
"iterations": 80000,
"test_interval": 500,
"snapshot": 1000,
"solver_type": "RMSPROP",
"base_lr": 0.0001
},
"noise":{"all_effects":true, "prob":0.001},
"distort":{"all_effects":true, "prob":0.01},
"bbox": true
},
"output": {
"measure":["acc","mcll","f1"]
}
},
"data": [
"/tags_dataset/train.txt",
"/tags_dataset/test.txt"
]
}'
[2020-06-30 08:48:03.860] [tag_detect_vgg16] [info] Using pre-trained weights from /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 553432081
[2020-06-30 08:48:04.493] [caffe] [info] Attempting to upgrade input file specified using deprecated V1LayerParameter: /tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel
[2020-06-30 08:48:05.335] [caffe] [info] Successfully upgraded file specified using deprecated V1LayerParameter
[2020-06-30 08:48:05.470] [caffe] [info] Ignoring source layer fc8
[2020-06-30 08:48:05.499] [tag_detect_vgg16] [info] Net total flops=15466180608 / total params=134260416
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] detected network type is classification
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] user batch_size=2 / inputc batch_size=
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] batch_size=2 / test_batch_size=2 / test_iter=2229
[2020-06-30 08:48:05.500] [tag_detect_vgg16] [info] input db = true
[2020-06-30 08:48:05.500] [caffe] [info] Initializing solver from parameters:
[2020-06-30 08:48:05.500] [caffe] [info] Creating training net specified in net_param.
[2020-06-30 08:48:05.500] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer vgg16
[2020-06-30 08:48:05.500] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer probt
[2020-06-30 08:48:05.500] [caffe] [info] Initializing net from parameters:
[2020-06-30 08:48:05.500] [caffe] [info] Creating layer / name=data / type=ImageData
[2020-06-30 08:48:05.500] [caffe] [info] Creating Layer data
[2020-06-30 08:48:05.501] [caffe] [info] data -> data
[2020-06-30 08:48:05.501] [caffe] [info] data -> label
[2020-06-30 08:48:05.501] [caffe] [info] Opening file
(end)
{
"status": {
"code": 200,
"msg": "OK"
},
"head": {
"method": "/train",
"job": 1,
"status": "error"
},
"body": {
"Error": {
"code": 500,
"msg": "InternalError",
"dd_code": 500,
"dd_msg": "solver creation exception"
}
}
}
You can't train a detector with vgg16. Simsearch by default uses a classification model. Look at https://www.deepdetect.com/applications/img_simsearch/
@beniz ok great, so it sounds like I was confused by your recommendation to use vgg16 with my object detector :). In that case, I have about 8 classes of images to train on, with ~100,000 images in each class.
So then I'll use the finetuned output of the trained vgg16 image classifier as my simsearch model? does that sound right? thanks!
And use squeezenet
as the object detector, chained with my vgg16 finetuned model for imgsearch
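The detector-then-simsearch handoff described above implies cropping each detected box out of the frame before indexing the crop. A minimal pure-Python sketch; the (xmin, ymin, xmax, ymax) bbox format and the row-major pixel grid are assumptions for illustration only:

```python
# Illustrative only: "extracting image objects from the scene" amounts to
# cropping each detection's bounding box before sending the crop on to
# the simsearch model. Works on any row-major 2D pixel grid.

def crop_bbox(pixels, bbox):
    """Return the sub-grid covered by bbox = (xmin, ymin, xmax, ymax),
    with the max edges exclusive."""
    xmin, ymin, xmax, ymax = bbox
    return [row[xmin:xmax] for row in pixels[ymin:ymax]]
```

In practice the same slicing would be done on a decoded image (e.g. a NumPy array) using the detector's returned box coordinates.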
Configuration
Docker version 19.03.1, build 74b1e89
88e93254ead67a8032166e98af3e46837fbba039
16Gb GPU, 32Gb RAM,
nvidia-smi
seems to work fine

Your question / the problem you're facing:
DeepDetect [ commit 88e93254ead67a8032166e98af3e46837fbba039 ]
I'm able to segfault the server when training, depending on whether db is absent or db is set to false.

My end goal is to finetune a VGG16 and use those weights in simsearch to see if it performs better in my project; I found this bug while trying to solve the solver creation exception error (still didn't solve it, though!).

Error message (if any) / steps to reproduce the problem:
rm models/vgg16/txt models/vgg16/json models/vgg16/model models/vgg16/proto
rm -rf models/vgg16/lmdb
sleep 1
curl -X PUT "http://localhost:8080/services/tag_detect_vgg16" -d '{
  "mllib": "caffe",
  "description": "tag detector vgg16",
  "type": "supervised",
  "parameters": {
    "input": {
      "connector": "image",
      "width": 224,
      "height": 224
    },
    "mllib": {
      "finetuning": true,
      "nclasses": 3,
      "weights": "/tags_dataset/models/vgg16/VGG_ILSVRC_16_layers.caffemodel",
      "template": "vgg_16"
    }
  },
  "model": {
    "templates": "../templates/caffe/",
    "repository": "/tags_dataset/models/vgg16",
    "create_repository": true
  }
}'
sleep 2
curl -X POST "http://localhost:8080/train" -d '{
  "service": "tag_detect_vgg16",
  "async": true,
  "parameters": {
    "input": {
      "connector": "image",
      "test_split": 0.1,
      "shuffle": true,
      "width": 224,
      "height": 224
    },
    "mllib": {
      "gpu": true,
      "resume": false,
      "net": {
        "batch_size": 2
      },
      "solver": {
        "iterations": 80000,
        "test_interval": 500,
        "snapshot": 1000,
        "solver_type": "RMSPROP",
        "base_lr": 0.001
      },
      "noise": {"all_effects": true, "prob": 0.001},
      "distort": {"all_effects": true, "prob": 0.01},
      "bbox": true
    },
    "output": {
      "measure": ["map", "map_1", "map_2", "map_3"]
    }
  },
  "data": [
    "/tags_dataset/train.txt",
    "/tags_dataset/test.txt"
  ]
}'
[2020-06-29 19:43:52.394] [caffe] [info] Read 492.000000 images with 0.000000 labels
[2020-06-29 19:43:52.395] [api] [info] 172.17.0.1 "POST /train" 201 1
[2020-06-29 19:43:52.395] [caffe] [info] Opened lmdb /tags_dataset/models/vgg16/test.lmdb
Segmentation fault
curl -X POST "http://localhost:8080/train" -d '{ "service": "tag_detect_vgg16", "async": true, "parameters": { "input": { "shuffle": true, "db": true,
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] Net total flops=15466180608 / total params=134260416
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] detected network type is classification
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] user batch_size=2 / inputc batch_size=
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] batch_size=2 / test_batch_size=2 / test_iter=223
[2020-06-29 19:49:06.031] [tag_detect_vgg16] [info] input db = true
[2020-06-29 19:49:06.031] [caffe] [info] Initializing solver from parameters:
[2020-06-29 19:49:06.032] [caffe] [info] Creating training net specified in net_param.
[2020-06-29 19:49:06.032] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer vgg16
[2020-06-29 19:49:06.032] [caffe] [info] The NetState phase (0.000000) differed from the phase (1.000000) specified by a rule in layer probt
[2020-06-29 19:49:06.032] [caffe] [info] Initializing net from parameters:
[2020-06-29 19:49:06.032] [caffe] [info] Creating layer / name=data / type=ImageData
[2020-06-29 19:49:06.033] [caffe] [info] Creating Layer data
[2020-06-29 19:49:06.033] [caffe] [info] data -> data
[2020-06-29 19:49:06.033] [caffe] [info] data -> label
[2020-06-29 19:49:06.033] [caffe] [info] Opening file
(end of dd logs)
[2020-06-29 19:43:31.612] [api] [info] Running DeepDetect HTTP server on 0.0.0.0:8080
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] instantiating model template vgg_16
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] source=../templates/caffe//vgg_16/
[2020-06-29 19:43:40.205] [tag_detect_vgg16] [info] dest=/tags_dataset/models/vgg16/vgg_16.prototxt
[2020-06-29 19:43:40.209] [api] [info] 172.17.0.1 "PUT /services/tag_detect_vgg16" 201 758
[2020-06-29 19:43:52.394] [caffe] [info] Opening file /tags_dataset/test.txt
[2020-06-29 19:43:52.394] [caffe] [info] Read 492.000000 images with 0.000000 labels
[2020-06-29 19:43:52.395] [api] [info] 172.17.0.1 "POST /train" 201 1
[2020-06-29 19:43:52.395] [caffe] [info] Opened lmdb /tags_dataset/models/vgg16/test.lmdb
Segmentation fault