Closed vadamoto closed 8 years ago
Make sure you are correctly specifying the alphabet and the other required values, as needed in your PUT /services call, and see https://github.com/beniz/deepdetect/issues/145#issuecomment-225829142
Hello, I am getting an InternalError when requesting /predict.
My service creation call looks like this:
2016-07-01 17:10:19.552387 PUT /services/net9
{"model": {"templates": "../templates/caffe/", "repository": "/ddroot/net9_model"}, "type": "supervised", "description": "character-based classifier", "parameters": {"input": {"connector": "txt", "alphabet": "abcdefghijklmnopqrstuvwxyz0123456789,;.!?'\"#\"/\\|_@#$%^&*~`+-=<>\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451", "sequence": 140, "characters": true, "sentences": true}, "mllib": {"layers": ["1CR256", "1CR256", "4CR256", "1024", "1024"], "dropout": 0.5, "db": true, "nclasses": 9, "template": "convnet"}, "output": {}}, "mllib": "caffe"}
2016-07-01 17:10:19.771427 {"status":{"code":201,"msg":"Created"}}
and my prediction call looks like this:
2016-07-01 17:11:07.513249 curl -X POST 'http://localhost:8080/predict' -d '{"data": ["/ddroot/test_file.net9.utf.txt"], "parameters": {"input": {"alphabet": "abcdefghijklmnopqrstuvwxyz0123456789,;.!?'\"#\"/\\|_@#$%^&*~`+-=<>\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451", "sentences": true}, "mllib": {"gpu": false, "db": false}, "output": {"measure": ["f1"]}}, "service": "net9"}'
{u'status': {u'msg': u'InternalError', u'dd_msg': u'src/caffe/blob.cpp:35 / Check failed (custom): (shape[i]) <= (2147483647 / count_)', u'code': 500, u'dd_code': 1007}}
This is usually Caffe failing its check on the maximum size of some blob of data in the net. Posting the server output, which lists all of the net's layers and their allocated memory, along with the error reported on the server side, would be useful.
Now, is this your own trained model, or are you using a pre-trained model? If you did train your own model, please post the PUT /services and POST /train calls used for the training phase.
Some of the things you can try in order to debug your service are reducing the size of the alphabet and/or the 'sequence' value. Another thing I'd try is removing the non-ASCII (UTF) characters to see whether something is wrong on that side. Of course, all of this needs to be done at both training and prediction times.
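For instance, something along these lines, as a sketch only (the net9_debug service name, the repository path, and the reduced alphabet and sequence values are illustrative; the sequence still has to stay large enough for the convolution/pooling stack in the chosen layers, and the same reduced values then need to be used in the /train and /predict calls):
$ curl -X PUT 'http://localhost:8080/services/net9_debug' -d '{"mllib":"caffe","description":"character-based classifier, reduced alphabet and sequence for debugging","type":"supervised","parameters":{"input":{"connector":"txt","characters":true,"sentences":true,"alphabet":"abcdefghijklmnopqrstuvwxyz0123456789","sequence":100},"mllib":{"template":"convnet","layers":["1CR256","1CR256","4CR256","1024","1024"],"nclasses":9,"db":true,"dropout":0.5}},"model":{"templates":"../templates/caffe/","repository":"/ddroot/net9_debug_model"}}'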
10.60.129.247 beniz/deepdetect_cpu
libdc1394 error: Failed to initialize libdc1394
DeepDetect [ commit 34f5f5df8b58be0b5aba79104db508e2be5e8aa2 ]
Running DeepDetect HTTP server on 0.0.0.0:8080
instantiating model template convnet
[14:37:29] /opt/deepdetect/src/caffelib.cc:76: source=../templates/caffe/convnet/
[14:37:29] /opt/deepdetect/src/caffelib.cc:77: dest=/ddroot/net9_model/convnet.prototxt
INFO - 14:37:53 - Initializing net from parameters:
INFO - 14:37:53 - Creating layer data
INFO - 14:37:53 - Creating Layer data
INFO - 14:37:53 - data -> data
INFO - 14:37:53 - data -> label
INFO - 14:37:53 - Setting up data
INFO - 14:37:53 - Top shape: 64 1 140 95 (851200)
INFO - 14:37:53 - Top shape: 64 (64)
INFO - 14:37:53 - Memory required for data: 3405056
INFO - 14:37:53 - Creating layer conv0
INFO - 14:37:53 - Creating Layer conv0
INFO - 14:37:53 - conv0 <- data
INFO - 14:37:53 - conv0 -> conv0
INFO - 14:37:53 - Setting up conv0
INFO - 14:37:53 - Top shape: 64 256 134 1 (2195456)
INFO - 14:37:53 - Memory required for data: 12186880
INFO - 14:37:53 - Creating layer act0
INFO - 14:37:53 - Creating Layer act0
INFO - 14:37:53 - act0 <- conv0
INFO - 14:37:53 - act0 -> conv0 (in-place)
INFO - 14:37:53 - Setting up act0
INFO - 14:37:53 - Top shape: 64 256 134 1 (2195456)
INFO - 14:37:53 - Memory required for data: 20968704
INFO - 14:37:53 - Creating layer pool0
INFO - 14:37:53 - Creating Layer pool0
INFO - 14:37:53 - pool0 <- conv0
INFO - 14:37:53 - pool0 -> pool0
INFO - 14:37:53 - Setting up pool0
INFO - 14:37:53 - Top shape: 64 256 45 1 (737280)
INFO - 14:37:53 - Memory required for data: 23917824
INFO - 14:37:53 - Creating layer conv1
INFO - 14:37:53 - Creating Layer conv1
INFO - 14:37:53 - conv1 <- pool0
INFO - 14:37:53 - conv1 -> conv1
INFO - 14:37:53 - Setting up conv1
INFO - 14:37:53 - Top shape: 64 256 39 1 (638976)
INFO - 14:37:53 - Memory required for data: 26473728
INFO - 14:37:53 - Creating layer act1
INFO - 14:37:53 - Creating Layer act1
INFO - 14:37:53 - act1 <- conv1
INFO - 14:37:53 - act1 -> conv1 (in-place)
INFO - 14:37:53 - Setting up act1
INFO - 14:37:53 - Top shape: 64 256 39 1 (638976)
INFO - 14:37:53 - Memory required for data: 29029632
INFO - 14:37:53 - Creating layer pool1
INFO - 14:37:53 - Creating Layer pool1
INFO - 14:37:53 - pool1 <- conv1
INFO - 14:37:53 - pool1 -> pool1
INFO - 14:37:53 - Setting up pool1
INFO - 14:37:53 - Top shape: 64 256 13 1 (212992)
INFO - 14:37:53 - Memory required for data: 29881600
INFO - 14:37:53 - Creating layer conv2
INFO - 14:37:53 - Creating Layer conv2
INFO - 14:37:53 - conv2 <- pool1
INFO - 14:37:53 - conv2 -> conv2
INFO - 14:37:53 - Setting up conv2
INFO - 14:37:53 - Top shape: 64 256 11 1 (180224)
INFO - 14:37:53 - Memory required for data: 30602496
INFO - 14:37:53 - Creating layer act2
INFO - 14:37:53 - Creating Layer act2
INFO - 14:37:53 - act2 <- conv2
INFO - 14:37:53 - act2 -> conv2 (in-place)
INFO - 14:37:53 - Setting up act2
INFO - 14:37:53 - Top shape: 64 256 11 1 (180224)
INFO - 14:37:53 - Memory required for data: 31323392
INFO - 14:37:53 - Creating layer conv3
INFO - 14:37:53 - Creating Layer conv3
INFO - 14:37:53 - conv3 <- conv2
INFO - 14:37:53 - conv3 -> conv3
INFO - 14:37:53 - Setting up conv3
INFO - 14:37:53 - Top shape: 64 256 9 1 (147456)
INFO - 14:37:53 - Memory required for data: 31913216
INFO - 14:37:53 - Creating layer act3
INFO - 14:37:53 - Creating Layer act3
INFO - 14:37:53 - act3 <- conv3
INFO - 14:37:53 - act3 -> conv3 (in-place)
INFO - 14:37:53 - Setting up act3
INFO - 14:37:53 - Top shape: 64 256 9 1 (147456)
INFO - 14:37:53 - Memory required for data: 32503040
INFO - 14:37:53 - Creating layer conv4
INFO - 14:37:53 - Creating Layer conv4
INFO - 14:37:53 - conv4 <- conv3
INFO - 14:37:53 - conv4 -> conv4
INFO - 14:37:53 - Setting up conv4
INFO - 14:37:53 - Top shape: 64 256 7 1 (114688)
INFO - 14:37:53 - Memory required for data: 32961792
INFO - 14:37:53 - Creating layer act4
INFO - 14:37:53 - Creating Layer act4
INFO - 14:37:53 - act4 <- conv4
INFO - 14:37:53 - act4 -> conv4 (in-place)
INFO - 14:37:53 - Setting up act4
INFO - 14:37:53 - Top shape: 64 256 7 1 (114688)
INFO - 14:37:53 - Memory required for data: 33420544
INFO - 14:37:53 - Creating layer conv5
INFO - 14:37:53 - Creating Layer conv5
INFO - 14:37:53 - conv5 <- conv4
INFO - 14:37:53 - conv5 -> conv5
INFO - 14:37:53 - Setting up conv5
INFO - 14:37:53 - Top shape: 64 256 5 1 (81920)
INFO - 14:37:53 - Memory required for data: 33748224
INFO - 14:37:53 - Creating layer act5
INFO - 14:37:53 - Creating Layer act5
INFO - 14:37:53 - act5 <- conv5
INFO - 14:37:53 - act5 -> conv5 (in-place)
INFO - 14:37:53 - Setting up act5
INFO - 14:37:53 - Top shape: 64 256 5 1 (81920)
INFO - 14:37:53 - Memory required for data: 34075904
INFO - 14:37:53 - Creating layer pool2
INFO - 14:37:53 - Creating Layer pool2
INFO - 14:37:53 - pool2 <- conv5
INFO - 14:37:53 - pool2 -> pool2
INFO - 14:37:53 - Setting up pool2
INFO - 14:37:53 - Top shape: 64 256 2 1 (32768)
INFO - 14:37:53 - Memory required for data: 34206976
INFO - 14:37:53 - Creating layer resphape0
INFO - 14:37:53 - Creating Layer resphape0
INFO - 14:37:53 - resphape0 <- pool2
INFO - 14:37:53 - resphape0 -> reshape0
INFO - 14:37:53 - Setting up resphape0
INFO - 14:37:53 - Top shape: 64 512 (32768)
INFO - 14:37:53 - Memory required for data: 34338048
INFO - 14:37:53 - Creating layer reshape0_resphape0_0_split
INFO - 14:37:53 - Creating Layer reshape0_resphape0_0_split
INFO - 14:37:53 - reshape0_resphape0_0_split <- reshape0
INFO - 14:37:53 - reshape0_resphape0_0_split -> reshape0_resphape0_0_split_0
INFO - 14:37:53 - reshape0_resphape0_0_split -> reshape0_resphape0_0_split_1
INFO - 14:37:53 - Setting up reshape0_resphape0_0_split
INFO - 14:37:53 - Top shape: 64 512 (32768)
INFO - 14:37:53 - Top shape: 64 512 (32768)
INFO - 14:37:53 - Memory required for data: 34600192
INFO - 14:37:53 - Creating layer ip3
INFO - 14:37:53 - Creating Layer ip3
INFO - 14:37:53 - ip3 <- reshape0_resphape0_0_split_0
INFO - 14:37:53 - ip3 -> ip3
INFO - 14:37:53 - Setting up ip3
INFO - 14:37:53 - Top shape: 64 1024 (65536)
INFO - 14:37:53 - Memory required for data: 34862336
INFO - 14:37:53 - Creating layer act7
INFO - 14:37:53 - Creating Layer act7
INFO - 14:37:53 - act7 <- ip3
INFO - 14:37:53 - act7 -> ip3 (in-place)
INFO - 14:37:53 - Setting up act7
INFO - 14:37:53 - Top shape: 64 1024 (65536)
INFO - 14:37:53 - Memory required for data: 35124480
INFO - 14:37:53 - Creating layer ip4
INFO - 14:37:53 - Creating Layer ip4
INFO - 14:37:53 - ip4 <- reshape0_resphape0_0_split_1
INFO - 14:37:53 - ip4 -> ip4
INFO - 14:37:53 - Setting up ip4
INFO - 14:37:53 - Top shape: 64 1024 (65536)
INFO - 14:37:53 - Memory required for data: 35386624
INFO - 14:37:53 - Creating layer act8
INFO - 14:37:53 - Creating Layer act8
INFO - 14:37:53 - act8 <- ip4
INFO - 14:37:53 - act8 -> ip4 (in-place)
INFO - 14:37:53 - Setting up act8
INFO - 14:37:53 - Top shape: 64 1024 (65536)
INFO - 14:37:53 - Memory required for data: 35648768
INFO - 14:37:53 - Creating layer ip5
INFO - 14:37:53 - Creating Layer ip5
INFO - 14:37:53 - ip5 <- ip4
INFO - 14:37:53 - ip5 -> ip5
INFO - 14:37:53 - Setting up ip5
INFO - 14:37:53 - Top shape: 64 9 (576)
INFO - 14:37:53 - Memory required for data: 35651072
INFO - 14:37:53 - Creating layer loss
INFO - 14:37:53 - Creating Layer loss
INFO - 14:37:53 - loss <- ip5
INFO - 14:37:53 - loss -> loss
INFO - 14:37:53 - Setting up loss
INFO - 14:37:53 - Top shape: 64 9 (576)
INFO - 14:37:53 - Memory required for data: 35653376
INFO - 14:37:53 - loss does not need backward computation.
INFO - 14:37:53 - ip5 does not need backward computation.
INFO - 14:37:53 - act8 does not need backward computation.
INFO - 14:37:53 - ip4 does not need backward computation.
INFO - 14:37:53 - act7 does not need backward computation.
INFO - 14:37:53 - ip3 does not need backward computation.
INFO - 14:37:53 - reshape0_resphape0_0_split does not need backward computation.
INFO - 14:37:53 - resphape0 does not need backward computation.
INFO - 14:37:53 - pool2 does not need backward computation.
INFO - 14:37:53 - act5 does not need backward computation.
INFO - 14:37:53 - conv5 does not need backward computation.
INFO - 14:37:53 - act4 does not need backward computation.
INFO - 14:37:53 - conv4 does not need backward computation.
INFO - 14:37:53 - act3 does not need backward computation.
INFO - 14:37:53 - conv3 does not need backward computation.
INFO - 14:37:53 - act2 does not need backward computation.
INFO - 14:37:53 - conv2 does not need backward computation.
INFO - 14:37:53 - pool1 does not need backward computation.
INFO - 14:37:53 - act1 does not need backward computation.
INFO - 14:37:53 - conv1 does not need backward computation.
INFO - 14:37:53 - pool0 does not need backward computation.
INFO - 14:37:53 - act0 does not need backward computation.
INFO - 14:37:53 - conv0 does not need backward computation.
INFO - 14:37:53 - data does not need backward computation.
INFO - 14:37:53 - This network produces output ip3
INFO - 14:37:53 - This network produces output label
[14:37:53] /opt/deepdetect/src/caffelib.cc:1188: Using pre-trained weights from /ddroot/net9_model/model_iter_20000.caffemodel
INFO - 14:37:53 - This network produces output loss
INFO - 14:37:53 - Network initialization done.
INFO - 14:37:53 - Ignoring source layer reshape0
INFO - 14:37:53 - Ignoring source layer reshape0_reshape0_0_split
INFO - 14:37:53 - Ignoring source layer drop3
list subdirs size=9
loaded text samples=105090
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0701 14:37:56.329766 8 txtinputfileconn.cc:186] vocabulary size=0
2016-05-04 18:13:53.694079 PUT /services/net9
{"model": {"templates": "../templates/caffe/", "repository": "/ddroot/net9_model"}, "type": "supervised", "description": "character-based classifier", "parameters": {"input": {"connector": "txt", "alphabet": "abcdefghijklmnopqrstuvwxyz0123456789,;.!?'\"#\"/\\|_@#$%^&*~`+-=<>\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451", "sequence": 140, "characters": true, "sentences": true}, "mllib": {"layers": ["1CR256", "1CR256", "4CR256", "1024", "1024"], "dropout": 0.5, "db": true, "nclasses": 9, "template": "convnet"}, "output": {}}, "mllib": "caffe"}
2016-05-04 18:13:53.867046 {"status":{"code":201,"msg":"Created"}}
2016-05-04 18:13:53.867462 curl -X POST 'http://localhost:8080/train' -d '{"async": true, "data": ["/ddroot/train_file.net9.utf.txt"], "parameters": {"input": {"shuffle": true, "sequence": 140, "test_split": 0.01, "db": true, "characters": true, "sentences": true}, "mllib": {"gpu": true, "net": {"test_batch_size": 128, "batch_size": 128}, "solver": {"test_initialization": true, "stepsize": 15000, "base_lr": 0.01, "iterations": 20000, "test_interval": 1000, "weight_decay": 1e-05, "snapshot": 20000, "lr_policy": "step", "solver_type": "SGD", "gamma": 0.5, "iter_size": 1}, "resume": false}, "output": {"measure": ["mcll", "f1"]}}, "service": "net9"}'
Can you confirm that the training did complete successfully? It looks like a very large net for only 140-character sequences. If unsure, just list the content of your net9_model repository. Also, since this looks like a configuration issue, you may want to join us on Gitter so that we can switch to live chat.
$ find /ddroot/net9_model -type f | xargs ls -lsa
8 -rw-r--r-- 1 dd dd 4429 Jul 1 14:37 /ddroot/net9_model/convnet.prototxt
4 -rw-r--r-- 1 dd dd 216 Jul 1 14:37 /ddroot/net9_model/convnet_solver.prototxt
4 -rw-r--r-- 1 dd dd 63 May 4 15:13 /ddroot/net9_model/corresp.txt
4 -rw-r--r-- 1 dd dd 3797 Jul 1 14:37 /ddroot/net9_model/deploy.prototxt
12 -rw-r--r-- 1 dd dd 8360 Jul 1 14:37 /ddroot/net9_model/model.json
9680 -rw-r--r-- 1 dd dd 9908972 May 8 12:25 /ddroot/net9_model/model_iter_20000.caffemodel
9676 -rw-r--r-- 1 dd dd 9907553 May 8 12:25 /ddroot/net9_model/model_iter_20000.solverstate
107120 -rw-r--r-- 1 dd dd 109690880 May 4 15:16 /ddroot/net9_model/test.lmdb/data.mdb
4 -rw-r--r-- 1 dd dd 8192 May 8 12:27 /ddroot/net9_model/test.lmdb/lock.mdb
10597572 -rw-r--r-- 1 dd dd 10851909632 May 4 15:16 /ddroot/net9_model/train.lmdb/data.mdb
4 -rw-r--r-- 1 dd dd 8192 May 8 12:25 /ddroot/net9_model/train.lmdb/lock.mdb
0 -rw-r--r-- 1 dd dd 0 May 4 15:13 /ddroot/net9_model/vocab.dat
Your model is two months older than your model definition files; are you trying to use a pre-trained model? Can you explain what you are doing? Also, your training call has a single file in data, which seems weird, though this is possibly unrelated to your present issue.
Overall, my hunch is that you are somehow using too large a net, and I'm not sure what you are training on. I'd recommend you try with some of the example datasets, see if you can reproduce the error, and then lay out the steps so that we can help you further.
Your model is two months older than your model definition files; are you trying to use a pre-trained model?
Yes, I trained it two months ago. I couldn't run the predict phase because I had only 8 GB of RAM, which I understood wasn't enough, judging by the error messages I was receiving. Now I have 16 GB installed, so I tried to run prediction again.
Also, your training call has a single file in data, which seems weird, though this is possibly unrelated to your present issue.
Yes, it is named weirdly: "data": ["/ddroot/train_file.net9.utf.txt"], but it is actually a directory with the following structure:
$ find train_file.net9.utf.txt
train_file.net9.utf.txt
train_file.net9.utf.txt/5508
train_file.net9.utf.txt/5508/data.txt
train_file.net9.utf.txt/5506
train_file.net9.utf.txt/5506/data.txt
train_file.net9.utf.txt/5503
train_file.net9.utf.txt/5503/data.txt
train_file.net9.utf.txt/5502
train_file.net9.utf.txt/5502/data.txt
train_file.net9.utf.txt/5517
train_file.net9.utf.txt/5517/data.txt
train_file.net9.utf.txt/3104
train_file.net9.utf.txt/3104/data.txt
train_file.net9.utf.txt/5505
train_file.net9.utf.txt/5505/data.txt
train_file.net9.utf.txt/5507
train_file.net9.utf.txt/5507/data.txt
train_file.net9.utf.txt/5511
train_file.net9.utf.txt/5511/data.txt
I'd recommend you try with some of the example datasets, see if you can reproduce the error, and then lay out the steps so that we can help you further.
Thanks
Yes, I trained it two months ago. I couldn't run the predict phase because I had only 8 GB of RAM, which I understood wasn't enough, judging by the error messages I was receiving.
My current understanding is that it is impossible for the net at prediction time to eat more RAM than the same net at training time, given the same batch size. So unless you've tried to predict over way too many samples at once, I don't believe RAM was the issue back then either.
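As a rough check against the layer shapes in the log above (batch size 64, sequence 140, one-hot dimension 95), the input data blob alone only takes a few MB:
$ echo $(( 64 * 1 * 140 * 95 * 4 + 64 * 4 ))   # 3405056 bytes, i.e. the first "Memory required for data" line in the log above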
Posting all your prototxt files in a gist for me to look at may help...
https://gist.github.com/vadamoto/9c5124974de793fc0e622c90279789a3
I thought training succeeded only because I was using db=True mode. I tried prediction with and without db mode, no luck.
Also, I tried reducing sequence from 140 to 100, and it trained in only two days without db=True mode; everything looks okay, but I'm getting the very same error message when trying to /predict.
Have you tried prediction with one of the pre-trained models? Is it working?
No :(
curl -X PUT 'http://localhost:8080/services/sent_en' -d '{"mllib":"caffe","description":"English sentiment classification","type":"supervised","parameters":{"input":{"connector":"txt","characters":true,"alphabet":"abcdefghijklmnopqrstuvwxyz0123456789,;.!?'\''","sequence":140},"mllib":{"nclasses":2}},"model":{"repository":"/ddroot/sent_en_char"}}'
{"status":{"code":500,"msg":"InternalError","dd_code":1007,"dd_msg":"/opt/deepdetect/build/caffe_dd/src/caffe_dd/include/caffe/llogging.h:158 / Fatal Caffe error"}}
UPDATE: created. I needed to add the "template" key to the line copy-pasted from the help page:
$ curl -X PUT 'http://localhost:8080/services/sent_en' -d '{"mllib":"caffe","description":"English sentiment classification","type":"supervised","parameters":{"input":{"connector":"txt","characters":true,"alphabet":"abcdefghijklmnopqrstuvwxyz0123456789,;.!?'\''","sequence":140},"mllib":{"nclasses":2, "gpu":false, "template":"convnet"}},"model":{"templates": "../templates/caffe/", "repository":"/ddroot/sent_en_char"}}'
{"status":{"code":201,"msg":"Created"}}
Now /predict returns the same error:
$ curl -X POST 'http://localhost:8080/predict' -d '{"service":"sent_en","parameters":{"mllib":{"gpu":true}},"data":["Chilling in the West Indies"]}'
{"status":{"code":500,"msg":"InternalError","dd_code":1007,"dd_msg":"/opt/deepdetect/build/caffe_dd/src/caffe_dd/include/caffe/llogging.h:158 / Fatal Caffe error"}}
I needed to add the "template" key to the line copy-pasted from the help page
No, you don't. The pre-trained model already contains this information. Have you downloaded the sent_en_char.tar.bz2 tarball?
I've just tested it and it works fine, including prediction.
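For reference, a minimal setup sketch, assuming the tarball was downloaded from the DeepDetect site and unpacks into a sent_en_char directory (all paths here are placeholders):
$ mkdir -p /ddroot && tar xjf sent_en_char.tar.bz2 -C /ddroot
$ ls /ddroot/sent_en_char   # the pre-trained caffemodel and prototxt files should end up here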
Of course I downloaded the tarball. But you are right: I redid all the steps, and now sent_en is working (although it predicts "negative" for "Chilling in the West Indies"). I think there may have been some Docker-related issues. Going to try training and predicting on my net9 dataset again.
I think this is the case described here: https://github.com/BVLC/caffe/issues/3084
I managed to get it working for a small number of test samples (e.g. 22000 samples), but my initial test dataset has about 100000 samples, and I'm getting the same error message as stated in this issue's subject.
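That would be consistent with the numbers here, assuming prediction without a db packs all test samples into a single batch: per the log above, conv0's output alone is 256 x 134 elements per sample, so the int32 limit in the check is hit somewhere around 62,000 samples:
$ echo $(( 100000 * 256 * 134 ))        # ~3.43e9 elements, above the 2147483647 limit in the check
$ echo $(( 2147483647 / (256 * 134) ))  # ~62601, roughly the sample count where a single-batch blob overflows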
Try setting the test_batch_size parameter to a low enough value, then pass all your test samples.
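A sketch, with test_batch_size placed under parameters.mllib.net as in the training call above (128 is just a starting point; the alphabet is omitted here for brevity, but keep using the same one as at training time):
$ curl -X POST 'http://localhost:8080/predict' -d '{"service":"net9","data":["/ddroot/test_file.net9.utf.txt"],"parameters":{"input":{"sentences":true},"mllib":{"gpu":false,"net":{"test_batch_size":128}},"output":{"measure":["f1"]}}}'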
Seems to be working with test_batch_size 128, thank you!