Closed YaYaB closed 7 years ago
Thanks for catching this one! Could you verify that PR #350 fixes this issue? I've tested it on the above dataset and it works for me with the PR. Thanks.
Thanks for the quick PR. Thanks to it, it's now possible to train by giving a train and a test set. It creates a train.lmdb and a test.lmdb accordingly.
However there are several issues.
When building those datasets, it dumps a file called corresp.txt. This file is the map between our labels and the labels in the lmdb.
In our case the train subdirectories and the test subdirectories are not read in the same order. That means the mapping between our labels and those in train.lmdb differs from the mapping between our labels and those in test.lmdb. Since we train on train.lmdb, its labels are effectively the ground truth, so testing on test.lmdb, which uses a different mapping, does not give representative results. The metrics reported during the training phase are therefore not correct (except those computed on the train set, such as train_loss).
correct
corresp file
In fact DD dumps a corresp.txt each time it reads a directory (line 225 in https://github.com/beniz/deepdetect/blob/a977aedf579e9c2d482c0fc8def92000c24644e9/src/caffeinputconns.cc). However, _correspname is fixed (line 352 in https://github.com/beniz/deepdetect/blob/5067658a69deeda46e06f14b0527180f59d018a1/src/txtinputfileconn.h).
In our case we give a train set, then a test set. DD first dumps the corresp.txt associated with the train data, then replaces that file with the one for the test data. That means we lose the mapping that our model will use. If we want to make a prediction, we need that mapping.
The easiest solution, in my opinion, would be to only dump the corresp.txt file associated to the train file and force the mapping of the test set to be similar to the one of the train set.
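As a rough illustration of that proposal (not DD's actual implementation), the sketch below builds the label mapping once from the train directory, writes a single corresp file, and looks up test classes in that same mapping. The function names are hypothetical, and the `<label id> <class directory>` file layout is an assumption based on the corresp listings in this thread.

```python
import os


def build_label_map(train_dir):
    """Give every class subdirectory of train_dir an integer id, in sorted name order."""
    classes = sorted(
        d for d in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, d))
    )
    return {name: idx for idx, name in enumerate(classes)}


def write_corresp(label_map, path):
    """Write one '<id> <class directory>' line per class (assumed layout)."""
    with open(path, "w") as f:
        for name, idx in sorted(label_map.items(), key=lambda kv: kv[1]):
            f.write(f"{idx} {name}\n")


def label_for(label_map, class_name):
    """Resolve a test-set class through the train mapping; unknown classes are an error."""
    if class_name not in label_map:
        raise KeyError(f"class {class_name!r} is not present in the train set")
    return label_map[class_name]
```

With this shape, the test set never produces its own mapping, so the corresp file on disk always matches the labels the model was trained on.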
@YaYaB thanks for the update. Can you test the PR again? The two bullet points above have been dealt with and appear to work for us. I don't think the reading order of repositories can change, but the mapping is now enforced, as it already was in the image connector. Thanks.
Thanks for the update of the PR. Let me show you what I obtain on my side with the corresp.txt. If I launch a training using only the train folder my corresp will contain the following:
3 2
2 1
1 3
0 4
However if now I use only the test folder during training I obtain the following inside the corresp file:
3 3
2 4
1 2
0 1
According to this (http://man7.org/linux/man-pages/man3/readdir.3.html):
The order in which filenames are read by successive calls to
readdir() depends on the filesystem implementation; it is unlikely
that the names will be sorted in any fashion.
So I am not surprised to obtain different corresp files for different folders. I am testing your new PR now and will get back to you when it's done.
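To make the disagreement between the two listings quoted in this comment concrete, here is a small sketch that parses both and reports every class whose label differs. It assumes the first column is the label stored in the lmdb and the second is the class directory name; the helper names are made up for illustration.

```python
TRAIN_CORRESP = """\
3 2
2 1
1 3
0 4
"""

TEST_CORRESP = """\
3 3
2 4
1 2
0 1
"""


def parse_corresp(text):
    """Return {class directory: lmdb label} from '<label> <class directory>' lines."""
    mapping = {}
    for line in text.strip().splitlines():
        label, name = line.split(maxsplit=1)
        mapping[name] = int(label)
    return mapping


def mismatches(train_map, test_map):
    """Return {class: (train label, test label)} for classes mapped differently."""
    shared = sorted(train_map.keys() & test_map.keys())
    return {
        name: (train_map[name], test_map[name])
        for name in shared
        if train_map[name] != test_map[name]
    }


if __name__ == "__main__":
    diff = mismatches(parse_corresp(TRAIN_CORRESP), parse_corresp(TEST_CORRESP))
    for name, (train_label, test_label) in diff.items():
        print(f"class {name}: train label {train_label}, test label {test_label}")
```

Under that column assumption, all four classes receive a different label in the two files, which is why test metrics computed against the train labels would be meaningless.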
It works well for me! FYI, I tested the test_split argument. Basically it still works: the train set is split according to the value of test_split and the held-out part is then appended to the test set. Is that something we want? Or is it better to force test_split to 0 when a test set is indicated?
Yes, the rationale is that the first directory is train, and the remaining directories are all test.
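The behaviour discussed here can be pictured with a short sketch. It is only a model of the description in this thread, not DD's code: the first source is treated as train, every other source as test, and when `test_split` is non-zero a fraction of the train samples is moved to the test set. The function name and the choice to hold out the tail of the train list are assumptions made for illustration.

```python
def assign_sets(sources, test_split=0.0):
    """Split a list of sample lists into (train, test).

    sources[0] is the train data; all remaining sources are test data.
    A fraction test_split of the train samples is appended to the test set.
    """
    if not sources:
        raise ValueError("at least one data source is required")
    if not 0.0 <= test_split < 1.0:
        raise ValueError("test_split must be in [0, 1)")

    train = list(sources[0])
    test = [sample for source in sources[1:] for sample in source]

    # Hold out the tail of the train samples (illustrative choice, no shuffling).
    n_held_out = round(len(train) * test_split)
    if n_held_out:
        test.extend(train[-n_held_out:])
        train = train[:-n_held_out]
    return train, test
```

For example, `assign_sets([list(range(10)), [100, 101]], test_split=0.2)` leaves 8 samples in train and 4 in test, while `test_split=0.0` keeps the user-provided split untouched.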
Thanks for the thorough report. Fix has been merged.
I am trying to train a network based on the template vdcnn_9 using the dataset you indicated in #340, but I cannot specify my own split of train and test.
Configuration
Master, 62ea2f4c84fe2f10ea9f1ed465139e7a1fc0ef8b
Your question / the problem you're facing:
I want to train using a dataset that is already split into train and test. However, this does not seem possible with the txt connector. If you pass two folders or files, as you would with the svm connector, it just concatenates them in our case.
Here is a small example.
PATH_DATA will represent the path to the agnews extracted folder.
Then we will just create a usual training with the split managed by DD.
OUTPUT
INFO - 14:42:15 - Tue Sep 5 14:42:15 2017 UTC - 127.0.0.1 "POST /train" 201 0
I0905 14:42:15.974807 3013 txtinputfileconn.cc:74] txtinputfileconn: list subdirs size=4 loaded text samples=127600
I0905 14:42:18.429330 3013 txtinputfileconn.cc:192] vocabulary size=0 data split test size=7656 / remaining data size=119944 vocab size=0
INFO - 14:42:18 - db_batchsize=119944 / db_testbatchsize=7656
INFO - 14:42:18 - Opened lmdb /home/test/train.lmdb
...
INFO - 14:42:31 - Opened lmdb /home/test/test.lmdb
INFO - 14:42:31 - Processed 1000 text entries
INFO - 14:42:31 - Processed 2000 text entries
INFO - 14:42:31 - Processed 3000 text entries
INFO - 14:42:31 - Processed 4000 text entries
INFO - 14:42:31 - Processed 5000 text entries
INFO - 14:42:31 - Processed 6000 text entries
INFO - 14:42:31 - Processed 7000 text entries
curl -X PUT 'http://localhost:8100/services/agnews' -d '{"mllib":"caffe","description":"agnews classifier","type":"supervised","parameters":{"input":{"connector":"txt","characters":true,"embedding":true,"sequence":1014},"mllib":{"nclasses":4,"template":"vdcnn_9"}},"model":{"templates":"../../templates/caffe/","repository":"/home/test"}}'
./dede --port 8100
DeepDetect [ commit 62ea2f4c84fe2f10ea9f1ed465139e7a1fc0ef8b ]
INFO - 14:29:03 - Running DeepDetect HTTP server on localhost:8100
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0905 14:29:26.320225 2632 caffelib.cc:131] instantiating model template vdcnn_9
I0905 14:29:26.320309 2632 caffelib.cc:135] source=../../templates/caffe//vdcnn_9/
I0905 14:29:26.320319 2632 caffelib.cc:136] dest=/home/test/vdcnn_9.prototxt
INFO - 14:29:26 - Tue Sep 5 14:29:26 2017 UTC - 127.0.0.1 "PUT /services/agnews" 201 1506
curl -X POST 'http://localhost:8100/train' -d '{"service":"agnews","parameters":{"input":{"connector":"txt","embedding":true,"sentences":true,"db":true,"shuffle":true},"mllib":{"gpu":true,"net":{"batch_size":128,"test_batch_size":128},"solver":{"test_interval":1000,"snpashot":1000,"base_lr":0.01,"solver_type":"ADAM","iterations":25000}},"output":{"measure":["mcll","f1","cmdiag","cmfull"]}},"data":["${PATH_DATA}/agnews/split/train","${PATH_DATA}/agnews/split/test"]}'
INFO - 14:32:55 - Tue Sep 5 14:32:55 2017 UTC - 127.0.0.1 "POST /train" 201 0
I0905 14:32:55.195322 2723 txtinputfileconn.cc:74] txtinputfileconn: list subdirs size=4 loaded text samples=120000
I0905 14:32:57.468423 2723 txtinputfileconn.cc:192] vocabulary size=0
I0905 14:32:57.468482 2723 txtinputfileconn.cc:74] txtinputfileconn: list subdirs size=4 loaded text samples=127600
I0905 14:32:57.586179 2723 txtinputfileconn.cc:192] vocabulary size=0
INFO - 14:32:57 - db_batchsize=127600 / db_testbatchsize=0
INFO - 14:32:57 - Opened lmdb /home/test/train.lmdb etc.