jolibrain / deepdetect

Deep Learning API and Server in C++14, with support for PyTorch, TensorRT, Dlib, NCNN, Tensorflow, XGBoost and TSNE
https://www.deepdetect.com/

Cannot indicate a Train and a Test folder using txt connector #348

Closed · YaYaB closed this 7 years ago

YaYaB commented 7 years ago

I am trying to train a network based on the template vdcnn_9 using the dataset you indicated in #340, but I cannot specify my own train and test split.

Your question / the problem you're facing:

I want to train using a dataset that is already split into train and test. However, this does not seem possible with the txt connector. If you pass two folders or files, as you would with the svm connector, it just concatenates them.

Here is a small example.

1. Usual training

First, we create a usual training run with the train/test split managed by DD.

- Service Creation
OUTPUT

DeepDetect [ commit 62ea2f4c84fe2f10ea9f1ed465139e7a1fc0ef8b ]

INFO - 14:40:39 - Running DeepDetect HTTP server on localhost:8100
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0905 14:40:43.922724  2989 caffelib.cc:131] instantiating model template vdcnn_9
I0905 14:40:43.922771  2989 caffelib.cc:135] source=../../templates/caffe//vdcnn_9/
I0905 14:40:43.922777  2989 caffelib.cc:136] dest=/home/test/vdcnn_9.prototxt

INFO - 14:40:43 - Tue Sep  5 14:40:43 2017 UTC - 127.0.0.1 "PUT /services/agnews" 201 1001
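
- Train Launch
The train call itself is not reproduced in the report. A plausible reconstruction, based on the request shown in section 2 but with a single data folder and an explicit `test_split`, would be the following (the data path and the 0.06 value are assumptions; 0.06 matches the 7656/127600 split in the log below):
INPUT

curl -X POST 'http://localhost:8100/train' -d '{"service":"agnews","parameters":{"input":{"connector":"txt","embedding":true,"sentences":true,"db":true,"shuffle":true,"test_split":0.06},"mllib":{"gpu":true,"net":{"batch_size":128,"test_batch_size":128},"solver":{"test_interval":1000,"snapshot":1000,"base_lr":0.01,"solver_type":"ADAM","iterations":25000}},"output":{"measure":["mcll","f1","cmdiag","cmfull"]}},"data":["${PATH_DATA}/agnews"]}'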

OUTPUT

INFO - 14:42:15 - Tue Sep 5 14:42:15 2017 UTC - 127.0.0.1 "POST /train" 201 0
I0905 14:42:15.974807  3013 txtinputfileconn.cc:74] txtinputfileconn: list subdirs size=4 loaded text samples=127600
I0905 14:42:18.429330  3013 txtinputfileconn.cc:192] vocabulary size=0 data split test size=7656 / remaining data size=119944 vocab size=0

INFO - 14:42:18 - db_batchsize=119944 / db_testbatchsize=7656

INFO - 14:42:18 - Opened lmdb /home/test/train.lmdb
...
INFO - 14:42:31 - Opened lmdb /home/test/test.lmdb
INFO - 14:42:31 - Processed 1000 text entries
INFO - 14:42:31 - Processed 2000 text entries
INFO - 14:42:31 - Processed 3000 text entries
INFO - 14:42:31 - Processed 4000 text entries
INFO - 14:42:31 - Processed 5000 text entries
INFO - 14:42:31 - Processed 6000 text entries
INFO - 14:42:31 - Processed 7000 text entries


As you can see, the train set contains ~120k examples and the test set contains ~7.6k examples. Two LMDBs are created, `train.lmdb` and `test.lmdb`.
Training is launched and runs fine.

2. Training with own train and test split

This time we launch a training run with our own pre-existing train and test split.
- Service Creation
INPUT

curl -X PUT 'http://localhost:8100/services/agnews' -d '{"mllib":"caffe","description":"agnews classifier","type":"supervised","parameters":{"input":{"connector":"txt","characters":true,"embedding":true,"sequence":1014},"mllib":{"nclasses":4,"template":"vdcnn_9"}},"model":{"templates":"../../templates/caffe/","repository":"/home/test"}}'


OUTPUT

./dede --port 8100
DeepDetect [ commit 62ea2f4c84fe2f10ea9f1ed465139e7a1fc0ef8b ]

INFO - 14:29:03 - Running DeepDetect HTTP server on localhost:8100
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0905 14:29:26.320225  2632 caffelib.cc:131] instantiating model template vdcnn_9
I0905 14:29:26.320309  2632 caffelib.cc:135] source=../../templates/caffe//vdcnn_9/
I0905 14:29:26.320319  2632 caffelib.cc:136] dest=/home/test/vdcnn_9.prototxt

INFO - 14:29:26 - Tue Sep 5 14:29:26 2017 UTC - 127.0.0.1 "PUT /services/agnews" 201 1506

- Train Launch
The `test_split` parameter defaults to 0, and the two folders are given in the `data` field.
INPUT

curl -X POST 'http://localhost:8100/train' -d '{"service":"agnews","parameters":{"input":{"connector":"txt","embedding":true,"sentences":true,"db":true,"shuffle":true},"mllib":{"gpu":true,"net":{"batch_size":128,"test_batch_size":128},"solver":{"test_interval":1000,"snapshot":1000,"base_lr":0.01,"solver_type":"ADAM","iterations":25000}},"output":{"measure":["mcll","f1","cmdiag","cmfull"]}},"data":["${PATH_DATA}/agnews/split/train","${PATH_DATA}/agnews/split/test"]}'


OUTPUT

INFO - 14:32:55 - Tue Sep 5 14:32:55 2017 UTC - 127.0.0.1 "POST /train" 201 0
I0905 14:32:55.195322 2723 txtinputfileconn.cc:74] txtinputfileconn: list subdirs size=4 loaded text samples=120000
I0905 14:32:57.468423 2723 txtinputfileconn.cc:192] vocabulary size=0
I0905 14:32:57.468482 2723 txtinputfileconn.cc:74] txtinputfileconn: list subdirs size=4 loaded text samples=127600
I0905 14:32:57.586179 2723 txtinputfileconn.cc:192] vocabulary size=0

INFO - 14:32:57 - db_batchsize=127600 / db_testbatchsize=0

INFO - 14:32:57 - Opened lmdb /home/test/train.lmdb etc.



As you can see, the train set now contains 127.6k examples and no test set was created. All the data was concatenated into one big dataset: `train.lmdb`.

If we take a closer look at the code behind the connectors, we can see that the svm connector handles train and test sets explicitly (see line 138 of https://github.com/beniz/deepdetect/blob/master/src/svminputfileconn.h).
However, the txt connector has no such handling: the `read_element` function and its `push_back` calls simply concatenate all the data, nothing more (see lines 266 and 271 of https://github.com/beniz/deepdetect/blob/802f316d9c04c9651e2d18a6f1ddf05d7608e195/src/txtinputfileconn.h).

beniz commented 7 years ago

Thanks for catching this one! Could you verify that PR #350 fixes this issue? I've tested it on the above dataset and it works for me with the PR. Thanks.

YaYaB commented 7 years ago

Thanks for the quick PR. Thanks to it, it is now possible to train by giving a train and a test set; a train.lmdb and a test.lmdb are created accordingly.

However, there are several issues (see the two points below). When building those datasets, DD dumps a file called corresp.txt, which maps our labels to the labels in the lmdb.

- In our case, the train subdirectories and the test subdirectories are not read in the same order. That means the mapping between our labels and the ones in train.lmdb differs from the mapping between our labels and the ones in test.lmdb. Since we train on train.lmdb, its labels effectively define the ground truth; when we test on test.lmdb, which uses a different mapping, we do not get representative results. Thus the metrics during the training phase are not correct (except those computed on the train set, such as train_loss).

- In our case we give a train set, then a test set. DD first dumps the corresp.txt associated with the train data, then replaces this file with the one for the test data. That means we lose the mapping that our model will actually use, and we need that mapping if we want to make predictions.

YaYaB commented 7 years ago

The easiest solution, in my opinion, would be to dump only the corresp.txt file associated with the train data and force the mapping of the test set to be the same as that of the train set.

beniz commented 7 years ago

@YaYaB thanks for the update. Can you test the PR again? The two bullet points above have been dealt with and appear to work for us. I don't think the reading order of repositories can change, but the mapping has been enforced, as it already was in the image connector. Thanks.

YaYaB commented 7 years ago

Thanks for updating the PR. Let me show you what I obtain on my side with corresp.txt. If I launch a training using only the train folder, my corresp file contains the following:

3 2
2 1
1 3
0 4

However, if I now use only the test folder during training, I obtain the following in the corresp file:

3 3
2 4
1 2
0 1

According to this (http://man7.org/linux/man-pages/man3/readdir.3.html):

    The order in which filenames are read by successive calls to
    readdir() depends on the filesystem implementation; it is unlikely
    that the names will be sorted in any fashion.

So I am not surprised to obtain different corresp files with different folders. I am testing your new PR now and will get back to you when it is done.

YaYaB commented 7 years ago

It works well for me! FYI, I tested the test_split argument. It still works: the train set is split according to the value of test_split, and the split-off part is then appended to the test set. Is this something we want? Or would it be better to force test_split to 0 when a test set is indicated?
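
For reference, a call like the following would exercise that behavior: it is the train request from above with a non-zero `test_split` added on top of an explicit test folder (the 0.1 value is purely illustrative):

curl -X POST 'http://localhost:8100/train' -d '{"service":"agnews","parameters":{"input":{"connector":"txt","embedding":true,"sentences":true,"db":true,"shuffle":true,"test_split":0.1},"mllib":{"gpu":true,"net":{"batch_size":128,"test_batch_size":128},"solver":{"test_interval":1000,"base_lr":0.01,"solver_type":"ADAM","iterations":25000}},"output":{"measure":["mcll","f1","cmdiag","cmfull"]}},"data":["${PATH_DATA}/agnews/split/train","${PATH_DATA}/agnews/split/test"]}'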

beniz commented 7 years ago

Yes, the rationale is that the first directory is train and the remaining directories are all test.
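
Concretely, a call of the following shape (a sketch based on the requests above; the `test_a`/`test_b` folder names are hypothetical) would use the first folder for training and treat all remaining folders as test data:

curl -X POST 'http://localhost:8100/train' -d '{"service":"agnews","parameters":{"input":{"connector":"txt","embedding":true,"sentences":true,"db":true,"shuffle":true},"mllib":{"gpu":true,"net":{"batch_size":128,"test_batch_size":128},"solver":{"test_interval":1000,"base_lr":0.01,"solver_type":"ADAM","iterations":25000}},"output":{"measure":["mcll","f1","cmdiag","cmfull"]}},"data":["${PATH_DATA}/agnews/split/train","${PATH_DATA}/agnews/split/test_a","${PATH_DATA}/agnews/split/test_b"]}'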

beniz commented 7 years ago

Thanks for the thorough report. Fix has been merged.