jolibrain / deepdetect

Deep Learning API and Server in C++14, with support for PyTorch, TensorRT, Dlib, NCNN, Tensorflow, XGBoost and TSNE
https://www.deepdetect.com/

Cannot indicate a Train and a Test folder using txt connector #348

Closed · YaYaB closed this 7 years ago

YaYaB commented 7 years ago

I am trying to train a network based on the template vdcnn_9 using the dataset you indicated in #340, but I cannot specify my own train and test split.

Your question / the problem you're facing:

I want to train using a dataset that is already split into train and test. However, this does not seem possible with the txt connector. If you pass two folders or files, as you would with the svm connector, it just concatenates them.

Here is a small example.

1. Usual training

First, we create a usual training run with the train/test split managed by DD.

- Service Creation
OUTPUT

DeepDetect [ commit 62ea2f4c84fe2f10ea9f1ed465139e7a1fc0ef8b ]

INFO - 14:40:39 - Running DeepDetect HTTP server on localhost:8100
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0905 14:40:43.922724  2989 caffelib.cc:131] instantiating model template vdcnn_9
I0905 14:40:43.922771  2989 caffelib.cc:135] source=../../templates/caffe//vdcnn_9/
I0905 14:40:43.922777  2989 caffelib.cc:136] dest=/home/test/vdcnn_9.prototxt

INFO - 14:40:43 - Tue Sep  5 14:40:43 2017 UTC - 127.0.0.1 "PUT /services/agnews" 201 1001
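
- Train Launch
The train call itself is not reproduced in the report. A plausible reconstruction, based on the request shown in section 2 but with a single data folder and an explicit `test_split`, would be the following (the data path and the 0.06 value are assumptions; 0.06 matches the 7656/127600 split in the log below):
INPUT

curl -X POST 'http://localhost:8100/train' -d '{"service":"agnews","parameters":{"input":{"connector":"txt","embedding":true,"sentences":true,"db":true,"shuffle":true,"test_split":0.06},"mllib":{"gpu":true,"net":{"batch_size":128,"test_batch_size":128},"solver":{"test_interval":1000,"snapshot":1000,"base_lr":0.01,"solver_type":"ADAM","iterations":25000}},"output":{"measure":["mcll","f1","cmdiag","cmfull"]}},"data":["${PATH_DATA}/agnews"]}'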

OUTPUT

INFO - 14:42:15 - Tue Sep 5 14:42:15 2017 UTC - 127.0.0.1 "POST /train" 201 0
I0905 14:42:15.974807  3013 txtinputfileconn.cc:74] txtinputfileconn: list subdirs size=4 loaded text samples=127600
I0905 14:42:18.429330  3013 txtinputfileconn.cc:192] vocabulary size=0 data split test size=7656 / remaining data size=119944 vocab size=0

INFO - 14:42:18 - db_batchsize=119944 / db_testbatchsize=7656

INFO - 14:42:18 - Opened lmdb /home/test/train.lmdb
...
INFO - 14:42:31 - Opened lmdb /home/test/test.lmdb
INFO - 14:42:31 - Processed 1000 text entries
INFO - 14:42:31 - Processed 2000 text entries
INFO - 14:42:31 - Processed 3000 text entries
INFO - 14:42:31 - Processed 4000 text entries
INFO - 14:42:31 - Processed 5000 text entries
INFO - 14:42:31 - Processed 6000 text entries
INFO - 14:42:31 - Processed 7000 text entries


As you can see, the train set contains ~120k examples and the test set contains ~7.6k examples. Two LMDBs are created, `train.lmdb` and `test.lmdb`.
Training is launched and runs fine.

2. Training with own train and test split

This time we launch a training run with our own pre-existing train and test split.
- Service Creation
INPUT

curl -X PUT 'http://localhost:8100/services/agnews' -d '{"mllib":"caffe","description":"agnews classifier","type":"supervised","parameters":{"input":{"connector":"txt","characters":true,"embedding":true,"sequence":1014},"mllib":{"nclasses":4,"template":"vdcnn_9"}},"model":{"templates":"../../templates/caffe/","repository":"/home/test"}}'


OUTPUT

./dede --port 8100
DeepDetect [ commit 62ea2f4c84fe2f10ea9f1ed465139e7a1fc0ef8b ]

INFO - 14:29:03 - Running DeepDetect HTTP server on localhost:8100
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0905 14:29:26.320225  2632 caffelib.cc:131] instantiating model template vdcnn_9
I0905 14:29:26.320309  2632 caffelib.cc:135] source=../../templates/caffe//vdcnn_9/
I0905 14:29:26.320319  2632 caffelib.cc:136] dest=/home/test/vdcnn_9.prototxt

INFO - 14:29:26 - Tue Sep 5 14:29:26 2017 UTC - 127.0.0.1 "PUT /services/agnews" 201 1506

- Train Launch
The `test_split` parameter defaults to 0, and the two folders are given in the `data` field.
INPUT

curl -X POST 'http://localhost:8100/train' -d '{"service":"agnews","parameters":{"input":{"connector":"txt","embedding":true,"sentences":true,"db":true,"shuffle":true},"mllib":{"gpu":true,"net":{"batch_size":128,"test_batch_size":128},"solver":{"test_interval":1000,"snapshot":1000,"base_lr":0.01,"solver_type":"ADAM","iterations":25000}},"output":{"measure":["mcll","f1","cmdiag","cmfull"]}},"data":["${PATH_DATA}/agnews/split/train","${PATH_DATA}/agnews/split/test"]}'


OUTPUT

INFO - 14:32:55 - Tue Sep 5 14:32:55 2017 UTC - 127.0.0.1 "POST /train" 201 0
I0905 14:32:55.195322 2723 txtinputfileconn.cc:74] txtinputfileconn: list subdirs size=4 loaded text samples=120000
I0905 14:32:57.468423 2723 txtinputfileconn.cc:192] vocabulary size=0
I0905 14:32:57.468482 2723 txtinputfileconn.cc:74] txtinputfileconn: list subdirs size=4 loaded text samples=127600
I0905 14:32:57.586179 2723 txtinputfileconn.cc:192] vocabulary size=0

INFO - 14:32:57 - db_batchsize=127600 / db_testbatchsize=0

INFO - 14:32:57 - Opened lmdb /home/test/train.lmdb etc.



As you can see, the train set now contains 127.6k examples and no test set was created. All the data was concatenated into one big dataset: `train.lmdb`.

If we take a closer look at the code behind the connectors, we can see that the svm connector handles train and test sets explicitly (see line 138 of https://github.com/beniz/deepdetect/blob/master/src/svminputfileconn.h).
However, the txt connector has no such handling: the `read_element` function and its `push_back` calls simply concatenate all the data, nothing more (see lines 266 and 271 of https://github.com/beniz/deepdetect/blob/802f316d9c04c9651e2d18a6f1ddf05d7608e195/src/txtinputfileconn.h).

beniz commented 7 years ago

Thanks for catching this one! Could you verify that PR #350 fixes this issue? I've tested it on the above dataset and it works for me with the PR. Thanks.

YaYaB commented 7 years ago

Thanks for the quick PR. Thanks to it, it is now possible to train by giving a train and a test set; a train.lmdb and a test.lmdb are created accordingly.

However, there are several issues (see the two points below). When building those datasets, DD dumps a file called corresp.txt, which maps our labels to the labels in the lmdb.

- In our case, the train subdirectories and the test subdirectories are not read in the same order. That means the mapping between our labels and the ones in train.lmdb differs from the mapping between our labels and the ones in test.lmdb. Since we train on train.lmdb, its labels effectively define the ground truth; when we test on test.lmdb, which uses a different mapping, we do not get representative results. Thus the metrics during the training phase are not correct (except those computed on the train set, such as train_loss).

- In our case we give a train set, then a test set. DD first dumps the corresp.txt associated with the train data, then replaces this file with the one for the test data. That means we lose the mapping that our model will actually use, and we need that mapping if we want to make predictions.

YaYaB commented 7 years ago

The easiest solution, in my opinion, would be to dump only the corresp.txt file associated with the train data and force the mapping of the test set to be the same as that of the train set.

beniz commented 7 years ago

@YaYaB thanks for the update. Can you test the PR again? The two bullet points above have been dealt with and appear to work for us. I don't think the reading order of repositories can change, but the mapping has been enforced, as it already was in the image connector. Thanks.

YaYaB commented 7 years ago

Thanks for updating the PR. Let me show you what I obtain on my side with corresp.txt. If I launch a training using only the train folder, my corresp file contains the following:

3 2
2 1
1 3
0 4

However, if I now use only the test folder during training, I obtain the following in the corresp file:

3 3
2 4
1 2
0 1

According to this (http://man7.org/linux/man-pages/man3/readdir.3.html):

    The order in which filenames are read by successive calls to
    readdir() depends on the filesystem implementation; it is unlikely
    that the names will be sorted in any fashion.

So I am not surprised to obtain different corresp files with different folders. I am testing your new PR now and will get back to you when it is done.

YaYaB commented 7 years ago

It works well for me! FYI, I tested the test_split argument. It still works: the train set is split according to the value of test_split, and the split-off part is then appended to the test set. Is this something we want? Or would it be better to force test_split to 0 when a test set is indicated?
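
For reference, a call like the following would exercise that behavior: it is the train request from above with a non-zero `test_split` added on top of an explicit test folder (the 0.1 value is purely illustrative):

curl -X POST 'http://localhost:8100/train' -d '{"service":"agnews","parameters":{"input":{"connector":"txt","embedding":true,"sentences":true,"db":true,"shuffle":true,"test_split":0.1},"mllib":{"gpu":true,"net":{"batch_size":128,"test_batch_size":128},"solver":{"test_interval":1000,"base_lr":0.01,"solver_type":"ADAM","iterations":25000}},"output":{"measure":["mcll","f1","cmdiag","cmfull"]}},"data":["${PATH_DATA}/agnews/split/train","${PATH_DATA}/agnews/split/test"]}'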

beniz commented 7 years ago

Yes, the rationale is that the first directory is train and the remaining directories are all test.
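
Concretely, a call of the following shape (a sketch based on the requests above; the `test_a`/`test_b` folder names are hypothetical) would use the first folder for training and treat all remaining folders as test data:

curl -X POST 'http://localhost:8100/train' -d '{"service":"agnews","parameters":{"input":{"connector":"txt","embedding":true,"sentences":true,"db":true,"shuffle":true},"mllib":{"gpu":true,"net":{"batch_size":128,"test_batch_size":128},"solver":{"test_interval":1000,"base_lr":0.01,"solver_type":"ADAM","iterations":25000}},"output":{"measure":["mcll","f1","cmdiag","cmfull"]}},"data":["${PATH_DATA}/agnews/split/train","${PATH_DATA}/agnews/split/test_a","${PATH_DATA}/agnews/split/test_b"]}'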

beniz commented 7 years ago

Thanks for the thorough report. Fix has been merged.