rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

Experiments with IAM Database #10

Closed wellescastro closed 7 years ago

wellescastro commented 7 years ago

Hi, I'm trying to reproduce the experiments with the IAM database, but I'm facing some problems. First, I followed the instructions in this comment https://github.com/rwth-i6/returnn/issues/5#issuecomment-283658973 in order to work with the most recent version of RETURNN, but when executing config_real I got the following output:

WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL: https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 1060 6GB (CNMeM is enabled with initial size: 90.0% of memory, cuDNN 5110)
CRNN starting up, version 20170510.185843--git-e8453fc-dirty, pid 5181
CRNN command line options: ['config_real']
Theano: 0.9.0.dev-c697eeab84... ( in /home/dayvidwelles/anaconda2/lib/python2.7/site-packages/theano)
Device train-network: Used data keys: ['classes', 'data', u'sizes']
using adam with nag and momentum schedule
Devices: Used in blocking / single proc mode.
loading file features/raw/train_valid.h5
cached 616 seqs 1.88295629248 GB (fully loaded, 40.7837103745 GB left over)
loading file features/raw/train.1.h5
loading file features/raw/train.2.h5
cached 5545 seqs 16.9867935553 GB (fully loaded, 39.2744432362 GB left over)
Train data:
  input: 1 x 1
  output: {u'classes': [79, 1], 'data': [1, 2], u'sizes': [2, 1]}
  HDF dataset, sequences: 5545, frames: 1519952558
Dev data:
  HDF dataset, sequences: 616, frames: 168484077
Devices:
  gpu0: GeForce GTX 1060 6GB (units: 1280 clock: 1.58Ghz memory: 6.0GB) working on 1 batch (update on device)
Learning-rate-control: no file specified, not saving history (no proper restart possible)
using adam with nag and momentum schedule
Network layer topology:
  input #: 1
  hidden 1Dto2D '1Dto2D' #: 1
  hidden source 'classes_source' #: 2
  hidden conv2 'conv0' #: 15
  hidden conv2 'conv1' #: 45
  hidden conv2 'conv2' #: 75
  hidden conv2 'conv3' #: 105
  hidden conv2 'conv4' #: 105
  hidden mdlstm 'mdlstm0' #: 30
  hidden mdlstm 'mdlstm1' #: 60
  hidden mdlstm 'mdlstm2' #: 90
  hidden mdlstm 'mdlstm3' #: 120
  hidden mdlstm 'mdlstm4' #: 120
  output softmax 'output' #: 82
net params #: 2627902
net trainable params: [W_conv0, b_conv0, W_conv1, b_conv1, W_conv2, b_conv2, W_conv3, b_conv3, W_conv4, b_conv4, U1_mdlstm0, U2_mdlstm0, U3_mdlstm0, U4_mdlstm0, V1_mdlstm0, V2_mdlstm0, V3_mdlstm0, V4_mdlstm0, W1_mdlstm0, W2_mdlstm0, W3_mdlstm0, W4_mdlstm0, b1_mdlstm0, b2_mdlstm0, b3_mdlstm0, b4_mdlstm0, U1_mdlstm1, U2_mdlstm1, U3_mdlstm1, U4_mdlstm1, V1_mdlstm1, V2_mdlstm1, V3_mdlstm1, V4_mdlstm1, W1_mdlstm1, W2_mdlstm1, W3_mdlstm1, W4_mdlstm1, b1_mdlstm1, b2_mdlstm1, b3_mdlstm1, b4_mdlstm1, U1_mdlstm2, U2_mdlstm2, U3_mdlstm2, U4_mdlstm2, V1_mdlstm2, V2_mdlstm2, V3_mdlstm2, V4_mdlstm2, W1_mdlstm2, W2_mdlstm2, W3_mdlstm2, W4_mdlstm2, b1_mdlstm2, b2_mdlstm2, b3_mdlstm2, b4_mdlstm2, U1_mdlstm3, U2_mdlstm3, U3_mdlstm3, U4_mdlstm3, V1_mdlstm3, V2_mdlstm3, V3_mdlstm3, V4_mdlstm3, W1_mdlstm3, W2_mdlstm3, W3_mdlstm3, W4_mdlstm3, b1_mdlstm3, b2_mdlstm3, b3_mdlstm3, b4_mdlstm3, U1_mdlstm4, U2_mdlstm4, U3_mdlstm4, U4_mdlstm4, V1_mdlstm4, V2_mdlstm4, V3_mdlstm4, V4_mdlstm4, W1_mdlstm4, W2_mdlstm4, W3_mdlstm4, W4_mdlstm4, b1_mdlstm4, b2_mdlstm4, b3_mdlstm4, b4_mdlstm4, W_in_mdlstm4_output, b_output]
start training at epoch 1 and batch 0
using batch size: 1, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
TaskThread train failed
Unhandled exception <type 'exceptions.AssertionError'> in thread <TrainTaskThread(TaskThread train, started daemon 140063963887360)>, proc 5181.
EXCEPTION
Traceback (most recent call last):
  File "/home/dayvidwelles/MasterResearch/main/returnn/EngineTask.py", line 376, in run
    line: self.run_inner()
    locals:
      self = <TrainTaskThread(TaskThread train, started daemon 140063963887360)>
      self.run_inner = <bound method TrainTaskThread.run_inner of <TrainTaskThread(TaskThread train, started daemon 140063963887360)>>
  File "/home/dayvidwelles/MasterResearch/main/returnn/EngineTask.py", line 401, in run_inner
    line: device.prepare(epoch=self.epoch, **self.get_device_prepare_args())
    locals:
      device = <Device.Device object at 0x7f63a5393650>
      device.prepare = <bound method Device.prepare of <Device.Device object at 0x7f63a5393650>>
      epoch =
      self = <TrainTaskThread(TaskThread train, started daemon 140063963887360)>
      self.epoch = 1
      self.get_device_prepare_args = <bound method TrainTaskThread.get_device_prepare_args of <TrainTaskThread(TaskThread train, started daemon 140063963887360)>>
  File "/home/dayvidwelles/MasterResearch/main/returnn/Device.py", line 1359, in prepare
    line: self.set_net_params(network)
    locals:
      self = <Device.Device object at 0x7f63a5393650>
      self.set_net_params = <bound method Device.set_net_params of <Device.Device object at 0x7f63a5393650>>
      network = <Network.LayerNetwork object at 0x7f6353c79290>
  File "/home/dayvidwelles/MasterResearch/main/returnn/Device.py", line 1172, in set_net_params
    line: self.trainnet.set_params_by_dict(network.get_params_dict())
    locals:
      self = <Device.Device object at 0x7f63a5393650>
      self.trainnet = <Network.LayerNetwork object at 0x7f63a5354bd0>
      self.trainnet.set_params_by_dict = <bound method LayerNetwork.set_params_by_dict of <Network.LayerNetwork object at 0x7f63a5354bd0>>
      network = <Network.LayerNetwork object at 0x7f6353c79290>
      network.get_params_dict = <bound method LayerNetwork.get_params_dict of <Network.LayerNetwork object at 0x7f6353c79290>>
  File "/home/dayvidwelles/MasterResearch/main/returnn/Network.py", line 680, in set_params_by_dict
    line: self.output[name].set_params_by_dict(params[name])
    locals:
      self = <Network.LayerNetwork object at 0x7f63a5354bd0>
      self.output = {'output': <<class 'NetworkOutputLayer.SequenceOutputLayer'> class:softmax name:output>}
      name = 'output', len = 6
      set_params_by_dict =
      params = {'1Dto2D': {}, 'mdlstm1': {'b4_mdlstm1': CudaNdarray([ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. ..., len = 13
  File "/home/dayvidwelles/MasterResearch/main/returnn/NetworkBaseLayer.py", line 148, in set_params_by_dict
    line: (self, p, self_param_shape, v.shape)
    locals:
      self = <<class 'NetworkOutputLayer.SequenceOutputLayer'> class:softmax name:output>
      p = 'W_in_mdlstm4_output', len = 19
      self_param_shape = (120, 80)
      v = CudaNdarray([[ 0.1616703 0.27070677 0.27121079 ..., 0.36024842 0.06087828 0.27304971] [-0.29783311 -0.72121382 0.00289656 ..., 0.37427774 0.14804624 0.28115278] [-0.03678599 0.25347066 -0.31765896 ..., -0.37810767 -0.14737074 0.09291939] ..., [-0.50088167 -0.01235303 -0.043..., len = 120, [0]: {len = 82, _[0]: {len = 0}}
      v.shape = (120, 82)
AssertionError: In <<class 'NetworkOutputLayer.SequenceOutputLayer'> class:softmax name:output>, param W_in_mdlstm4_output shape does not match. Expected (120, 80), got (120, 82).

KeyboardInterrupt Quitting

UPDATE: The same happens when using older commits such as bdbeb33.

doetsch commented 7 years ago

I am wondering where this 82 comes from ... The issue is that your data files (correctly) have 79 different labels, +1 for blank should result in a softmax of size 80. However, apparently you configured the network to expect 82 labels instead, hence the error.

In the config file we specify the outputs as "num_outputs": {"data": [1,2], "classes": [79,1], "sizes": [2,1]}

Could it be possible that you accidentally changed this line?
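
As a quick way to see the arithmetic, here is a minimal editor-added sketch (not part of the original exchange) that derives the expected CTC softmax size from the num_outputs entry; it assumes the config_real file shown in the next comment parses as plain JSON:

import json

# Editor sketch: derive the expected CTC softmax size from the config.
# 79 labels + 1 blank = 80 outputs; an 82-unit softmax would only arise
# if 81 labels were found during preprocessing.
with open("config_real") as f:
    config = json.load(f)

n_labels = config["num_outputs"]["classes"][0]  # 79 in this setup
expected_softmax_size = n_labels + 1            # +1 for the CTC blank label
print("labels: %d, expected softmax size: %d" % (n_labels, expected_softmax_size))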

wellescastro commented 7 years ago

Hi doetsch, I didn't change this line. This is the content of the config_real file:

{
"task": "train",
"device": "gpu0",
"multiprocessing": false,
"batching": "default",
"cache_size": "256G",
"max_seqs": 10,
"batch_size": 600000,
"nadam": 1,
"learning_rate": 0.0005,
"learning_rates": "{1: 0.0005, 25: 0.0003, 35: 0.0001}",
"num_epochs": 200,
"update_batch_size": 2000000,
"reinit": 0,
"log_verbosity": 5,
"train": "features/raw/train.1.h5,features/raw/train.2.h5",
"dev": "features/raw/train_valid.h5",
"start_epoch": 1,
"num_inputs": 1,
"num_outputs": {"data": [1,2], "classes": [79,1], "sizes": [2,1]},
"network": {
"classes_source": {"class": "source", "data_key": "sizes", "from": [], "n_out": 2, "dtype":"int32"},
"1Dto2D":  {"class": "1Dto2D", "from" : ["data", "classes_source"] },
"conv0":   { "class" : "conv2", "n_features" : 15, "filter" : [3, 3], "pool_size" : [2, 2],  "from" : ["1Dto2D"] },
"mdlstm0": { "class" : "mdlstm", "n_out" : 30, "dropout": 0.25, "from" : ["conv0"] },
"conv1" :  { "class" : "conv2", "n_features" : 45,  "dropout": 0.25, "filter" : [3, 3], "pool_size" : [2, 2],  "from" : ["mdlstm0"] },
"mdlstm1": { "class" : "mdlstm", "n_out" : 60, "dropout": 0.25, "from" : ["conv1"] },
"conv2" :  { "class" : "conv2", "n_features" : 75, "dropout": 0.25, "filter" : [3, 3], "pool_size" : [2, 2],  "from" : ["mdlstm1"] },
"mdlstm2": { "class" : "mdlstm", "n_out" : 90, "dropout": 0.25, "from" : ["conv2"] },
"conv3" :  { "class" : "conv2", "n_features" : 105, "dropout": 0.25, "filter" : [3, 3], "pool_size" : [1, 1],  "from" : ["mdlstm2"] },
"mdlstm3": { "class" : "mdlstm", "n_out" : 120, "dropout": 0.25, "from" : ["conv3"] },
"conv4" :  { "class" : "conv2", "n_features" : 105, "dropout": 0.25, "filter" : [3, 3], "pool_size" : [1, 1],  "from" : ["mdlstm3"] },
"mdlstm4": { "class" : "mdlstm", "n_out" : 120, "dropout": 0.25, "from" : ["conv4"] },
"output":  { "class" : "softmax", "loss": "ctc", "dropout": 0.25, "from" : ["mdlstm4"] }
},

"ctc_prior_file": "priors/mdlstm_real",
"model": "models/mdlstm_real",
"log": "log/mdlstm_real"
}

Additional information:
GPU: GTX 1060 6GB
Theano: 0.9
Python version: 2.7 (using Anaconda2)

This is the content of the .theanorc file:

[global] 
floatX = float32
device = gpu0
[nvcc]
fastmath = True
[cuda]
root=/usr/local/cuda-8.0
[lib]
cnmem=0.9

Here's the script that I'm executing:

read -p "IAM username: " iam_username
wget http://www.fki.inf.unibe.ch/DBs/iamDB/data/lines/lines.tgz http://www.fki.inf.unibe.ch/DBs/iamDB/data/ascii/lines.txt --user $iam_username --ask-password
mkdir tmp_IAM_lines
tar -zxvf lines.tgz -C tmp_IAM_lines
find tmp_IAM_lines -name '*.png' -exec mv {} IAM_lines/ \;
rm -r tmp_IAM_lines
sed -i '252s|[#]||g' create_IAM_dataset.py
sed -i '253s|[#]||g' create_IAM_dataset.py
sed -i 's|config_demo|config_real|g' go.sh
sh go.sh

go.sh content:

./create_IAM_dataset.py
mkdir -p models log priors
python ../../../rnn.py config_real

I tried the most recent commit, an old one (bdbeb33), and finally commit 80a7c0e (which you said you had tested in another issue: https://github.com/rwth-i6/returnn/issues/4#issuecomment-283953571).

doetsch commented 7 years ago

I am not able to reproduce the error on the current HEAD. Please check the following:

  1. use the most recent commit; those former problems were all fixed
  2. rm models/*, so that it definitely doesn't load a previous model with possibly wrong settings
  3. device in .theanorc should be cpu; RETURNN creates its own GPU context internally
  4. look at the dataset called 'labels' in one of your feature files (e.g. h5dump -d labels path/to/file). It should print 79 labels, together with meta information. If not, then something went wrong during preprocessing (check whether chars.txt was modified / indeed contains only 79 labels); see the sketch below for a quick programmatic check
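
For step 4, a minimal editor-added sketch of the same check in Python, assuming the feature files store the label inventory in an HDF dataset named labels (as the h5dump command above implies) and that h5py is installed:

import h5py

# Editor sketch: count the labels stored in a RETURNN HDF feature file.
# Expect 79 here; 81 labels would explain a softmax of size 82 (81 + 1 blank).
def count_labels(path):
    with h5py.File(path, "r") as f:
        labels = list(f["labels"])
    print("%s: %d labels" % (path, len(labels)))
    return len(labels)

count_labels("features/raw/train.1.h5")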
wellescastro commented 7 years ago

After checking all these steps, the error still happens. Then I tried two scenarios:

albertz commented 7 years ago

multiprocessing = False is not well tested and almost no one uses it, so I expect that there are bugs. Feel free to fix that! Pull requests are welcome.

Do you run out of GPU memory? Do you have device = cpu in your theanorc? That's important, otherwise, with multiprocessing = True, you will have two copies of your model in the GPU memory, one in the parent crnn process and one in its child process.

Or do you run out of CPU memory? In that case, reduce your cache_size option.

doetsch commented 7 years ago

As albertz already mentioned, multiprocessing should never be false. Reducing cache_size to something around 6G should help if you run out of system memory.

Regarding GPU memory, it is really important to set device to cpu in the .theanorc; otherwise the whole network will be loaded twice into GPU memory. If you still run out of GPU memory, reduce batch_size in the config file.
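
For concreteness, an editor-added illustration of the config keys this advice refers to, written as a Python dict purely for readability (the configs in this thread are JSON files, so the same values would be edited there); the batch_size value is only an example of "reduce it", not a recommendation from the thread:

# Editor sketch, illustrative values only.
suggested_overrides = {
    "multiprocessing": True,   # never set this to false
    "cache_size": "6G",        # cap dataset caching if system RAM is tight
    "batch_size": 300000,      # example of lowering the default 600000 if GPU memory runs out
}
# In addition, ~/.theanorc should contain: device = cpu
# so that the parent process does not also allocate the model on the GPU.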

wellescastro commented 7 years ago

Hi, albertz and doetsch!

First, I would like to thank you all for the support. I didn't know that the right way to deal with out-of-memory errors is to change cache_size or batch_size instead of just setting multiprocessing = false. Here's what happened:

  1. After setting device = cpu, the system stopped loading the network twice in GPU memory.
  2. After setting multiprocessing = true, the system was running out of CPU memory, but then I realized that the TaskSystem.py --forkExecProc process wasn't really killed after my previous attempts raised an exception; after rebooting my computer, the RNN started to train correctly.

But then the network crashed at some point around epoch 2, which suggests I was running out of system memory again, so I set cache_size = 6G; this time the network got further into training but crashed again. I kept decreasing cache_size, and each time the network reached a later point. After setting cache_size = 0G, the network has not crashed so far, but each epoch takes too long to finish and RAM usage keeps increasing, so I think the training will stop at some point. Indeed, all the runs I performed were taking a long time. So, given the log and computer specifications below, do you think it's normal to take this long? (The log is from the run with cache_size = 0G.)

train epoch 1, batch 3749, cost:output 2.66122289804, elapsed 2:38:41, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 2:38:41, 99.22% computing, 0.03% updating data                                                                                                                                                       
Save model from epoch 1 under models/mdlstm_real.001          

train epoch 2, batch 3749, cost:output 0.835049109025, elapsed 5:15:45, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 5:15:45, 49.84% computing, 0.01% updating data                                                                                                                                                       
Save model from epoch 2 under models/mdlstm_real.002       

train epoch 3, batch 3749, cost:output 0.447978720945, elapsed 5:12:03, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 5:12:03, 50.38% computing, 0.02% updating data                                                                                                                                                       
Save model from epoch 3 under models/mdlstm_real.003    

train epoch 4, batch 3749, cost:output 0.354026508331, elapsed 3:19:50, exp. remaining 0:00:00, complete 100.00%
Device gpuX proc epoch time stats: total 3:19:50, 79.01% computing, 0.05% updating data                                                                                                                                                       
Save model from epoch 4 under models/mdlstm_real.004   

Note: Using gpu device 0: GeForce GTX 1060 6GB (CNMeM is enabled with initial size: 90.0% of memory, cuDNN 5110)

albertz commented 7 years ago

Btw, the stats total 2:38:41, 99.22% computing, 0.03% updating data tell you what fraction of that time was spent inside the Theano function computation, so the first epoch looks good. In the other epochs, it seems like half of the time is spent somewhere else - probably on data loading.

That sounds like a memory leak, maybe in the Dataset somewhere, maybe also in Theano. Did you try some other Theano version? Otherwise, can you try a Python memory profiler?
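
One simple way to watch for such a leak from outside the training process is to poll its resident memory; here is a minimal editor-added sketch using only the standard library (Linux /proc is assumed, and the script name and 60-second interval are arbitrary):

import sys
import time

# Editor sketch: poll the resident set size of a running process (Linux only).
# Usage: python memwatch.py <pid-of-the-crnn-process>
def rss_mb(pid):
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # value is in kB
    return 0.0

if __name__ == "__main__":
    pid = int(sys.argv[1])
    while True:
        print("%s  RSS: %.1f MB" % (time.strftime("%H:%M:%S"), rss_mb(pid)))
        time.sleep(60)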

wellescastro commented 7 years ago

Well, after running some tests with Theano 0.9 (CUDA 8.0 and cuDNN 5.1) and now with Theano 0.8.2 (CUDA 8.0 and cuDNN 5.0.5), the slowness still continues and, as you already pointed out, it is probably in the data loading.

It's always the same behavior: in the first epoch, since the data is loaded before training begins, the execution is quite fast. Here's a short excerpt from the log at epoch 1:

running 2 sequence slices (477432 nts) of batch 0 on device gpu0
train epoch 1, batch 0, cost:output 19.2419134911, elapsed 0:00:02, exp. remaining 1:07:28, complete 0.05%
running 3 sequence slices (456672 nts) of batch 1 on device gpu0
train epoch 1, batch 1, cost:output 15.6205499822, elapsed 0:00:03, exp. remaining 0:56:34, complete 0.11%
running 2 sequence slices (386750 nts) of batch 2 on device gpu0
train epoch 1, batch 2, cost:output 16.7889182351, elapsed 0:00:05, exp. remaining 1:02:21, complete 0.14%

Looking at the end of the log at epoch 1 we got: Device gpuX proc epoch time stats: total 2:39:01, 99.24% computing, 0.03% updating data

Then, when the second epoch begins, we can see that the system keeps reloading the data at each training step. The execution visibly slows down every single time it prints loading file features/raw/train.1.h5 or loading file features/raw/train.2.h5.

train epoch 2, batch 8, cost:output 2.69350286116, elapsed 0:00:26, exp. remaining 2:45:31, complete 0.27%
loading file features/raw/train.1.h5
running 1 sequence slices (508667 nts) of batch 9 on device gpu0
loading file features/raw/train.2.h5
train epoch 2, batch 9, cost:output 2.34307306463, elapsed 0:00:31, exp. remaining 2:44:17, complete 0.32%
running 2 sequence slices (473478 nts) of batch 10 on device gpu0
loading file features/raw/train.2.h5
train epoch 2, batch 10, cost:output 3.02731769005, elapsed 0:00:33, exp. remaining 2:41:41, complete 0.35%
running 2 sequence slices (495972 nts) of batch 11 on device gpu0
loading file features/raw/train.1.h5
loading file features/raw/train.2.h5
train epoch 2, batch 11, cost:output 2.32541390352, elapsed 0:00:35, exp. remaining 2:40:01, complete 0.37%
running 1 sequence slices (251024 nts) of batch 12 on device gpu0
train epoch 2, batch 12, cost:output 2.3515547149, elapsed 0:00:38, exp. remaining 2:37:56, complete 0.40%
running 2 sequence slices (413960 nts) of batch 13 on device gpu0
train epoch 2, batch 13, cost:output 2.1086822159, elapsed 0:00:39, exp. remaining 2:45:31, complete 0.40%

In order to check how the training goes without having to switch between train.1.h5 and train.2.h5, I removed the second one and ran the first five epochs (see the results below). Examining the log: after finishing the first epoch, as usual, the system starts loading the dataset at each training step, which results in a low computation fraction. Around epoch 4, the occurrences of "loading file features/raw/train.1.h5" decreased a lot, and at epoch 5 the data is loaded just once, which explains the 99.26% computing.

1: Device gpuX proc epoch time stats: total 1:19:18, 99.41% computing, 0.03% updating data
2: Device gpuX proc epoch time stats: total 1:54:09, 69.49% computing, 0.02% updating data
3: Device gpuX proc epoch time stats: total 1:47:42, 73.55% computing, 0.02% updating data
4: Device gpuX proc epoch time stats: total 1:27:25, 90.51% computing, 0.03% updating data
5: Device gpuX proc epoch time stats: total 1:19:48, 99.26% computing, 0.03% updating data

So, I will try to perform a complete run and hopefully the system will stabilize after passing through epochs two and three. Unfortunately, I'm still seeing a gradual increase in RAM usage over the epochs even with cache_size = 0G, so maybe I will face another out-of-memory error. Anyway, I will buy another 8 GB RAM stick just in case.
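
To quantify the reloading described above, here is a small editor-added sketch that counts the "loading file" messages per epoch in the training log; the log path comes from the "log" entry of the config and may differ on your setup:

from collections import Counter

# Editor sketch: count dataset (re)loads per epoch from the RETURNN log,
# using the "start epoch N ..." and "loading file ..." lines shown above.
loads = Counter()
epoch = 0
with open("log/mdlstm_real") as f:
    for line in f:
        if line.startswith("start epoch "):
            epoch = int(line.split()[2])
        elif line.startswith("loading file "):
            loads[epoch] += 1

for ep in sorted(loads):
    print("epoch %d: %d dataset loads" % (ep, loads[ep]))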

doetsch commented 7 years ago

cache_size 0 is obviously extremely slow because it will then load every image from disk before processing it. I would need to look into the problem where cache_size is > 0 but not large enough to store all the data... If you still want to see some results, just reduce the data, e.g. by adjusting lines.txt before doing the preprocessing.
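
One possible way to do that reduction (an editor-added sketch, not from the thread): keep only a subset of the entries in lines.txt before running the preprocessing; the 2000-line cut-off and the output filename are arbitrary, and it is assumed that create_IAM_dataset.py reads whatever file ends up in place of lines.txt:

# Editor sketch: write a smaller lines.txt with only the first 2000 entries.
# The IAM lines.txt starts with comment lines prefixed by '#', which are kept.
keep = 2000
kept = 0
with open("lines.txt") as src, open("lines_small.txt", "w") as dst:
    for line in src:
        if line.startswith("#"):
            dst.write(line)
        elif kept < keep:
            dst.write(line)
            kept += 1

Afterwards, either point the preprocessing at lines_small.txt or overwrite lines.txt with it before running go.sh.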

wellescastro commented 7 years ago

Hi! After testing with 16 GB RAM, the training is performing well. Now I'm able to train without having to change cache_size or reduce the dataset, I just used the original config file.

I trained the 10-layer model and then used config_fwd to produce the log-posteriors for the IAM validation set. As I'm trying to replicate "Handwriting Recognition with Large Multidimensional Long Short-Term Memory Recurrent Neural Networks", the current challenge for me is to understand how to use the RASR toolkit to decode the HDF5 file generated by forwarding, emulating a CTC topology and applying the HMM decoder. I'm sorry for bothering you again, but could you at least point me in the right direction for this recognition step? Did you use something like the Hybrid Recognition with Recurrent Neural Networks guide or the RWTH OCR add-on? I'm having a hard time finding the correct procedure in RASR, but I would like to use it.

Thank you!

selcouthlyBlue commented 7 years ago

Why does training take a long time even for just one epoch? I'm using config_real by the way.

Using gpu device 0: GeForce GTX 1070 (CNMeM is disabled, cuDNN 5105)
...
Device gpu0 proc epoch time stats: total 5:43:03, 99.74% computing, 0.01% updating data
Save model from epoch 1 under models/mdlstm_real.001

The specs of the machine:

GPU -- GTX 1070
RAM -- 32 GB

config file: config_real

{
"task": "train",
"device": "gpu0",
"multiprocessing": true,
"batching": "default",
"cache_size": "256G",
"max_seqs": 10,
"batch_size": 600000,
"nadam": 1,
"learning_rate": 0.01,
"learning_rates": "{1: 0.01, 25: 0.005, 35: 0.001}",
"num_epochs": 200,
"update_batch_size": 2000000,
"reinit": 0,
"log_verbosity": 5,
"train": "features/raw/train.1.h5,features/raw/train.2.h5",
"dev": "features/raw/train_valid.h5",
"start_epoch": 1,
"num_inputs": 1,
"num_outputs": {"data": [1,2], "classes": [79,1], "sizes": [2,1]},
... (network part omitted)
"ctc_prior_file": "priors/mdlstm_real",
"model": "models/mdlstm_real",
"log": "log/mdlstm_real"
}

Also, is it possible to use multiple GPUs for training?

pvoigtlaender commented 7 years ago

@selcouthlyBlue We noticed that with newer hardware and/or CUDA versions, the performance sometimes becomes considerably worse compared to what we usually saw. It might be worth trying to downgrade to CUDA 7.5 (assuming you're using CUDA 8 at the moment). If this improves the performance, please let us know. Some benchmarks of a different implementation with different CUDA versions can be found here: https://github.com/jpuigcerver/rnn2d/wiki/LSTM-2D

interxuxing commented 7 years ago

@selcouthlyBlue How long does it take you to train one epoch on the IAM dataset? I followed the instructions in https://gist.github.com/cwig/315d212964542f7f1797d5fdd122891e and directly used the config_real file, like @selcouthlyBlue. On my server (GPU: GTX 1080, RAM: 64 GB, CUDA 8.0, cuDNN 5.1), it takes almost 2 days for only 1 epoch! Weird...

Do you mean that if we downgrade CUDA to version 7.5, training will be faster?

The ICFHR 2016 paper reports in Table I that each epoch takes less than 1 hour (54 min), but on my machine it seems to take much more time, as follows:

(screenshot of the training log)

What is the problem? @pvoigtlaender, thank you!

pvoigtlaender commented 7 years ago

Hi,

I just checked my log file (which is from December 2016, however); I will rerun it with a new version when I have time. With a GTX 980 Ti, I got this log:

using batch size: 600000, max seqs: 10
learning rate control: ConstantLearningRate(defaultLearningRate=0.0005, minLearningRate=0.0, defaultLearningRates={1: 0.0005, 25: 0.0003, 35: 0.0001}, errorMeasureKey=None, relativeErrorAlsoRelativeToLearningRate=False, minNumEpochsPerNewLearningRate=0, filename=None), epoch data: 1: EpochData(learningRate=0.0005, error={}), 25: EpochData(learningRate=0.0003, error={}), 35: EpochData(learningRate=0.0001, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 0.0005 ...
starting task train
running 2 sequence slices (414612 nts) of batch 0 on device gpu0
train epoch 1, batch 0, cost:output 19.2419147897, elapsed 0:00:01, exp. remaining 1:00:12, complete 0.05%
running 3 sequence slices (360192 nts) of batch 1 on device gpu0
train epoch 1, batch 1, cost:output 15.7143249512, elapsed 0:00:03, exp. remaining 0:49:17, complete 0.11%
running 3 sequence slices (517080 nts) of batch 2 on device gpu0
train epoch 1, batch 2, cost:output 16.2926053779, elapsed 0:00:04, exp. remaining 0:49:48, complete 0.16%
running 3 sequence slices (584640 nts) of batch 3 on device gpu0
train epoch 1, batch 3, cost:output 15.2745442708, elapsed 0:00:06, exp. remaining 0:51:29, complete 0.22%
running 4 sequence slices (561000 nts) of batch 4 on device gpu0
train epoch 1, batch 4, cost:output 15.2505910238, elapsed 0:00:08, exp. remaining 1:02:52, complete 0.22%
running 4 sequence slices (513480 nts) of batch 5 on device gpu0
train epoch 1, batch 5, cost:output 14.0684108734, elapsed 0:00:09, exp. remaining 0:44:10, complete 0.36%
running 3 sequence slices (490590 nts) of batch 6 on device gpu0
train epoch 1, batch 6, cost:output 9.54984585074, elapsed 0:00:11, exp. remaining 0:44:39, complete 0.41%
running 3 sequence slices (456960 nts) of batch 7 on device gpu0
train epoch 1, batch 7, cost:output 7.57784688104, elapsed 0:00:12, exp. remaining 0:44:47, complete 0.47%
running 2 sequence slices (479160 nts) of batch 8 on device gpu0
train epoch 1, batch 8, cost:output 14.5766351873, elapsed 0:00:14, exp. remaining 0:48:15, complete 0.50%
running 3 sequence slices (597780 nts) of batch 9 on device gpu0
train epoch 1, batch 9, cost:output 12.5525734641, elapsed 0:00:16, exp. remaining 0:49:06, complete 0.56%
running 4 sequence slices (565800 nts) of batch 10 on device gpu0
train epoch 1, batch 10, cost:output 9.72981361005, elapsed 0:00:18, exp. remaining 0:53:39, complete 0.56%
running 2 sequence slices (286810 nts) of batch 11 on device gpu0
train epoch 1, batch 11, cost:output 10.387516345, elapsed 0:00:19, exp. remaining 0:48:18, complete 0.67%
running 2 sequence slices (414810 nts) of batch 12 on device gpu0
train epoch 1, batch 12, cost:output 16.8784329928, elapsed 0:00:21, exp. remaining 0:50:02, complete 0.70%
running 2 sequence slices (418110 nts) of batch 13 on device gpu0
train epoch 1, batch 13, cost:output 8.35984860191, elapsed 0:00:23, exp. remaining 0:51:37, complete 0.74%
running 3 sequence slices (359832 nts) of batch 14 on device gpu0
train epoch 1, batch 14, cost:output 13.6388293457, elapsed 0:00:24, exp. remaining 0:50:41, complete 0.79%
running 2 sequence slices (435252 nts) of batch 15 on device gpu0
train epoch 1, batch 15, cost:output 5.76873870394, elapsed 0:00:26, exp. remaining 0:52:19, complete 0.83%
running 2 sequence slices (409510 nts) of batch 16 on device gpu0
train epoch 1, batch 16, cost:output 9.41634662335, elapsed 0:00:28, exp. remaining 0:53:36, complete 0.87%
running 2 sequence slices (455600 nts) of batch 17 on device gpu0
train epoch 1, batch 17, cost:output 6.37458184425, elapsed 0:00:30, exp. remaining 0:57:23, complete 0.87%
running 3 sequence slices (494832 nts) of batch 18 on device gpu0
train epoch 1, batch 18, cost:output 7.54132385254, elapsed 0:00:31, exp. remaining 0:54:51, complete 0.96%
running 2 sequence slices (429350 nts) of batch 19 on device gpu0
train epoch 1, batch 19, cost:output 4.49527751866, elapsed 0:00:33, exp. remaining 0:55:56, complete 0.99%
running 2 sequence slices (451836 nts) of batch 20 on device gpu0

So the epoch time is estimated at less than 1 hour.

It's hard to say why it's so slow for you. The issue with CUDA led to a slowdown, but it was not that significant. You should make sure that cuDNN is set up correctly. Also, I think at some point we had a bug in the dataset loading which made things slow. You could try reverting to an older version of RETURNN (e.g. when the MDLSTM demo was added) and see if it changes anything.
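
A quick editor-added check that Theano actually finds cuDNN, using the old theano.sandbox.cuda backend this thread runs on (if your Theano version lays the module out differently, treat the import path as an assumption); run it with the GPU enabled, e.g. THEANO_FLAGS=device=gpu0,floatX=float32:

# Editor sketch: verify that the old Theano GPU backend can use cuDNN.
import theano.sandbox.cuda.dnn as dnn

if dnn.dnn_available():
    print("cuDNN available, version: %s" % (dnn.version(),))
else:
    print("cuDNN NOT available: %s" % dnn.dnn_available.msg)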

pvoigtlaender commented 7 years ago

I just updated to the new version and ran it again. Things got slower, but only by a factor of less than 3, so this could be the slowdown related to CUDA:

starting task train
running 2 sequence slices (414612 nts) of batch 0 on device gpu0
train epoch 1, batch 0, cost:output 19.2419147897, elapsed 0:00:06, exp. remaining 3:20:56, complete 0.05%
running 3 sequence slices (360192 nts) of batch 1 on device gpu0
train epoch 1, batch 1, cost:output 15.7143249512, elapsed 0:00:10, exp. remaining 2:35:16, complete 0.11%
running 3 sequence slices (517080 nts) of batch 2 on device gpu0
train epoch 1, batch 2, cost:output 16.2926039585, elapsed 0:00:15, exp. remaining 2:34:47, complete 0.16%
running 3 sequence slices (584640 nts) of batch 3 on device gpu0
train epoch 1, batch 3, cost:output 15.2745442708, elapsed 0:00:20, exp. remaining 2:39:58, complete 0.22%
running 4 sequence slices (561000 nts) of batch 4 on device gpu0
train epoch 1, batch 4, cost:output 15.2505910238, elapsed 0:00:24, exp. remaining 3:10:56, complete 0.22%
running 4 sequence slices (513480 nts) of batch 5 on device gpu0
train epoch 1, batch 5, cost:output 14.0684108734, elapsed 0:00:28, exp. remaining 2:12:51, complete 0.36%
running 3 sequence slices (490590 nts) of batch 6 on device gpu0
train epoch 1, batch 6, cost:output 9.54984585074, elapsed 0:00:33, exp. remaining 2:14:54, complete 0.41%
running 3 sequence slices (456960 nts) of batch 7 on device gpu0
train epoch 1, batch 7, cost:output 7.57785443178, elapsed 0:00:38, exp. remaining 2:15:11, complete 0.47%
running 2 sequence slices (479160 nts) of batch 8 on device gpu0
train epoch 1, batch 8, cost:output 14.5763586796, elapsed 0:00:45, exp. remaining 2:29:12, complete 0.50%
running 3 sequence slices (597780 nts) of batch 9 on device gpu0

However, your slowdown was more than a factor of 10, so it looks like something else.

interxuxing commented 7 years ago

Hi @pvoigtlaender, thank you very much for your kind help! Yes, following your advice, I reinstalled CUDA 7.5, and it seems that this accelerated training; now the time for 1 epoch is about 11 hours, still very slow. See the log below:

(screenshot of the training log after switching to CUDA 7.5)

Further, when I changed the RETURNN version to one from December 2016, I ran into some compatibility problems between CUDA 7.5 and Theano 0.8.2. I will check their compatibility and then report the result.