senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano
Apache License 2.0

Training is very slow #35

Closed: amrmalkhatib closed this issue 6 years ago

amrmalkhatib commented 6 years ago

I'm training a language model now, but it's very slow: the first epoch alone will take about 7 days to complete. My vocabulary size is about 64K, and the training data set is about 1 million examples. I've set --noise-sharing=batch, but it didn't help. Is it normal for TheanoLM to be this slow during training? What can I do to improve the training speed?

senarvi commented 6 years ago

It can be many things. Do you have a GPU and does Theano recognize it?
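A quick way to check is to look at what Theano reports on startup and which device it is configured to use. A rough sketch, assuming the newer libgpuarray backend (with the old backend the device name would be gpu0 instead of cuda0):

    # Theano should report the mapped device on import, e.g.
    # "Mapped name None to device cuda0: Tesla K80",
    # and print "cuda0" (not "cpu") as the configured device.
    THEANO_FLAGS=device=cuda0,floatX=float32 python -c "import theano; print(theano.config.device)"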

amrmalkhatib commented 6 years ago

Yes, I have a Tesla K80, and yes, Theano recognizes it. I've tested TheanoLM on a small dataset with both CPU and GPU, and of course the GPU is a lot faster. But even with the GPU, the full data set will take 7 days to finish the first epoch.

senarvi commented 6 years ago

What is the model like? 64k is not that big a vocabulary. Do you mean 1 million words or 1 million sentences? Are you using NCE or cross-entropy cost?

amrmalkhatib commented 6 years ago

I mean 1 million sentences, and the vocabulary is 64K.

I've also used the default training settings.

amrmalkhatib commented 6 years ago

Here is the log; you may find it useful:

Constructing vocabulary from training set.
Number of words in vocabulary: 63462
Number of words in shortlist: 63462
Number of word classes: 63462
2017-11-11 19:00:59,379 train: TRAINING OPTIONS
2017-11-11 19:00:59,379 train: batch_size: 16
2017-11-11 19:00:59,379 train: max_annealing_count: 0
2017-11-11 19:00:59,379 train: sequence_length: 100
2017-11-11 19:00:59,380 train: stopping_criterion: annealing-count
2017-11-11 19:00:59,380 train: max_epochs: 100
2017-11-11 19:00:59,380 train: min_epochs: 1
2017-11-11 19:00:59,380 train: validation_frequency: 5
2017-11-11 19:00:59,380 train: patience: 4
2017-11-11 19:00:59,380 train: OPTIMIZATION OPTIONS
2017-11-11 19:00:59,380 train: num_noise_samples: 5
2017-11-11 19:00:59,380 train: learning_rate: 0.1
2017-11-11 19:00:59,380 train: sqr_gradient_decay_rate: 0.999
2017-11-11 19:00:59,380 train: method: adagrad
2017-11-11 19:00:59,380 train: noise_sharing: None
2017-11-11 19:00:59,380 train: epsilon: 1e-06
2017-11-11 19:00:59,380 train: max_gradient_norm: 5
2017-11-11 19:00:59,380 train: gradient_decay_rate: 0.9
2017-11-11 19:00:59,380 train: momentum: 0.9
2017-11-11 19:00:59,381 train: weights: [ 1.]
Creating trainer.
Computing the number of mini-batches in training data.
Computing the number of mini-batches in training data.
2017-11-11 19:17:51,664 init: One epoch of training data contains 519678 mini-batch updates.
2017-11-11 19:17:51,670 init: Class unigram log probabilities are in the range [-inf, -2.262113].
2017-11-11 19:17:51,670 init: Finding sentence start positions in masked_training_data.txt.
2017-11-11 19:17:56,078 _reset: Generating a random order of input lines.
Building neural network.
2017-11-11 19:17:56,689 init: Creating layers.
2017-11-11 19:17:56,689 init: - NetworkInput name=class_input inputs=[] size=63462 activation=tanh devices=[]
2017-11-11 19:17:56,689 init: - ProjectionLayer name=projection_layer inputs=[class_input] size=100 activation=tanh devices=[None]
2017-11-11 19:17:57,252 add: layers/projection_layer/W size=6346200 type=float32 device=None
2017-11-11 19:17:57,252 init: - LSTMLayer name=hidden_layer_1 inputs=[projection_layer] size=300 activation=tanh devices=[None]
2017-11-11 19:17:57,282 add: layers/hidden_layer_1/layer_input/W size=120000 type=float32 device=None
2017-11-11 19:17:57,854 add: layers/hidden_layer_1/step_input/W size=360000 type=float32 device=None
2017-11-11 19:17:57,855 add: layers/hidden_layer_1/layer_input/b size=1200 type=float32 device=None
2017-11-11 19:17:57,855 init: - FullyConnectedLayer name=hidden_layer_2 inputs=[hidden_layer_1] size=300 activation=tanh devices=[None]
2017-11-11 19:17:57,902 add: layers/hidden_layer_2/input/W size=90000 type=float32 device=None
2017-11-11 19:17:57,902 add: layers/hidden_layer_2/input/b size=300 type=float32 device=None
2017-11-11 19:17:57,903 init: - SoftmaxLayer name=output_layer inputs=[hidden_layer_2] size=63462 activation=tanh devices=[None]
2017-11-11 19:17:58,898 add: layers/output_layer/input/W size=19038600 type=float32 device=None
2017-11-11 19:17:58,900 add: layers/output_layer/input/b size=63462 type=float32 device=None
2017-11-11 19:17:58,900 init: Total number of model parameters: 26019762
Building optimizer.
2017-11-11 19:18:10,553 add: layers/hidden_layer_1/layer_input/b_sum_sqr_gradient size=1200 type=float32 device=None
2017-11-11 19:18:10,554 add: layers/hidden_layer_1/step_input/W_sum_sqr_gradient size=360000 type=float32 device=None
2017-11-11 19:18:10,555 add: layers/hidden_layer_1/layer_input/W_sum_sqr_gradient size=120000 type=float32 device=None
2017-11-11 19:18:10,555 add: layers/hidden_layer_2/input/W_sum_sqr_gradient size=90000 type=float32 device=None
2017-11-11 19:18:10,575 add: layers/projection_layer/W_sum_sqr_gradient size=6346200 type=float32 device=None
2017-11-11 19:18:10,575 add: layers/hidden_layer_2/input/b_sum_sqr_gradient size=300 type=float32 device=None
2017-11-11 19:18:10,576 add: layers/output_layer/input/b_sum_sqr_gradient size=63462 type=float32 device=None
2017-11-11 19:18:10,639 add: layers/output_layer/input/W_sum_sqr_gradient size=19038600 type=float32 device=None
Building text scorer for cross-validation.
Validation text: masked_validation_data.txt

senarvi commented 6 years ago

The model is very small except for the input and output layers, so the vocabulary size may still be the bottleneck. The easiest thing to try is hierarchical softmax (replace softmax with hsoftmax in the architecture). You can also try word classes, or a sampling-based softmax (NCE / BlackOut cost).
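Based on the layer names in your log, only the output layer line of the architecture description needs to change. A rough sketch of how that line typically looks (double-check the exact syntax against the architecture file documentation):

    Default output layer (full softmax over the ~64K vocabulary):
        layer type=softmax name=output_layer input=hidden_layer_2
    Hierarchical softmax instead:
        layer type=hsoftmax name=output_layer input=hidden_layer_2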

amrmalkhatib commented 6 years ago

I'm trying hsoftmax now, but I forgot to mention that the sentences are long (between 400 and 500 words).

senarvi commented 6 years ago

OK, so then you have around half a billion words. That's already quite a lot, so the slow epoch is maybe not that surprising. You can also speed it up by using a larger batch size (e.g. 100) and a shorter sequence length (e.g. 25). Recurrent networks cannot parallelize the computation of the words within a sequence, but they can parallelize the computation across different sequences.
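Something along these lines; I'm writing the option names from memory, and "model.h5" just stands for your model file, so check theanolm train --help for the exact spelling:

    theanolm train model.h5 \
        --training-set masked_training_data.txt \
        --validation-file masked_validation_data.txt \
        --batch-size 100 --sequence-length 25

A larger batch with shorter sequences gives the GPU more independent sequences to process in parallel per mini-batch update.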

amrmalkhatib commented 6 years ago

The larger batch size improved the speed. Thank you!