senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano
Apache License 2.0
81 stars 29 forks

TheanoLM is not using the GPU or multiple threads and is suddenly killed by the kernel. #28

Closed DemoVersion closed 7 years ago

DemoVersion commented 7 years ago

Hi, I've set the THEANO_FLAG (export THEANO_FLAGS=floatX=float32,device=cuda0) and the pygpu tests were successful, but TheanoLM is not using the GPU during training. After about two hours with no progress, the process was killed by the Linux kernel for extreme resource starvation.
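For reference, the pygpu self-test mentioned above can be run roughly like this (a minimal sketch, assuming the libgpuarray Python bindings are installed):

import pygpu
# runs the pygpu/libgpuarray test suite; passing tests indicate that the
# GPU array backend itself works
pygpu.test()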

I've tried training multiple times. With a small amount of data and CPU training it was successful, but with the real data (about 3 GB), word classes, and GPU processing, it failed every time.

Reading vocabulary from 3gd.classes.txt.
Computing class membership probabilities from unigram word count.
Number of words in vocabulary: 293317
Number of word classes: 1002
Creating trainer.
Computing unigram probabilities and the number of mini-batches in training data.
Building neural network.
/home/nlpuser/.local/lib/python3.5/site-packages/theano/sandbox/rng_mrg.py:1522: UserWarning: MRG_RandomStreams.multinomial_wo_replacement() is deprecated and will be removed in the next release of Theano. Please use MRG_RandomStreams.choice() instead.
  warnings.warn('MRG_RandomStreams.multinomial_wo_replacement() is '
/home/nlpuser/.local/lib/python3.5/site-packages/theano/sandbox/rng_mrg.py:1522: UserWarning: MRG_RandomStreams.multinomial_wo_replacement() is deprecated and will be removed in the next release of Theano. Please use MRG_RandomStreams.choice() instead.
  warnings.warn('MRG_RandomStreams.multinomial_wo_replacement() is '
Compiling optimization function.
Building text scorer for cross-validation.
Validation text: bnm_test_1000.txt
Training neural network.
2017-05-07 15:23:38,821 _log_update: [1000] (0.1 %) of epoch 1 -- lr = 0.1, duration = 222.7 ms
Killed

Could anyone tell me what's wrong with this language model training? In case it's limited by TheanoLM's features, could you recommend where to start modifying the code?

senarvi commented 7 years ago

Hi.

In the log that you pasted, it says "from unigram word count." while it should say "from unigram word counts.":

https://github.com/senarvi/theanolm/blob/master/theanolm/commands/train.py#L234

That made me wonder whether you have an old version of TheanoLM that still has the typo in that sentence. I'm not sure whether that would matter, but it's better to use the most recent version just in case. Also make sure that you're using the latest Theano.

I haven't encountered the exact problem that you describe, but whenever I've seen Linux kill a process like that, it has been because the process allocates too much memory. That can happen when you use a very large batch size and/or network. What network architecture are you using?

The training does in fact progress between 1000 and 2000 mini-batch updates, but one update takes more than 200 ms, which is very slow. That would indeed indicate that the GPU is not being used. Also, when the program starts, Theano should print something like this:

Using cuDNN version 5103 on context None
Mapped name None to device cuda0: Tesla K80 (0000:06:00.0)

Whether Theano uses the CPU or the GPU is not controlled by TheanoLM in any way. Have you tried running the test program from the Theano documentation that checks whether the GPU is used:

http://deeplearning.net/software/theano/tutorial/using_gpu.html
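The test program on that page is roughly the following (reproduced here for convenience; it times an element-wise exp() and reports which device executed it):

import time
import numpy
from theano import function, config, shared, tensor

vlen = 10 * 30 * 768  # 10 x number of cores x threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
# if any element-wise op in the compiled graph is a plain (non-GPU)
# Elemwise, the computation ran on the CPU
if numpy.any([isinstance(node.op, tensor.Elemwise) and
              ('Gpu' not in type(node.op).__name__)
              for node in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

With device=cuda0 picked up correctly, the printed graph should contain GpuElemwise nodes and the script should print "Used the gpu".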

I don't see any problem in the flags as you wrote them, but at the beginning of your message you misspell the variable as THEANO_FLAG. Could that be the problem?
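A quick way to check what Theano actually picked up is a two-line snippet (a minimal check, not TheanoLM-specific):

import theano
# with the flags above in effect, this should print cuda0 and float32
print(theano.config.device)
print(theano.config.floatX)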

If you still have problems, could you run TheanoLM with --log-level=debug? Maybe the more verbose output will give me more ideas.

DemoVersion commented 7 years ago

Thank you for your prompt response to the issue.

Fortunately, the problem is solved. The details below are for anyone experiencing the same problem.

I was setting $THEANO_FLAGS correctly, but I was running TheanoLM with sudo, which makes the variable invisible to the process, since sudo resets the environment by default. If you want to run TheanoLM as root, switch to the root user and set the environment variables there.
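A minimal way to see this effect (env_check.py is a hypothetical file name):

import os
# prints the flags this process actually sees; under plain sudo this
# typically prints '<not set>', because sudo resets the environment
print(os.environ.get('THEANO_FLAGS', '<not set>'))

Running python3 env_check.py as a normal user prints the flags, while sudo python3 env_check.py does not, unless you preserve the environment (for example with sudo -E) or set the variables in root's own shell.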

The cuDNN library paths should be added to $PATH, $CPATH, $LD_LIBRARY_PATH, and $LIBRARY_PATH. You can do this by executing the commands below (add them to your .bashrc to make the changes permanent):

export PATH=$PATH:/usr/local/cuda-8.0/bin
export CPATH=$CPATH:/usr/local/cuda-8.0/include
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64
export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/cuda-8.0/lib64
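As a final sanity check (a hypothetical snippet, not part of TheanoLM), you can confirm in Python that the exports took effect in the shell that launches training:

import os
# each search path should now contain the CUDA 8.0 directories
for name in ('PATH', 'CPATH', 'LD_LIBRARY_PATH', 'LIBRARY_PATH'):
    value = os.environ.get(name, '')
    print(name, 'includes cuda-8.0:', 'cuda-8.0' in value)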