luheng / deep_srl

Code and pre-trained model for: Deep Semantic Role Labeling: What Works and What's Next
Apache License 2.0

Training does not start #17

Open acDante opened 5 years ago

acDante commented 5 years ago

Hi Luheng, thanks for your great work! I ran into a strange problem during training. I used the following command to start training your model:

```
python python/train.py --config=./config/srl_config.json --model=./output --train=./sample_data/sentences_with_gold.txt --dev=./sample_data/sentences_with_gold.txt --task=srl
```

And I got the following output in the terminal:

```
/scratch/users/duxi/miniconda3/envs/deep_srl/lib/python2.7/site-packages/theano/gpuarray/dnn.py:135: UserWarning: Your cuDNN version is more recent than Theano. If you encounter problems, try updating Theano or downgrading cuDNN to version 5.1.
  warnings.warn("Your cuDNN version is more recent than "
Using cuDNN version 6021 on context None
Mapped name None to device cuda: GeForce GTX TITAN X (0000:04:00.0)
Task: srl
Embedding size=100
Extracting features
Extraced 19 words and 9 tags
Max training sentence length: 9
Max development sentence length: 9
Warning: not using official gold predicates. Not for formal evaluation.
Dev data has 1 batches.
Data loading duration was 0:00:14.
[WARNING] Log directory ./output is not empty, previous checkpoints might be overwritten
Preparation duration was 0:00:00.
Using 2 feature types, projected output dim=200.
('lstm_0_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe5782bab50>
('lstm_1_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe5782620d0>
('lstm_2_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe56bf82c10>
('lstm_3_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe570087f90>
('lstm_4_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe5781912d0>
('lstm_5_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe570090f90>
('lstm_6_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe57809b590>
('lstm_7_rdrop', 0.1, True)
<neural_srl.theano.layer.HighwayLSTMLayer object at 0x7fe578203f90>
embedding_0 embedding_0 [ 19 100]
embedding_1 embedding_1 [ 2 100]
lstm_0_W lstm_0_W [ 200 1800]
lstm_0_U lstm_0_U [ 300 1500]
lstm_0_b lstm_0_b [1800]
lstm_1_W lstm_1_W [ 300 1800]
lstm_1_U lstm_1_U [ 300 1500]
lstm_1_b lstm_1_b [1800]
lstm_2_W lstm_2_W [ 300 1800]
lstm_2_U lstm_2_U [ 300 1500]
lstm_2_b lstm_2_b [1800]
lstm_3_W lstm_3_W [ 300 1800]
lstm_3_U lstm_3_U [ 300 1500]
lstm_3_b lstm_3_b [1800]
lstm_4_W lstm_4_W [ 300 1800]
lstm_4_U lstm_4_U [ 300 1500]
lstm_4_b lstm_4_b [1800]
lstm_5_W lstm_5_W [ 300 1800]
lstm_5_U lstm_5_U [ 300 1500]
lstm_5_b lstm_5_b [1800]
lstm_6_W lstm_6_W [ 300 1800]
lstm_6_U lstm_6_U [ 300 1500]
lstm_6_b lstm_6_b [1800]
lstm_7_W lstm_7_W [ 300 1800]
lstm_7_U lstm_7_U [ 300 1500]
lstm_7_b lstm_7_b [1800]
softmax_W softmax_W [300 9]
softmax_b softmax_b [9]
```
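The shapes in that printout at least look internally consistent to me (the lstm_0 input of 200 is the two 100-dim feature embeddings concatenated, and the U matrices suggest a hidden size of 300), so the model seems to be built correctly and the process appears to stall only afterwards. As a quick sanity check, here is a back-of-the-envelope count of the parameters those shapes imply; the shapes are copied from the log above, nothing here comes from the repo's code:

```python
# Parameter count implied by the shapes printed in the training log above.
# The loop just avoids writing out lstm_1 ... lstm_7, which all share shapes.
import numpy as np

shapes = {
    "embedding_0": (19, 100),
    "embedding_1": (2, 100),
    "softmax_W": (300, 9),
    "softmax_b": (9,),
}
for i in range(8):
    in_dim = 200 if i == 0 else 300  # lstm_0 takes the 2 x 100-dim embeddings
    shapes["lstm_%d_W" % i] = (in_dim, 1800)
    shapes["lstm_%d_U" % i] = (300, 1500)
    shapes["lstm_%d_b" % i] = (1800,)

total = sum(int(np.prod(s)) for s in shapes.values())
print("total trainable parameters: %d" % total)
```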

After the output above I never got any further messages in the terminal, and the file "./output/checkpoints.tsv" remained empty even after training had been running for a long time. It seems the training makes no progress at all. I am not sure whether this is a GPU-specific issue: I am using CUDA 8.0 with cuDNN 6 (version 6021, per the log above), and here is my Theano configuration file:

```
[global]
device = cuda
floatX = float64
mode = FAST_RUN

[cuda]
root=/usr/local/cuda-8.0/

[dnn]
enable=True
include_path=/usr/local/cuda-8.0/include
library_path=/usr/local/cuda-8.0/lib64

[lib]
cnmem = 0.8

[nvcc]
fastmath = True

[gcc]
cxxflags=-Wno-narrowing
```
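In case it helps narrow things down, the small "testing the GPU" script from the Theano documentation should show whether this .theanorc actually compiles elementwise ops for the GPU or silently falls back to the CPU. The snippet below is adapted from those docs; timings are machine-dependent:

```python
# Adapted from the Theano docs' GPU test: prints the compiled graph and
# whether the exp() op ran on the GPU or the CPU under the .theanorc above.
import time

import numpy
from theano import function, config, shared, tensor

vlen = 10 * 30 * 768  # roughly 10 x #cores x #threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())

t0 = time.time()
for _ in range(iters):
    r = f()
print("Looping %d times took %f seconds" % (iters, time.time() - t0))

if numpy.any([isinstance(node.op, tensor.Elemwise) and
              ("Gpu" not in type(node.op).__name__)
              for node in f.maker.fgraph.toposort()]):
    print("Used the CPU")
else:
    print("Used the GPU")
```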

Could you give me any ideas about the potential cause?

Huijun-Cui commented 5 years ago

Hi, did you solve this problem? I have also run into it.

acDante commented 5 years ago

I guess this problem is caused by an incompatible GPU configuration, but I did not manage to solve it. I would suggest taking a look at AllenNLP; this model is also implemented there.
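For reference, loading a pre-trained SRL model through AllenNLP looks roughly like the sketch below; the model archive URL is a placeholder, so check AllenNLP's model listings for the current one:

```python
# Rough sketch of running SRL via AllenNLP's pre-trained predictor.
# The archive URL is a placeholder -- substitute the current SRL model
# archive from the AllenNLP model listings.
from allennlp.predictors.predictor import Predictor

SRL_MODEL_ARCHIVE = "https://.../srl-model.tar.gz"  # placeholder

predictor = Predictor.from_path(SRL_MODEL_ARCHIVE)
result = predictor.predict(
    sentence="The keys, which were needed to access the building, were locked in the car."
)
for verb in result["verbs"]:
    # Each entry has the predicate and a bracketed argument description.
    print(verb["verb"], "->", verb["description"])
```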