senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano
Apache License 2.0

scoring error #41

Closed sameerkhurana10 closed 6 years ago

sameerkhurana10 commented 6 years ago

Hi,

I am getting the following error while calculating perplexity on the test set:

Mapped name None to device cuda: GeForce GTX TITAN Black (0000:03:00.0)
2018-04-09 11:44:37,415 exception_handler: An unexpected KeyError exception occurred: 'Unable to get link info (bad symbol table node signature)'
Traceback will be written to debug log (enable with --log-level debug).
srun: error: sls-titan-0: task 0: Exited with exit code 2
(theano-lm) sameerk@sls-415-1:/data/sls/qcri/asr/sameer_v1/asr/kaldi-forked/kaldi/egs/mit_qcri/s5_language_modeling/theanolm/recipes/arabic$ srun -p gpu --gres=gpu:1 theanolm score exp/blstm256_voc80k_blstm/nnlm.h5 data/rnnlm_data_all/test.dat --output perplexity --log-level debug
2018-04-09 12:38:40,288 get_default_device: Context None device="GeForce GTX TITAN Black" ID="0000:03:00.0"
2018-04-09 12:38:40,291 from_file: Reading vocabulary from network state.
/data/sls/u/sameerk/anaconda3/envs/theano-lm/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using cuDNN version 6021 on context None
Mapped name None to device cuda: GeForce GTX TITAN Black (0000:03:00.0)
2018-04-09 12:38:40,292 exception_handler: An unexpected KeyError exception occurred: 'Unable to get link info (bad symbol table node signature)'
Traceback will be written to debug log (enable with --log-level debug).
2018-04-09 12:38:40,293 exception_handler: Traceback:
2018-04-09 12:38:40,339 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/bin/theanolm", line 147, in <module>
    main()
2018-04-09 12:38:40,339 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/bin/theanolm", line 88, in main
    args.command_function(args)
2018-04-09 12:38:40,340 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/lib/python3.5/site-packages/theanolm/commands/score.py", line 114, in score
    default_device=default_device)
2018-04-09 12:38:40,340 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/lib/python3.5/site-packages/theanolm/network/network.py", line 280, in from_file
    vocabulary = Vocabulary.from_state(state)
2018-04-09 12:38:40,340 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/lib/python3.5/site-packages/theanolm/vocabulary/vocabulary.py", line 289, in from_state
    if 'words' not in h5_vocabulary:
2018-04-09 12:38:40,340 exception_handler: File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,340 exception_handler: File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,340 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/lib/python3.5/site-packages/h5py/_hl/group.py", line 319, in __contains__
    return self._e(name) in self.id
2018-04-09 12:38:40,340 exception_handler: File "h5py/h5g.pyx", line 441, in h5py.h5g.GroupID.__contains__
2018-04-09 12:38:40,340 exception_handler: File "h5py/h5g.pyx", line 442, in h5py.h5g.GroupID.__contains__
2018-04-09 12:38:40,341 exception_handler: File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,341 exception_handler: File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,341 exception_handler: File "h5py/h5g.pyx", line 511, in h5py.h5g._path_valid
2018-04-09 12:38:40,341 exception_handler: File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,341 exception_handler: File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,341 exception_handler: File "h5py/h5l.pyx", line 212, in h5py.h5l.LinkProxy.exists
srun: error: sls-titan-0: task 0: Exited with exit code 2

Score command:

srun -p gpu --gres=gpu:1 theanolm score exp/blstm256_voc80k_blstm/nnlm.h5 data/rnnlm_data_all/test.dat --output perplexity --log-level debug

Train command:

theanolm train exp/blstm256_voc80k_blstm/nnlm.h5 --training-set data/rnnlm_data_all/transcript.dat --vocabulary data/rnnlm_data_all/input_80000.vocab --vocabulary-format words --sequence-length 25 --batch-size 32 --optimization-method adagrad --stopping-criterion no-improvement --cost cross-entropy --learning-rate 1 --gradient-decay-rate 0.9 --numerical-stability-term 1e-6 --num-noise-samples 1 --noise-distribution unigram --noise-dampening 0.5 --validation-frequency 1 --patience 0 --min-epochs 1 --max-epochs 15 --random-seed 1 --log-level debug --log-interval 200 --gradient-normalization 5 --architecture ../architectures/word-blstm256.arch --validation-file data/rnnlm_data_all/dev.dat

I also checked the size of the model file: it is 161k, which looks suspiciously small.

What does this error mean?

senarvi commented 6 years ago

It seems that the model is corrupted. Looks like the HDF5 library throws a KeyError when trying to read the vocabulary from the model. So the problem is in training, not scoring. Is there something suspicious in the train log?
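One way to check a suspect model file before scoring is to try reading it outside TheanoLM. The sketch below is not part of TheanoLM; the function names `has_hdf5_signature` and `find_hdf5_error` are made up for illustration. It assumes h5py is installed and that the file has no HDF5 user block, so the format signature sits at offset 0. A clean walk does not prove the weights are intact, but a failure here would point at the file rather than at the score command.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # fixed 8-byte HDF5 format signature


def has_hdf5_signature(path):
    """Cheap check: does the file start with the HDF5 signature?

    Assumption: the file has no user block, so the signature is at offset 0.
    """
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def find_hdf5_error(path):
    """Walk every group and dataset in the file.

    Returns an error message if the walk fails, or None if it succeeds.
    """
    import h5py  # imported lazily so has_hdf5_signature() works without h5py

    def touch(name, obj):
        # Reading the shape touches the dataset's metadata.
        if isinstance(obj, h5py.Dataset):
            obj.shape
        return None  # returning None tells visititems() to keep going

    try:
        with h5py.File(path, "r") as model:
            model.visititems(touch)
    except (OSError, KeyError, RuntimeError) as exc:
        # A damaged file may raise any of these, depending on where
        # the damage is (superblock, link table, object header).
        return "{}: {}".format(type(exc).__name__, exc)
    return None
```

Calling `find_hdf5_error("exp/blstm256_voc80k_blstm/nnlm.h5")` on the model from this issue would be the obvious use. If it returns a message, retraining the model is probably the only fix.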

sameerkhurana10 commented 6 years ago

Probably right. The other models are fine. I got a bus error for this model, so I think it has nothing to do with TheanoLM.