lisa-groundhog / GroundHog

Library for implementing RNNs with Theano
BSD 3-Clause "New" or "Revised" License

Error running DT_RNN_Tut.py #22

Closed ndronen closed 9 years ago

ndronen commented 9 years ago

Since generate.py is supposed to create npz files that are compatible with the tutorial scripts, I was expecting the following commands to work (after pointing DT_RNN_Tut.py at input_chars.npz and input_chars_dict.npz, of course).

python generate.py --dest input_chars --level chars PATH_TO_TEXT_COMPRESSION_BENCHMARK
python DT_RNN_Tut.py

But it doesn't work. See below for details of what I'm doing and what I'm seeing. I have fresh versions of Theano and GroundHog from GitHub.

Is this pilot error on my part? Should I change something else in DT_RNN_Tut.py? The error, IndexError: index 69 is out of bounds for size 50, tracks the size of the embedding table. If I change state['n_in'] from 50 to 51, the IndexError message changes accordingly:

# declare the dimensionalities of the input and output
if state['chunks'] == 'words':
    state['n_in'] = 10000
    state['n_out'] = 10000
else:
    state['n_in'] = 50
    state['n_out'] = 50
train_data, valid_data, test_data = get_text_data(state)

Similarly, if I switch from chars to words, the error becomes IndexError: index 33223 is out of bounds for size 10000, reflecting the size of the word embedding table.
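
For reference, here is a quick sketch of how one could inspect the generated archives to see how many distinct symbols the data actually uses. The key names inside the npz files are whatever generate.py chose to write, so the snippet just dumps every array it finds:

import numpy as np

# Sketch: dump the arrays generate.py wrote and report the largest symbol id.
# Nothing here assumes a particular key layout inside the archives.
for fname in ('input_chars.npz', 'input_chars_dict.npz'):
    archive = np.load(fname)
    print(fname)
    for key in archive.files:
        arr = archive[key]
        if np.issubdtype(arr.dtype, np.integer):
            # max id + 1 is the vocabulary size the model has to cover
            print('  %s: shape %s, max id %d' % (key, arr.shape, arr.max()))
        else:
            print('  %s: shape %s, dtype %s' % (key, arr.shape, arr.dtype))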

Thanks!

I have train, valid, and test files built from enwik8 (http://mattmahoney.net/dc/textdata.html).

$ wc -l ~/proj/benchmarks/large-text-compression/{train,test,valid}
   44843 /home/ndronen/proj/benchmarks/large-text-compression/train
   36655 /home/ndronen/proj/benchmarks/large-text-compression/test
   44843 /home/ndronen/proj/benchmarks/large-text-compression/valid
  126341 total

Running generate.py results in no errors, and the files input_chars.npz and input_chars_dict.npz are created.

$ python generate.py --dest input_chars --level chars ~/proj/benchmarks/large-text-compression/
Constructing the vocabulary ..
 .. sorting words
 .. shrinking the vocabulary size
EOL 0
Constructing train set
Constructing valid set
Constructing test set
Saving data
... Done
$ file input_chars*
input_chars_dict.npz: Zip archive data, at least v2.0 to extract
input_chars.npz:      Zip archive data, at least v2.0 to extract
$ git diff DT_RNN_Tut.py 
diff --git a/tutorials/DT_RNN_Tut.py b/tutorials/DT_RNN_Tut.py
index e6e83d8..c17d425 100644
--- a/tutorials/DT_RNN_Tut.py
+++ b/tutorials/DT_RNN_Tut.py
@@ -298,8 +298,10 @@ if __name__=='__main__':
     state = {}
     # complete path to data (cluster specific)
     state['seqlen'] = 100
-    state['path']= "/data/lisa/data/PennTreebankCorpus/pentree_char_and_word.npz"
-    state['dictionary']= "/data/lisa/data/PennTreebankCorpus/dictionaries.npz"
+    #state['path']= "/data/lisa/data/PennTreebankCorpus/pentree_char_and_word.npz"
+    state['path']= 'input_chars.npz'
+    #state['dictionary']= "/data/lisa/data/PennTreebankCorpus/dictionaries.npz"
+    state['dictionary']= 'input_chars_dict.npz'
     state['chunks'] = 'chars'
     state['seed'] = 123
$ python DT_RNN_Tut.py
Using gpu device 0: GeForce GTX TITAN Black
data length is  9979512
data length is  9979512
data length is  8838862
/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/scan_module/scan_perform_ext.py:133: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility
  from scan_perform.scan_perform import *
/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/sandbox/rng_mrg.py:1195: UserWarning: MRG_RandomStreams Can't determine #streams from size (Elemwise{Cast{int32}}.0), guessing 60*256
  nstreams = self.n_streams(size)
Constructing grad function
Compiling grad function
took 0.283576965332
Validation computed every 1000
GPU status : Used 110.398 Mb Free 6033.414 Mb,total 6143.812 Mb [context start]
Saving the model...
Model saved, took 0.161453008652
Traceback (most recent call last):
  File "DT_RNN_Tut.py", line 418, in 
    jobman(state, None)
  File "DT_RNN_Tut.py", line 293, in jobman
    main.main()
  File "/home/ndronen/proj/GroundHog/groundhog/mainLoop.py", line 293, in main
    rvals = self.algo()
  File "/home/ndronen/proj/GroundHog/groundhog/trainer/SGD_momentum.py", line 159, in __call__
    rvals = self.train_fn()
  File "/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/compile/function_module.py", line 605, in __call__
    self.fn.thunks[self.fn.position_of_error])
  File "/home/ndronen/.local/lib/python2.7/site-packages/Theano-0.6.0-py2.7.egg/theano/compile/function_module.py", line 595, in __call__
    outputs = self.fn()
IndexError: index 69 is out of bounds for size 50
Apply node that caused the error: AdvancedSubtensor1(Elemwise{add,no_inplace}.0, x)
Inputs types: [TensorType(float32, matrix), TensorType(int64, vector)]
Inputs shapes: [(50, 400), (100,)]
Inputs strides: [(1600, 4), (8,)]
Inputs scalar values: ['not scalar', 'not scalar']

Backtrace when the node is created:
  File "/home/ndronen/proj/GroundHog/groundhog/utils/utils.py", line 177, in dot
    return matrix[inp]

Debugprint of the apply node:
AdvancedSubtensor1 [@A]  ''
 |Elemwise{add,no_inplace} [@B]  ''
 | |HostFromGpu [@C]  ''
 | | |W_0_emb_words [@D] 
 | |HostFromGpu [@E]  ''
 |   |noise_W_0_emb_words [@F] 
 |x [@G] 
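
For what it's worth, the failing op is just the embedding lookup: a (50, 400) weight matrix indexed by a vector of 100 symbol ids, one of which is 69. Here is a stripped-down numpy illustration of the same failure, using only the shapes from the traceback above (not GroundHog code):

import numpy as np

# Embedding table with state['n_in'] = 50 rows, as in the Apply node above.
W_emb = np.zeros((50, 400), dtype='float32')
# Symbol ids coming from the data; 69 is a legitimate character id there,
# but the table only has rows 0..49.
ids = np.array([3, 12, 69])
rows = W_emb[ids]  # raises IndexError: only rows 0..49 exist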
ndronen commented 9 years ago

I've started getting similar, unexpected Theano errors when running simple pylearn2 scripts. I'm going to close this, as it's likely either a Theano bug or a hardware problem.

ndronen commented 9 years ago

My mistake. The errors from the pylearn2 scripts were unrelated to this problem. I'll reopen this issue.

rizar commented 9 years ago

The 'n_in' and 'n_out' state attributes stand for the number of possible inputs/outputs. The values 50 and 10000 correspond to the Penn Treebank dataset. If you use different data, these numbers should be changed accordingly.
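
For example, they could be derived from the generated data instead of hard-coded; this is only a sketch, and it assumes the npz produced by generate.py holds integer id arrays:

import numpy as np

# Sketch: set the vocabulary size from the data generate.py produced,
# rather than keeping the Penn Treebank values (50 chars / 10000 words).
data = np.load('input_chars.npz')
vocab_size = 1 + max(int(data[k].max()) for k in data.files
                     if np.issubdtype(data[k].dtype, np.integer))

state = {}  # stand-in; in DT_RNN_Tut.py this dict already exists
state['n_in'] = vocab_size   # number of possible input symbols
state['n_out'] = vocab_size  # number of possible output symbols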

ndronen commented 9 years ago

Thanks. It runs now. For some reason that possibility didn't register in my mind at all. I had been thinking those options were related to the dimensions of the hidden layers.