udibr / headlines

Automatically generate headlines to short articles

Suggestions to tackle this MemoryError #6

Open hipoglucido opened 8 years ago

hipoglucido commented 8 years ago

Hello, I am using ami-125b2c72 (g2.2xlarge) with a spot price, as you suggested in another issue (thanks a lot). After struggling a bit with the CUDA drivers I finally got some epochs to run, and I am able to save and load all the training files from S3. I have 1441135 examples. I trained one epoch, saved the weights, stopped the instance, re-ran the script, loaded the weights, trained one more epoch, and then it crashed with the output below. I wonder if you, @udibr, could give me some ideas or intuitions about what my problem is. One of my questions: is the memory error about regular RAM or about GPU memory? (Maybe I could use another instance type.) I also got the warning about "Epoch comprised more than samples_per_epoch samples", but I am not sure whether I should do anything about it.

```
ubuntu@ip-xxxxxx:~/auris$ python train2.py
Using Theano backend.
Using gpu device 0: GRID K520 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 4007)
READING WORD EMBEDDING
('/home/ubuntu/auris//en3_vocabulary-embedding.pkl', ' already downloaded')
('/home/ubuntu/auris//en3_vocabulary-embedding.data.pkl', ' already downloaded')
number of examples 1441135 1441135
dimension of embedding space for words 100
vocabulary size 40000 the last 10 words can be used as place holders for unknown/oov words
total number of different words 1481094 1481094
number of words outside vocabulary which we can substitue using glove similarity 208580
number of words that will be regarded as unknonw(unk)/out-of-vocabulary(oov) 1232514
H: Vuwani schools damage prompts call for new law to punish vandals
D: Department officials on Tuesday briefed MPs on the recovery plans for the protest-ravaged^ Vuwani area in Limpopo . Earlier in May Vuwani residents protested against the creation of a new municipality which will include Malamulele^ residents . The violent protests led to the torching of several schools and other public amenities in the area which disrupted classes .
H: Kathmandu Post- Mitsubishi Motors admits cheating fuel tests since 1991
D: Mitsubishi 's eK^ Wagon^ was one of the models affected Reuters Apr 26 , 2016- Mitsubishi Motors has admitted to falsifying some fuel consumption tests since 1991 . The admission follows last week 's revelation that it had falsified fuel economy data for more than 600,000 vehicles sold in Japan . 'For the domestic market , we have been using that method since 1991 , ' said vice-president Ryugo^ Nakao^ at a press conference in Tokyo on Tuesday .
MODEL
0 cls=Embedding name=embedding_1 40000x100
1 cls=LSTM name=lstm_1 100x512 512x512 512 100x512 512x512 512 100x512 512x512 512 100x512 512x512 512
2 cls=Dropout name=dropout_1
3 cls=LSTM name=lstm_2 512x512 512x512 512 512x512 512x512 512 512x512 512x512 512 512x512 512x512 512
4 cls=Dropout name=dropout_2
5 cls=LSTM name=lstm_3 512x512 512x512 512 512x512 512x512 512 512x512 512x512 512 512x512 512x512 512
6 cls=Dropout name=dropout_3
7 cls=SimpleContext name=simplecontext_1
8 cls=TimeDistributed name=timedistributed_1 944x40000 40000
9 cls=Activation name=activation_1
LOAD
downloading train.hdf5 to /home/ubuntu/auris/train.hdf5: downloaded /home/ubuntu/auris/train.hdf5
Weights downloaded
TEST
.... ....
H: ~ Kate Beckinsale and teenage daughter text naked pictures of Michael Sheen to each other to <0>^ themselves up ' _
D: the Port Talbot actor to each other . The underworld star , who lives in the States with 17-year-old Lily , made the odd revelation
TRAIN
Iteration 0
Epoch 1/1
29952/30000 [============================>.] - ETA: 1s - loss: 7.7543/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py:1403: UserWarning: Epoch comprised more than 'samples_per_epoch' samples, which might affect learning results. Set 'samples_per_epoch' correctly to avoid this warning.
30016/30000 [==============================] - 746s - loss: 7.7548 - val_loss: 7.8132
('Uploaded ', '/home/ubuntu/auris/train.history.pkl', ' succesfully')
('Uploaded ', '/home/ubuntu/auris/train.hdf5', ' succesfully')
HEAD: A Python^ Bit^ A Man 's P
DESC: The man fought to remove
HEADS:
34.5337700502 Syrian , Attaporn^ Attaporn^ at , to
43.0280796466 Former wife.She^ : for hour.Eventually^ wife.She^ to in wife.She^ wife.She^
Iteration 1
Epoch 1/1
29952/30000 [============================>.] - ETA: 1s - loss: 7.7520Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 404, in data_generator_task
    generator_output = next(generator)
  File "train2.py", line 498, in gen
    yield conv_seq_labels(xds, xhs, nflips=nflips, model=model, debug=debug)
  File "train2.py", line 459, in conv_seq_labels
    y = np.zeros((batch_size, maxlenh, vocab_size))
MemoryError

Traceback (most recent call last):
  File "train2.py", line 538, in
    nb_epoch=1, validation_data=valgen, nb_val_samples=nb_val_samples
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.py", line 656, in fit_generator
    max_q_size=max_q_size)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1412, in fit_generator
    max_q_size=max_q_size)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1474, in evaluate_generator
    'or (x, y). Found: ' + str(generator_output))
Exception: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None
```

And as always, thanks for giving us the opportunity of using state-of-the-art machine learning techniques in our own projects :)

udibr commented 8 years ago

It looks like you ran out of RAM on your Amazon instance. You can open a terminal and use htop to watch it happen while the code is running. As a check, reduce your dataset to half its size and see if that solves the problem.
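(For readers following along: if watching an interactive htop session is inconvenient, a few lines of Python can log the same information from inside the training script. This is a minimal sketch assuming the `psutil` package is installed; it is not part of the original train2.py.)

```python
# Minimal sketch of RAM logging from inside the training script.
# Assumes `psutil` is installed (pip install psutil); not part of the repo's code.
import psutil

def log_ram(tag=""):
    vm = psutil.virtual_memory()
    used_gb = (vm.total - vm.available) / 1e9
    print("[%s] RAM used: %.2f GB / %.2f GB (%.0f%%)"
          % (tag, used_gb, vm.total / 1e9, vm.percent))

log_ram("before fit_generator")
```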

hipoglucido commented 8 years ago

Hello @udibr, you are right, I ran out of RAM on my instance (15GB). I reduced the dataset to half its size (750K) but I got exactly the same error, at the same point. It always happens at the end of the 2nd iteration (epoch). The line is model.fit_generator(traingen, samples_per_epoch=nb_train_samples, nb_epoch=1, validation_data=valgen, nb_val_samples=nb_val_samples), when the progress bar looks like 29952/30000 [============================>.]. It seems like at the end of that call there is a peak in RAM usage (does that make sense to you?), but somehow the script survives the end of the first epoch and dies at the end of the second. (I have also checked that the upload of history.pkl and train.hdf5 to S3 is not the problem.) I tried the script on a g2.8xlarge instance (60GB RAM) and it doesn't crash. However, I had to stop it because it is much more expensive than g2.2xlarge (15GB RAM) and I can't afford it. I would like to understand why the bigger the dataset, the higher the RAM consumption. I expected it to be more or less the same, just slower. Is it because of the vocabulary size? My idea is to train a model on a ~7M-article dataset I've collected. I hope there is a way I can do that without using the expensive g2.8xlarge instance type.

For now I am going to run it again with 150K samples and see what happens.

Thanks :+1: :+1:

EDIT: with 150K samples I am getting the same error. I have also tried setting nb_train_samples=15000 and nb_val_samples=1500 and still get the same. I am confused :\
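(A note for readers hitting the same wall: the traceback points at y = np.zeros((batch_size, maxlenh, vocab_size)) inside conv_seq_labels, i.e. a dense one-hot target array allocated for every batch. Its size depends on the batch size, the headline length and the vocabulary, not on how many articles are in the dataset, which is consistent with the error surviving a smaller dataset. A rough back-of-the-envelope sketch; batch_size and maxlenh below are assumed illustrative values, vocab_size=40000 comes from the log above.)

```python
# Rough sketch of the per-batch allocation that fails in conv_seq_labels.
# batch_size and maxlenh are assumed values; vocab_size is taken from the log.
batch_size = 64
maxlenh = 25
vocab_size = 40000

bytes_per_cell = 8  # np.zeros defaults to float64
gb = batch_size * maxlenh * vocab_size * bytes_per_cell / 1e9
print("one-hot target array per batch: %.2f GB" % gb)  # ~0.5 GB with these numbers
# fit_generator also keeps a queue of pre-generated batches (max_q_size),
# so several such arrays plus the model can be alive in RAM at the same time.
```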

udibr commented 8 years ago

Try training with nflips=0, it may help.

hipoglucido commented 8 years ago

Ok. You mean traingen = gen(X_train, Y_train, batch_size=batch_size, nflips=0, model=model), right?

Thanks

udibr commented 8 years ago

That's OK, or change the value assigned to nflips in cell 9 of train.ipynb.

hipoglucido commented 8 years ago

Thanks, I tried it but unfortunately it crashed in the second iteration again. However, this time it did not crash at the end of it:

```
TRAIN
Iteration 0
Epoch 1/1
29952/30000 [============================>.] - ETA: 0s - loss: 7.4030/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py:1432: UserWarning: Epoch comprised more than 'samples_per_epoch' samples, which might affect learning results. Set 'samples_per_epoch' correctly to avoid this warning.
30016/30000 [==============================] - 532s - loss: 7.4035 - val_loss: 7.6468
Building historic...
... historic built
('Writting historic in ', '/home/ubuntu/auris/train.history.pkl', '...')
('Uploaded ', '/home/ubuntu/auris/train.history.pkl', ' succesfully')
('... historic saved in ', '/home/ubuntu/auris/train.history.pkl')
('Saving model weights in ', '/home/ubuntu/auris/train.hdf5', '...')
('Uploaded ', '/home/ubuntu/auris/train.hdf5', ' succesfully')
('... model weights saved in ', '/home/ubuntu/auris/train.hdf5')
Generating samples...
HEAD: Police bar tatooed^ peopl
DESC: A bizarre video has emerg
HEADS:
37.3261257818 The of of for after in after and
... samples generated
Iteration 1
Epoch 1/1
384/30000 [..............................] - ETA: 531s - loss: 7.5555Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 416, in data_generator_task
    generator_output = next(generator)
  File "train.py", line 507, in gen
    yield conv_seq_labels(xds, xhs, nflips=nflips, model=model, debug=debug)
  File "train.py", line 468, in conv_seq_labels
    y = np.zeros((batch_size, maxlenh, vocab_size))
MemoryError
448/30000 [..............................] - ETA: 528s - loss: 7.5818Traceback (most recent call last):
  File "train.py", line 548, in
    nb_epoch=1, validation_data=valgen, nb_val_samples=nb_val_samples
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/models.py", line 661, in fit_generator
    max_q_size=max_q_size)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1387, in fit_generator
    'or (x, y). Found: ' + str(generator_output))
Exception: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None
```

I am going to try it with the TensorFlow backend. Do you think that may be a good idea? If it doesn't resolve the memory problem, maybe it will be worth using g2.8xlarge (4 GPUs), since TensorFlow supports multi-GPU training (Theano doesn't, yet). It is much more expensive, but it should train faster...
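(For anyone trying the backend switch mentioned above: in Keras 1.x the backend can be selected either by editing the "backend" field in ~/.keras/keras.json or by setting the KERAS_BACKEND environment variable before keras is first imported. A small sketch, not taken from the repo:)

```python
# Sketch of selecting the Keras backend programmatically (Keras 1.x).
# The environment variable must be set before the first `import keras`.
import os
os.environ["KERAS_BACKEND"] = "tensorflow"  # or "theano"
import keras  # prints "Using TensorFlow backend." / "Using Theano backend."
```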

hipoglucido commented 8 years ago

I set batch_size=32 and it stopped complaining (using Theano)!

Thanks a lot for your help @udibr :)
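(This fix is consistent with the per-batch arithmetic sketched earlier in the thread: the failing allocation scales linearly with batch_size, so halving the batch roughly halves the peak RAM taken by each queued batch. The values below are the same illustrative assumptions as in the earlier sketch.)

```python
# Illustrative scaling of the failing allocation with batch_size
# (maxlenh assumed, vocab_size from the log; see the earlier sketch).
vocab_size, maxlenh = 40000, 25
for batch_size in (64, 32):
    gb = batch_size * maxlenh * vocab_size * 8 / 1e9  # float64 bytes
    print("batch_size=%d -> %.2f GB per queued batch" % (batch_size, gb))
```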

udibr commented 8 years ago

I'm using Theano, so there could be bugs related to TF that I haven't noticed. I found that Theano runs about 5 times faster than TF on this problem, and all you need to do is install Theano (and don't set Keras to use TF).