uyaseen / theano-recurrence

Recurrent Neural Networks (RNN, GRU, LSTM) and their Bidirectional versions (BiRNN, BiGRU, BiLSTM) for word & character level language modelling in Theano
MIT License

theano-recurrence

This code implements recurrent neural networks (RNN, GRU and LSTM) and their bidirectional versions (BiRNN, BiGRU, BiLSTM) in Python using Theano. The code is generic and can therefore be applied to any sequence-modelling task; as an example, the models are applied here to word- and character-level language modelling.

Dependencies

The models are implemented in Theano; NLTK is used for the text data. Download the required NLTK data once:

```python
import nltk

packages = ['punkt']
nltk.download(packages)
```


Usage

The code structure and the code itself are fairly straightforward and can be understood at a first glance.

Data

All input data is stored inside the data/ directory. Dummy data is already provided; it consists of excerpts from 'Beyond Good and Evil' by Friedrich Nietzsche. Karpathy has provided a few more datasets which are worth trying. If you'd like to use your own data, create a single file input.txt and place it in the data/ directory, i.e. data/input.txt.

Switching between Word & Character Level Language Modelling

utilities/textreader.py provides methods to read the given input data either character by character (read_char_data(..)) or word by word (read_word_data(..)). Word-level language modelling is usually more accurate; however, character-level language modelling often generates more interesting words/patterns. Both of the above methods take a seq_length parameter, which determines the length of each stream, i.e. one training sample. It therefore sets the limit up to which gradients can propagate backwards in time: the model cannot learn dependencies longer than seq_length characters/words.
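As a rough illustration of what such a reader does, the sketch below cuts a corpus into fixed-length (input, target) streams, where each target is the input shifted one step ahead. This is not the repository's actual implementation; the helper name `chunk_chars` is hypothetical.

```python
# Illustrative sketch of cutting a corpus into seq_length streams for
# truncated backpropagation through time (name `chunk_chars` is hypothetical).
def chunk_chars(text, seq_length):
    """Split text into (input, target) pairs of length seq_length,
    where each target is the input shifted one character ahead."""
    pairs = []
    for i in range(0, len(text) - seq_length, seq_length):
        x = text[i:i + seq_length]
        y = text[i + 1:i + 1 + seq_length]
        pairs.append((x, y))
    return pairs

pairs = chunk_chars("the quick brown fox jumps", seq_length=5)
print(pairs[0])  # ('the q', 'he qu')
```

Because gradients never flow across stream boundaries, dependencies longer than seq_length cannot be learned.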

Training

train.py provides a convenient method train(..) to train each model:

- rec_model selects the recurrent model; it is set to gru by default (possible options include rnn, gru, lstm, birnn, bigru & bilstm).
- n_h sets the number of hidden neurons in each layer; the default is 100. At the moment only single-layer models are supported to keep things simple, although adding more layers is straightforward.
- use_existing_model=True resumes training from a previously stored model instead of starting from scratch.
- optimizer selects the optimization strategy; currently supported optimizations are rmsprop, adam and vanilla stochastic gradient descent, and they can be found in utilities/optimizers.py.
- b_path, learning_rate and n_epochs specify the base path to store the model (default = data/models/), the initial learning rate of the optimizer, and the number of epochs, respectively.
- logging_freq controls how often logs (current epoch, sample, cross-entropy error etc.) are shown on the console to give an idea of how well learning is proceeding.

As the model is trained, it stores the current best state of the model, i.e. the set of weights with the least training error, in data/models/MODEL-NAME-best_model.pkl. This stored model can later be used for resuming training from the last point or just for prediction/sampling. At the end of training, a plot of cross-entropy error vs. the number of iterations gives an overview of the overall training process and is also stored in b_path.
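To give a feel for the kind of update rule implemented in utilities/optimizers.py, here is a minimal rmsprop step in plain NumPy. This is an illustrative sketch, not the repository's Theano code, which operates on shared variables instead of arrays:

```python
import numpy as np

# Minimal rmsprop sketch: scale each gradient by a running RMS of past
# gradients, so parameters with consistently large gradients take smaller steps.
def rmsprop_step(param, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * grad ** 2   # running average of squared grads
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

w = np.zeros(3)
cache = np.zeros(3)
g = np.array([0.5, -0.5, 0.0])
w, cache = rmsprop_step(w, g, cache)
```

adam adds a similar running average of the gradients themselves, and vanilla SGD is simply `param -= lr * grad`.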

Sampling

```shell
python sample.py
```

One can sample from the model (during training or from a trained model) via model.generative_sampling(..) by providing: an initial seed, which can be a random element (word/character) from the vocabulary; emb_data, the embeddings of the vocabulary (in our case simply one-hot encodings); and sample_length, the length of the sample itself. The frequency of sampling during training can be specified via sampling_freq in train(..).

Note: in Theano the only efficient way to implement sequence models is scan, which provides a very convenient interface for iterating over tensors. For training everything is fine; the difficulty arises when sampling from the model in cases where the output at every time step is the input to the next time step. There we cannot reuse the scan we used for training, because we would have to call scan multiple times, and each call to scan initializes the hidden-to-hidden state vector h0 from zero. That means that while sampling we would be ignoring the hidden state from previous steps, which is very wrong. The ugly fix is to write another scan that is executed only once per sample: let it run for sample_length steps and make the hidden state and the output at every time step recurrent by specifying them in outputs_info. generative_sampling(..) does exactly this.
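Conceptually, that single sampling scan computes a loop like the one below. This is a plain NumPy sketch of the recurrence, not the Theano implementation, and it uses a simple tanh cell with greedy decoding rather than the repository's GRU/LSTM cells:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8                       # vocabulary size, hidden units
Wxh = rng.normal(0, 0.1, (V, H))  # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))  # hidden-to-hidden weights
Why = rng.normal(0, 0.1, (H, V))  # hidden-to-output weights

def sample(seed_ix, sample_length):
    """Sample a sequence: the output at each step feeds the next step,
    and the hidden state h is carried across all steps (this is what
    passing h and the output in outputs_info achieves in Theano)."""
    h = np.zeros(H)               # h0 is initialized once per sample, not per step
    x = np.eye(V)[seed_ix]        # one-hot seed
    out = [seed_ix]
    for _ in range(sample_length - 1):
        h = np.tanh(x @ Wxh + h @ Whh)
        p = np.exp(h @ Why)
        p /= p.sum()              # softmax over the vocabulary
        ix = int(np.argmax(p))    # greedy choice; sampling from p also works
        x = np.eye(V)[ix]         # output becomes the next input
        out.append(ix)
    return out

print(sample(seed_ix=0, sample_length=10))
```

Calling a fresh scan per step would instead reset h to zero at every step, which is exactly the failure mode described above.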

A few of the samples I got from a GRU trained on 'Beyond Good and Evil' (only 49.4 KB) for 600 epochs:


. and with one's belief in his nothing else is personal something part of yourself--the secont cause and eton end minists in the heart of the sensation of the condition of same will there and to be an

 the seem of secise the  with their lee no matirst goanted and endoldge, one causa proud of have origin of a desister instinct of the superous not is precisely the hearsed the houmst and endon and ear e dlig is me the sone

byither however, let us be of the case of some ontuilly and always a suire, the sain, the entire and wosless in all seriousnel to be out imperpopsysible that it is unveittendes a

The samples might not look very impressive; however, note that this is just character-level language modelling (the model reads one character at a time, and only 50 characters at once). The model does learn some words, some combinations of words and some linguistic patterns, such as closing quotes, commas, full stops and new lines. Also, I was only able to train on a very small sample of 49.4 KB; more data and a longer sequence length will definitely result in more interesting samples/sentences.

And here's the error plot (the error was still decreasing, but I had to stop training as I did not want to burden my poor laptop beyond its capacity). GRU Error Plot

Note

If you are interested in minimal code, then browse to an older version of this repository.