keras-team / keras

Deep Learning for humans
http://keras.io/

How does keras model.fit() work on LSTM training #2701

Closed twangnh closed 7 years ago

twangnh commented 8 years ago

Hi! I've recently been using the Keras LSTM, but I'm puzzled by an implementation detail of the model.fit() function: how does it train an LSTM? Let me put it concretely with the official LSTM text-generation example below:

```python
'''Example script to generate text from Nietzsche's writings.
At least 20 epochs are required before the generated text starts sounding coherent.
It is recommended to run this script on GPU, as recurrent networks are quite
computationally intensive. If you try this script on new data, make sure your
corpus has at least ~100k characters. ~1M is better.
'''

from __future__ import print_function
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
from keras.utils.data_utils import get_file

import numpy as np
import random
import sys

path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")

text = open(r"C:\SciSoft\my_code\train.txt").read().lower()  # raw string keeps the backslashes literal
print('corpus length:', len(text))

chars = set(text)
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

# build the model: 2 stacked LSTM
print('Build model...')
model = Sequential()
# the first LSTM must return sequences so the second LSTM receives 3D input
model.add(LSTM(200, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')


def sample(a, temperature=1.0):
    # helper function to sample an index from a probability array
    a = np.log(a) / temperature
    a = np.exp(a) / np.sum(np.exp(a))
    return np.argmax(np.random.multinomial(1, a, 1))


# train the model, output generated text after each iteration
for iteration in range(1, 60):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y, batch_size=128, nb_epoch=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
```

there is "model.fit(X, y, batch_size=128, nb_epoch=1) ", X is sentences, y is the next char corresponding to each sentences of X, but in standard backpropagation, the target y is a sequence as long as the input X, like I showed below:(it's from character-level rnn by Andrej karpathy) inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]] targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]] so how does keras model.fit() work for rnn training ? hope someone can help me, thanks in advance!

carlthome commented 8 years ago

Could you clarify your particular question?

twangnh commented 8 years ago

I mean that when training an LSTM or RNN we use standard forward and backward propagation, and in each pass the training input X and the target Y are two sequences of the same length. But in the Keras LSTM text-generation example, in the model.fit() call, each X is a sequence of characters while the corresponding Y is only a single character. I'm confused by this and don't know whether it works in the standard way. Thank you.

carlthome commented 8 years ago

The RNN (an LSTM is a specific type of RNN) in the example feeds only the last timestep's output into a fully-connected layer with as many outputs as there are distinct characters, followed by a softmax. The loss is categorical crossentropy, so this is a classification problem: the model maps the input text sequence to the most likely following character. Does that make sense?
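To make that many-to-one setup concrete, here is a minimal sketch (layer sizes, `n_chars`, and the dummy data are illustrative, not from the thread; it assumes the Keras 1.x-era API used above). Each training sample is one whole input sequence paired with a single one-hot next-character target, and `fit()` still unrolls the LSTM over all `maxlen` timesteps and backpropagates through time to that one output:

```python
# Minimal many-to-one sketch (illustrative sizes, Keras 1.x-era API as in this thread).
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.recurrent import LSTM

maxlen, n_chars, n_samples = 40, 57, 1000            # hypothetical sizes
X = np.zeros((n_samples, maxlen, n_chars))            # one-hot input sequences
y = np.zeros((n_samples, n_chars))                    # one one-hot next character per sequence
for i in range(n_samples):                            # fill with random dummy data
    X[i, np.arange(maxlen), np.random.randint(n_chars, size=maxlen)] = 1
    y[i, np.random.randint(n_chars)] = 1

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, n_chars)))   # only the last timestep's output is returned
model.add(Dense(n_chars))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# The loss is computed once per sample, on the single softmax output, and the
# gradient flows back through every one of the maxlen unrolled timesteps (BPTT).
model.fit(X, y, batch_size=128, nb_epoch=1)
```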

carlthome commented 8 years ago

Also, please use code highlighting when posting examples.

twangnh commented 8 years ago

Thank you, but I mean the way Keras computes the forward and backward passes. Just look at an example: here is a standard RNN implementation written by Andrej Karpathy (@karpathy), which you can find on GitHub:

```python
"""
Minimal character-level Vanilla RNN model. Written by Andrej Karpathy (@karpathy)
BSD License
"""
import numpy as np

# data I/O
data = open('yahoo.txt', 'r').read()  # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print 'data has %d characters, %d unique.' % (data_size, vocab_size)
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

# hyperparameters
hidden_size = 100  # size of hidden layer of neurons
seq_length = 40    # number of steps to unroll the RNN for
learning_rate = 1e-1

# model parameters
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden to output
bh = np.zeros((hidden_size, 1))  # hidden bias
by = np.zeros((vocab_size, 1))   # output bias


def lossFun(inputs, targets, hprev):
    """
    inputs, targets are both lists of integers.
    hprev is Hx1 array of initial hidden state
    returns the loss, gradients on model parameters, and last hidden state
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0
    # forward pass
    for t in xrange(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))  # encode in 1-of-k representation
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh)  # hidden state
        ys[t] = np.dot(Why, hs[t]) + by  # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # probabilities for next chars
        loss += -np.log(ps[t][targets[t], 0])  # softmax (cross-entropy loss)
    # backward pass: compute gradients going backwards
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(xrange(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1  # backprop into y
        dWhy += np.dot(dy, hs[t].T)
        dby += dy
        dh = np.dot(Why.T, dy) + dhnext   # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh  # backprop through tanh nonlinearity
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t-1].T)
        dhnext = np.dot(Whh.T, dhraw)
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)  # clip to mitigate exploding gradients
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]


def sample(h, seed_ix, n):
    """
    sample a sequence of integers from the model
    h is memory state, seed_ix is seed letter for first time step
    """
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for t in xrange(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        y = np.dot(Why, h) + by
        p = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes


n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by)  # memory variables for Adagrad
smooth_loss = -np.log(1.0 / vocab_size) * seq_length  # loss at iteration 0
while True:
    # prepare inputs (we're sweeping from left to right in steps seq_length long)
    if p + seq_length + 1 >= len(data) or n == 0:
        hprev = np.zeros((hidden_size, 1))  # reset RNN memory
        p = 0  # go from start of data
        v = np.random.randint(p, p + seq_length)
    inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
    targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

    # sample from the model now and then
    if n % 100 == 0:
        sample_ix = sample(hprev, inputs[v], 200)
        txt = ''.join(ix_to_char[ix] for ix in sample_ix)
        print '----\n %s \n----' % (txt, )

    # forward seq_length characters through the net and fetch gradient
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
    if n % 100 == 0:
        print 'iter %d, loss: %f' % (n, loss)  # print progress

    # perform parameter update with Adagrad
    for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                  [dWxh, dWhh, dWhy, dbh, dby],
                                  [mWxh, mWhh, mWhy, mbh, mby]):
        mem += dparam * dparam
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8)  # adagrad update

    p += seq_length  # move data pointer
    n += 1           # iteration counter
```

As you can see, the training pairs are:

`inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]`
`targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]`

But in the Keras model.fit() call, for each training pair X is a sequence of characters while Y is only a single character. I don't know whether Keras does the forward and backward propagation in the same way, and if so, how. Thank you!
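For comparison, the Karpathy-style per-timestep targets can also be expressed in Keras; the two formulations differ in what the model is asked to predict at each step, not in how backpropagation through time works. Below is a minimal sketch (not from this thread; layer sizes and the dummy data are illustrative, assuming the Keras 1.x-era API used above) where `return_sequences=True` plus `TimeDistributed(Dense)` produce one softmax per timestep, so the target Y is a full sequence shifted by one character:

```python
# Many-to-many sketch (illustrative sizes, Keras 1.x-era API): one prediction per timestep.
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import TimeDistributed

maxlen, n_chars, n_samples = 40, 57, 1000                 # hypothetical sizes
X = np.zeros((n_samples, maxlen, n_chars))                 # one-hot input sequences
Y = np.zeros((n_samples, maxlen, n_chars))                 # one-hot target sequences (input shifted by one char)
for i in range(n_samples):                                 # fill with random dummy data
    X[i, np.arange(maxlen), np.random.randint(n_chars, size=maxlen)] = 1
    Y[i, np.arange(maxlen), np.random.randint(n_chars, size=maxlen)] = 1

model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(maxlen, n_chars)))  # output at every timestep
model.add(TimeDistributed(Dense(n_chars)))                 # same Dense weights applied to each timestep
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# The categorical crossentropy is now accumulated over all maxlen timesteps,
# matching the inputs/targets pairing in the min-char-rnn code above.
model.fit(X, Y, batch_size=128, nb_epoch=1)
```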

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.