keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

why lstm loss is NaN for pre-trained word2vec #1360

Closed liyi193328 closed 7 years ago

liyi193328 commented 8 years ago

I'm new to Theano and Keras and want to learn them; I find them very interesting and helpful. The following question has confused me for about a week, and I can't work it out even after trying several approaches mentioned in earlier issues. I want to do sentiment analysis on texts with three classes. I trained word2vec (dim = 600) with gensim. My training data is 10475 sequences of different lengths, and the label shape is (10475, 3). After setting the maximum sequence length to 200, every sequence is converted to a 200x600 2D array. If a sequence is shorter than 200, the remaining values are filled with 0 (padding), so some rows are all zeros. I then feed these arrays into an LSTM.

The LSTM code is as follows:

    # optimizers with gradient clipping / larger epsilon (note: these instances are
    # never used below, since the string 'adam' is passed to compile() instead)
    sgd = SGD(lr=0.001, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=0.3)
    rmsprop = RMSprop(clipnorm=0.1, epsilon=5e-04)
    adam = Adam(epsilon=1e-03, clipnorm=0.1)

    model = Sequential()
    model.add(LSTM(output_dim=300, input_length=200, input_dim=600))
    # model.add(Dropout(0.5))
    model.add(BatchNormalization(epsilon=1e-04))
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))
    model.compile(loss='mean_squared_error',
                  optimizer='adam', class_mode="categorical")

    model.fit(train, label, batch_size=100, nb_epoch=4, verbose=1, shuffle=True,
              validation_split=0.1, show_accuracy=True)

But I get:

loss: nan

    Train on 9430 samples, validate on 1048 samples
    Epoch 1/4
    9430/9430 [==============================] - 99s - loss: nan - acc: 0.2992 - val_loss: nan - val_acc: 0.1355
    Epoch 2/4
    9430/9430 [==============================] - 96s - loss: nan - acc: 0.2992 - val_loss: nan - val_acc: 0.1355
    Epoch 3/4
    9430/9430 [==============================] - 96s - loss: nan - acc: 0.2992 - val_loss: nan - val_acc: 0.1355
    Epoch 4/4
    1600/9430 [====>.........................] - ETA: 75s - loss: nan - acc: 0.3038

I have tried different optimizers, increased the epsilon value, set clipnorm (on the optimizers above), and used different loss functions ('mean_squared_error', 'categorical_crossentropy'), but nothing helped.

The loss is NaN in both CPU and GPU mode.

Even when I switch to Convolution2D:

    nb_feature_maps = 120
    n_gram = 10

    model = Sequential()
    model.add(Convolution2D(nb_filter=nb_feature_maps, nb_row=n_gram, nb_col=600, input_shape=(1, 200, 600)))
    model.add(Activation('relu'))

    model.add(MaxPooling2D(pool_size=(maxlen - n_gram + 1, 1)))
    model.add(Dropout(0.25))

    model.add(Flatten())
    model.add(Dense(128))
    model.add(Activation('tanh'))
    model.add(Dropout(0.5))
    model.add(Dense(3))
    model.add(Activation('softmax'))

    model.compile(loss='mean_squared_error',
                  optimizer='sgd', class_mode="categorical")

The loss values remain NaN.

Ways to solve?

So I'm wondering: what is the real reason for the NaN loss, and how can I solve or debug it? Is the word2vec data wrong, is the padding method wrong, or is it something else? If Keras can't handle this I will have to choose another deep learning package, or is Theano the reason? What should I do then? Please help.

jgc128 commented 8 years ago

Are you sure you got a NaN loss with categorical_crossentropy? What do your labels look like?

> resulting some rows are all zeroes

You shouldn't have rows that are all zeros - the LSTM takes as input a 3D tensor with shape (nb_samples, input_length, input_dim)

So, from the side view, your input data should look like this:

|0000xxxx|/
|00xxxxxx|/
|xxxxxxxx|/
|00000xxx|/

where / denotes the dimension of word2vec vectors

liyi193328 commented 8 years ago

@jgc128 Thanks for your reply and help. My labels look like [[0,1,0], [1,0,0], ..., [0,0,1]], with shape (nb_samples, nb_classes), where nb_samples = 10475 and nb_classes = 3. But I'm not sure what your symbols mean. From my perspective, nb_samples is the number of sequences, and every sequence is represented as a 2D array in which every row is a word vector and the number of columns is the word2vec dimension (here 600). And if I don't pad with all-zero rows, how can I make sure every 2D array has the same input_length, given that the sequences have different lengths? Thanks!

jgc128 commented 8 years ago

@liyi193328 sorry, I was unclear.

Yes, nb_samples is the number of all sequences (10475), input_length is the length of the longest sequence and input_dim is the dimension of word2vec vectors (600).

So the input matrix looks like this: [image: keras_lstm_input]

where View 1 is what is shown in my post above: the 0's are the zeros used for padding (note we pad from the left), and the x's are word vectors.

In the code it looks like this:

X = np.zeros((nb_samples, input_length, input_dim)) 
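For completeness, a minimal sketch of filling that tensor with left padding, assuming sequences is a list of per-sequence 2D arrays of shape (num_tokens, input_dim):

    import numpy as np

    nb_samples = len(sequences)
    X = np.zeros((nb_samples, input_length, input_dim), dtype='float32')
    for i, seq in enumerate(sequences):
        seq = np.asarray(seq)[:input_length]      # truncate sequences longer than input_length
        X[i, input_length - len(seq):] = seq      # left padding: zeros first, then the word vectors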
liyi193328 commented 8 years ago

@jgc128 Thanks, wonderful details about the input 3D array. More specifically, take three sentences: [[He, like, keras], [learning], [like, keras]], with 4-dim word vectors He -> [1,1,1,1], like -> [2,2,2,2], keras -> [3,3,3,3], learning -> [5,5,5,5]. Then after padding, the 3D array has shape (3, 3, 4):

    [ [ [1,1,1,1], [2,2,2,2], [3,3,3,3] ],
      [ [0,0,0,0], [0,0,0,0], [5,5,5,5] ],
      [ [0,0,0,0], [2,2,2,2], [3,3,3,3] ] ]

Is this specific example right? Thanks.

jgc128 commented 8 years ago

@liyi193328 Yes, it's right. It should work.

liyi193328 commented 8 years ago

@jgc128 Thanks. After padding as in the example, I still get a NaN loss. It's a little confusing. The model is:

    model.add(LSTM(output_dim=300, input_length=200, input_dim=600))
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', class_mode="categorical")

Are there other tricks?

dandxy89 commented 8 years ago

Has your target array been formatted correctly?

E.g. one class per column?

jgc128 commented 8 years ago

Can you show your code for creating the input data?

liyi193328 commented 8 years ago

@dandxy89 @jgc128 Thanks. Yes, my target array's shape is (nsamples, nb_classes), one class per column (0,1,3). I also tried your code in #853 (pad the word-index array with zeros on the left and add an Embedding layer). Everything went well and I got 66.1% accuracy on 3 classes with 8583 sequences. But the embedding vectors cost a lot of memory, and the GPU could not allocate enough device memory (so I could only use the CPU). That is why I want to make it faster with fixed word vectors.

My code for input data is:

maxlen = 200
lstm_input_dim = 600  # word2vec vector dimension (used in padTrainData below)
def init(textDatapath='./allData.txt', word2vecPath='./word2vec',maxlen=200,nb_classes=3,updated=False,vecDataPath='./trainVec(part).pickle',newvecDataPath='./trainVec(new).pickle'):
    Train = list()
    seqTokens = list()
    if os.path.isfile(vecDataPath) and updated == False:
        ft = codecs.open(vecDataPath,"rb")
        print("find ", vecDataPath)
        D = pickle.load(ft)
        Train = D['train']
        Label = D['label']
        ft.close()
    else:
        print("begin to update", newvecDataPath)
        f = codecs.open(textDatapath, "r", "utf-8") #every line is a sequence
        lines = f.readlines()
        Label = []
        Textokens = []
        for line in lines:
            t = line.split("\t")  #t[0] is the target
            tokens = jieba.lcut(t[1])  #segment sequence, get a list of tokens 
            vec = []
            existToken = []
            min_tokens = 5   # a sequence must have more than this many in-vocabulary tokens to be kept
            for token in tokens:
                try:
                    vector = word2vec[token] #get vector of token by word2vec(gensim)
                    vec.append(vector)
                    existToken.append(token)
                except KeyError:
                    continue
            if len(vec) <= min_tokens:
                continue
            else:
                s = np.array(vec)   #s is the sequence's 2D array
            Train.append(s)
            Label.append(int(t[0]))
            seqTokens.append(existToken)
        if updated == True:
            ft = codecs.open(newvecDataPath,"wb")
            print("dump to ",newvecDataPath)
            pickle.dump({'train':Train,'label':Label,'seqTokens':seqTokens},ft)
            ft.close()
    # Label and Train is a list, in Label every element is a scalar.
    #In my case, Label is -1,0,1, so it needs to plus 1 to become 0,1,2
    Label = np.array(Label,dtype='float32') + 1 
    Train = np.array(Train)  # in train every element is a numpy array.
    print("init finished!")
    return [Train,Label]

def padTrainData(Train,Label):
    print("pre train data...")
    Label = np.array(Label,dtype='float32')
    Label = np_utils.to_categorical(Label,nb_classes= 3)
    nsamples = Train.shape[0]
    train = np.empty((nsamples,maxlen,lstm_input_dim))
    for i in range(nsamples):
        t = Train[i]
        (tokens,dim)=t.shape
        if tokens < maxlen:
            #s is the empty array
            s = np.empty((maxlen-tokens, 600))
            # combine the padding block s and the sequence array t
            train[i] = np.concatenate( (s,t),axis=0)
        else:
            train[i] = np.array(t[0:maxlen])
    Train = np.array(train,dtype='float64')
    return [Train,Label]

train = None
label = None
train,label = init(vecDataPath='./trainVec(new).pickle',updated=True)
train,label = padTrainData(train,label)
print("train shape:",train.shape)
print("label shape:",label.shape)
The output is:
begin to update ./trainVec(new).pickle
dump to  ./trainVec(new).pickle
init finished!
pre train data...
train shape: (8583, 200, 600)
label shape: (8583, 3)

In my code, the first step is getting the training data, and the second is building the target label array and padding the training data with zeros. Thanks for helping me with such great patience.

jgc128 commented 8 years ago

One problem is that you are using np.empty, which does not initialize the array with zeros (see the documentation). Try np.zeros instead.

It should not give a NaN loss by itself though... Have you tried looking at the output of the network?
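To see the difference, a quick check:

    import numpy as np

    a = np.empty((2, 3))   # only allocates memory; the contents are whatever bytes were already there,
    print(a)               # so the "padded" rows can hold huge or denormal garbage values

    b = np.zeros((2, 3))   # allocates and zero-initializes
    print(b)               # [[0. 0. 0.]
                           #  [0. 0. 0.]]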

liyi193328 commented 8 years ago

@jgc128 Thank you very much. Everything works fine after I changed np.empty to np.zeros! It was all my mistake, sorry. When np.empty is used to initialize an array, the values may be too large or too small, resulting in NaN? Another question: how do I check the output of the network? With model.predict_proba, a Theano function, or some other way? Thanks sincerely!

jgc128 commented 8 years ago

Excellent!

You can use something like:

    classes = model.predict_classes(X_test, batch_size=32)

See "Getting started: 30 seconds to Keras" in the documentation for details.
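To inspect the raw softmax outputs themselves (handy for spotting NaNs), a minimal sketch using the Sequential API of that Keras version:

    import numpy as np

    probs = model.predict(X_test, batch_size=32)   # shape (nb_samples, nb_classes), softmax outputs
    print(probs[:5])
    print(np.isnan(probs).any())                   # quick check for NaNs in the predictions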

liyi193328 commented 8 years ago

@jgc128 Thanks. I'll dive into it.

William-Stocks commented 5 years ago

Why not use keras.preprocessing.sequence.pad_sequences?

    data = list()
    for individual in len( -- ):                                 # as posted; the iterable is elided here
        express_matrix = individual.express_individual_times()   # each sample returns a 2D matrix, N * 256
        data.append(express_matrix)
    train_matrix = sequence.pad_sequences(data, padding='post', maxlen=40)

I checked the data and made sure it pads correctly, but I still get loss = NaN for several samples. I wonder whether I should delete these samples, even though they look perfectly normal.
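For reference, a minimal self-contained sketch of that approach with made-up sizes, assuming a Keras version whose pad_sequences accepts per-sample 2D float matrices:

    import numpy as np
    from keras.preprocessing import sequence

    # three samples with 2, 4 and 1 timesteps, each timestep a 256-dim feature vector
    data = [np.random.rand(n, 256).astype('float32') for n in (2, 4, 1)]

    # pads/truncates along the time axis; dtype must be float so the vectors are not cast to int
    train_matrix = sequence.pad_sequences(data, padding='post', maxlen=40, dtype='float32')
    print(train_matrix.shape)   # (3, 40, 256)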

simha1214 commented 5 years ago

Hi, I am using np.zeros() but still get very low accuracy, around 37% on 900 samples for a 30-class classification. I used tanh as the activation before the softmax layer. All suggestions are welcome.

My code is as follows:

    def build_matrix(word_index):
        embedding_index = load_embeddings(path)

        embedding_matrix = np.zeros((len(word_index) + 1, 100))
        unknown_words = []

        for word, i in word_index.items():
            try:
                embedding_matrix[i] = w2v_model[word]
            except KeyError:
                unknown_words.append(word)
        return embedding_matrix

    embedding_matrix = build_matrix(tokenizer.word_index)

    model = Sequential()
    model.add(Embedding(max_features, embedding_matrix.shape[1], weights=[embedding_matrix], input_length=MAX_LEN, trainable=False))
    model.add(SpatialDropout1D(0.3))
    model.add(LSTM(LSTM_UNITS, activation='relu', return_sequences=True))
    model.add(LSTM(LSTM_UNITS))
    model.add(Dropout(0.5))

    model.add(LSTM(100))  # note: stacking this after an LSTM without return_sequences=True will raise a shape error

    model.add(Dense(4 * LSTM_UNITS, input_shape=(1000,), activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4 * LSTM_UNITS, activation='tanh'))
    model.add(Dense(30, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())

    history = model.fit(x_train, y_train, nb_epoch=11, batch_size=64, validation_data=(x_test, y_test))

    # Final evaluation of the model
    scores = model.evaluate(x_test, y_test, verbose=0)
    print("Accuracy: %.2f%%" % (scores[1] * 100))