keras-team / keras-contrib

CRF layer negative loss #253

Open mdetournay opened 6 years ago

mdetournay commented 6 years ago

Hi,

I am training an LSTM-CRF network for named entity recognition. When using crf.loss_function, I'm getting negative losses after a few epochs. Before I go on giving more details about my code, is this even possible with this crf.loss_function? I tried looking at the code, but because of its recursive nature I'm having trouble identifying where I could be getting negative values.

Moreover, when training without the CRF layer, using a TimeDistributed Dense (softmax) layer and categorical cross-entropy instead, losses stay positive and training works correctly.

What could I investigate to identify where the problem is?

Thank you very much for your help!

Cheers

Martin

Here is the network:

input = Input(shape=(170,))
x = Embedding(input_dim=len(TrainingObj.keras_tokenizer.word_index) + 1, output_dim=100,
              weights=[TrainingObj.embedding_matrix], mask_zero=False,
              input_length=170, trainable=False)(input)
x = Bidirectional(GRU(units=64, return_sequences=True, dropout=0.1), merge_mode='concat')(x)
x = TimeDistributed(Dense(64, activation='relu'))(x)

crf = CRF(num_classes, sparse_target=False)
out = crf(x)

self.model = Model(inputs=input, outputs=out)
if restart:
    self.load_model()

self.model.compile(optimizer='adam', loss=crf.loss_function, metrics=[crf.accuracy])

Layer (type)                 Output Shape         Param #
=================================================================
input_1 (InputLayer)         (None, 170)          0
embedding_1 (Embedding)      (None, 170, 100)     22203800
bidirectional_1 (Bidirection (None, 170, 128)     63360
time_distributed_1 (TimeDist (None, 170, 64)      8256
crf_1 (CRF)                  (None, 170, 3)       210
=================================================================
Total params: 22,275,626
Trainable params: 71,826
Non-trainable params: 22,203,800

Euruson commented 6 years ago

Is your backend Theano? If so, maybe take a look at #131.

lzfelix commented 6 years ago

I've observed the same behaviour under TensorFlow. Package versions:

keras-contrib==2.0.8
Keras==2.2.0
tensorflow==1.8.0

The odd thing is that accuracy keeps going up and the model is indeed learning to predict from the data. What I find strange is that in join mode the CRF minimizes the negative log-likelihood, which, as far as I know, should be non-negative, since it is the negative of the log of a probability.

Some logs from the training:

3541
Epoch 1/10
457/457 [==============================] - 30s 65ms/step - loss: 0.7630 - acc: 0.7884 - val_loss: 0.4350 - val_acc: 0.8624

Epoch 00001: val_loss improved from inf to 0.43499, saving model to ./models/model.hdf5
Epoch 2/10
457/457 [==============================] - 28s 60ms/step - loss: 0.3308 - acc: 0.8854 - val_loss: 0.2464 - val_acc: 0.9071

Epoch 00002: val_loss improved from 0.43499 to 0.24643, saving model to ./models/model.hdf5
Epoch 3/10
457/457 [==============================] - 28s 60ms/step - loss: 0.1710 - acc: 0.9122 - val_loss: 0.1427 - val_acc: 0.9204

Epoch 00003: val_loss improved from 0.24643 to 0.14268, saving model to ./models/model.hdf5
Epoch 4/10
457/457 [==============================] - 27s 60ms/step - loss: 0.0672 - acc: 0.9243 - val_loss: 0.0708 - val_acc: 0.9257

Epoch 00004: val_loss improved from 0.14268 to 0.07085, saving model to ./models/model.hdf5
Epoch 5/10
457/457 [==============================] - 27s 60ms/step - loss: -0.0142 - acc: 0.9310 - val_loss: 0.0106 - val_acc: 0.9297

Epoch 00005: val_loss improved from 0.07085 to 0.01058, saving model to ./models/model.hdf5
Epoch 6/10
457/457 [==============================] - 28s 60ms/step - loss: -0.0833 - acc: 0.9356 - val_loss: -0.0448 - val_acc: 0.9327

@linxihui, could you please comment on this issue? By the way, thanks for the implementation.

sharonrapoport commented 6 years ago

I have the same issue. Did you end up figuring this out?

lzfelix commented 6 years ago

Not yet, as I'm masking my inputs... but it should be fixed.

lzfelix commented 6 years ago

By the way, even if you don’t need masking, I have found that adding a dummy mask layer after the input layer fixes this issue...
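
A minimal sketch of that idea applied to the model from the first post (assuming Keras 2 with keras-contrib; the Masking placement, the mask_value of -1.0, and the placeholder names vocab_size and num_tags are illustrative choices, not the only way to do it):

from keras.layers import Input, Embedding, Masking, Bidirectional, GRU, TimeDistributed, Dense
from keras.models import Model
from keras_contrib.layers import CRF

inp = Input(shape=(170,))
x = Embedding(input_dim=vocab_size, output_dim=100, input_length=170)(inp)  # vocab_size: your vocabulary size
# Dummy mask: no embedding vector is exactly all -1.0, so every timestep stays unmasked,
# but downstream layers (including the CRF) now receive an explicit mask tensor.
x = Masking(mask_value=-1.0)(x)
x = Bidirectional(GRU(64, return_sequences=True), merge_mode='concat')(x)
x = TimeDistributed(Dense(64, activation='relu'))(x)
crf = CRF(num_tags, sparse_target=False)  # num_tags: number of entity labels
out = crf(x)

model = Model(inputs=inp, outputs=out)
model.compile(optimizer='adam', loss=crf.loss_function, metrics=[crf.accuracy])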

EvilPort2 commented 6 years ago

By the way, even if you don’t need masking, I have found that adding a dummy mask layer after the input layer fixes this issue...

@lzfelix How do I do that? I am new to this thing. Currently this is my code.

import numpy as np
from keras.layers import Input, Lambda, Bidirectional, LSTM, TimeDistributed, Dense
from keras.models import Model
from keras_contrib.layers import CRF

# get_elmo, no_of_tags, batch_size, checkpoint and the train/valid arrays are defined elsewhere.
def lstm_model():
    input_text = Input(shape=(None, ), dtype="string", name='input_layer')
    embedding = Lambda(get_elmo, name='elmo_embedding')(input_text)
    x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.5, dropout=0.5), name='bilstm1')(embedding)
    x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.5, dropout=0.5), name='bilstm2')(x)
    x = TimeDistributed(Dense(50, activation="relu"))(x)
    crf = CRF(no_of_tags, sparse_target=True)
    out = crf(x)
    model = Model(input_text, out)
    return model, crf

model, crf = lstm_model()
model.compile(optimizer='rmsprop', loss=crf.loss_function, metrics=[crf.accuracy])
model.fit(np.array(train_x), train_y, validation_data=(np.array(valid_x), valid_y), shuffle=True, batch_size=batch_size, epochs=1, verbose=1, callbacks=[checkpoint])
yongzhuo commented 5 years ago

I have the same issue when training a BERT + Bi-LSTM + CRF network for named entity recognition on chinese_people_daily. How could I solve it? Some logs follow:

20800/20864 [============================>.] - ETA: 0s - loss: -0.6666 - crf_accuracy: 0.9548
20816/20864 [============================>.] - ETA: 0s - loss: -0.6667 - crf_accuracy: 0.9548
20832/20864 [============================>.] - ETA: 0s - loss: -0.6667 - crf_accuracy: 0.9548
20848/20864 [============================>.] - ETA: 0s - loss: -0.6667 - crf_accuracy: 0.9548
20864/20864 [==============================] - 107s 5ms/step - loss: -0.6667 - crf_accuracy: 0.9548 - val_loss: -0.6612 - val_crf_accuracy: 0.9498

touhi99 commented 5 years ago

I am experiencing the same issue with the CRF layer.

touhi99 commented 5 years ago

Did you figure it out while using ELMo?

Risico305 commented 5 years ago

Any solutions for this? The research says that a CRF at the end of a sequence-labelling model performs better than a softmax layer; however, in the case of named entity recognition I have seen worse results with the CRF, and I also experience the negative loss. Has anybody found a solution?

deanhoperobertson commented 4 years ago

@EvilPort2 you should only use sparse_target=True if your labels are NOT one-hot encoded.
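
To make the difference concrete, here is a small illustration of the two label formats (a toy 3-tag scheme and a single 3-token sentence; as far as I recall, keras-contrib reads sparse targets from a trailing axis of size 1):

import numpy as np

# sparse_target=True: integer tag indices, shape (batch, timesteps, 1)
y_sparse = np.array([[[0], [2], [1]]])

# sparse_target=False: one-hot tag vectors, shape (batch, timesteps, n_tags)
y_onehot = np.array([[[1, 0, 0],
                      [0, 0, 1],
                      [0, 1, 0]]])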

EvilPort2 commented 4 years ago

@deanhoperobertson I had figured it out way earlier. But thanks for your reply anyway.

Did you figure it out while using ELMo?

@tapos12 Yes, I figured that out while I was using ELMo.

junglefish8086 commented 4 years ago

@yongzhuo Hi, have you managed to solve this negative-loss problem yet?

NirmalenduPrakash commented 4 years ago

I have the same issue; the loss bottoms out at a small negative value. This loss should never go negative (https://arxiv.org/pdf/1603.01360.pdf). I think this is because of the logsumexp approximation (taking the exp and summing directly can be numerically unstable). Anyway, this doesn't really matter, because following the gradient can only drive the best-path score (emission + transition) towards the sum over all paths, so the loss ends up only slightly negative here due to the approximation.
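
A tiny numeric illustration of that point (made-up scores, not from the paper): the CRF loss is logsumexp over the scores of all tag sequences minus the score of the gold sequence, and it cannot be negative when computed exactly, because the gold sequence is one of the summed terms.

import numpy as np
from scipy.special import logsumexp

# Unnormalized scores (emission + transition) of every possible tag sequence.
path_scores = np.array([2.3, 1.1, 4.0, 0.5])
gold_score = 4.0  # score of the true tag sequence

# Negative log-likelihood of the gold sequence: log-partition minus gold score.
nll = logsumexp(path_scores) - gold_score
print(nll)  # ~0.24, small but non-negative when computed exactly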

waring92 commented 4 years ago

I think the problem is in the scalar returned when the CRF layer computes the negative log-likelihood. Applying Keras' ReLU activation where the loss is calculated solved it for me, with no change in prediction performance:

import tensorflow as tf

# `crf` is the CRF layer instance used in the model.
def loss(y_true, y_pred):
    X = crf.input
    mask = crf.input_mask
    nloglik = crf.get_negative_log_likelihood(y_true, X, mask)
    # Clamp at zero so the reported loss can never go negative.
    return tf.keras.activations.relu(nloglik)
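
For reference, a minimal sketch of plugging this in, assuming the model and crf objects built as in the earlier snippets in this thread:

# `model` and `crf` come from the earlier model definition; `loss` is the clamped function above.
model.compile(optimizer='adam', loss=loss, metrics=[crf.accuracy])
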
pinesnow72 commented 4 years ago

I've found a solution to this issue. Please refer to my comment.

cpmss521 commented 4 years ago

I have the same issue. How could I solve it?

pinesnow72 commented 4 years ago

I have the same issue. How could I solve it?

Please refer to this link