Open mdetournay opened 6 years ago
Is your backend Theano? If so, you may want to look at #131.
I've observed the same behaviour under TensorFlow. Package versions:
keras-contrib==2.0.8
Keras==2.2.0
tensorflow==1.8.0
The odd thing is that accuracy keeps going up and the model is indeed learning to predict from the data. What I find strange is that in join mode the CRF minimizes the negative log-likelihood, which, as far as I know, is a non-negative function, since it is the negative of the log of a probability.
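(For reference, a sketch of the math as I understand it, not taken from the keras-contrib source: with an exact partition function the loss cannot go below zero.)
NLL(y | x) = -log p(y | x) = log Z(x) - s(x, y)
Z(x) = sum over y' of exp(s(x, y')) >= exp(s(x, y))  =>  NLL(y | x) >= 0
So a negative value should point at how the normalization is computed in practice (masking or numerical approximation of log Z), not at the model failing to learn.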
Some logs from the training:
3541
Epoch 1/10
457/457 [==============================] - 30s 65ms/step - loss: 0.7630 - acc: 0.7884 - val_loss: 0.4350 - val_acc: 0.8624
Epoch 00001: val_loss improved from inf to 0.43499, saving model to ./models/model.hdf5
Epoch 2/10
457/457 [==============================] - 28s 60ms/step - loss: 0.3308 - acc: 0.8854 - val_loss: 0.2464 - val_acc: 0.9071
Epoch 00002: val_loss improved from 0.43499 to 0.24643, saving model to ./models/model.hdf5
Epoch 3/10
457/457 [==============================] - 28s 60ms/step - loss: 0.1710 - acc: 0.9122 - val_loss: 0.1427 - val_acc: 0.9204
Epoch 00003: val_loss improved from 0.24643 to 0.14268, saving model to ./models/model.hdf5
Epoch 4/10
457/457 [==============================] - 27s 60ms/step - loss: 0.0672 - acc: 0.9243 - val_loss: 0.0708 - val_acc: 0.9257
Epoch 00004: val_loss improved from 0.14268 to 0.07085, saving model to ./models/model.hdf5
Epoch 5/10
457/457 [==============================] - 27s 60ms/step - loss: -0.0142 - acc: 0.9310 - val_loss: 0.0106 - val_acc: 0.9297
Epoch 00005: val_loss improved from 0.07085 to 0.01058, saving model to ./models/model.hdf5
Epoch 6/10
457/457 [==============================] - 28s 60ms/step - loss: -0.0833 - acc: 0.9356 - val_loss: -0.0448 - val_acc: 0.9327
@linxihui, could you please comment on this issue? By the way, thanks for the implementation.
I have the same issue, did you end up figuring this out?
Not yet, as I’m masking my inputs... But it should be fixed..
By the way, even if you don’t need masking, I have found that adding a dummy mask layer after the input layer fixes this issue...
@lzfelix How do I do that? I am new to this. This is my current code:
# Imports assumed for this snippet (Keras 2.x + keras-contrib):
import numpy as np
from keras.layers import Input, Lambda, Bidirectional, LSTM, TimeDistributed, Dense
from keras.models import Model
from keras_contrib.layers import CRF

def lstm_model():
    # get_elmo and no_of_tags are defined elsewhere in my code
    input_text = Input(shape=(None,), dtype="string", name='input_layer')
    embedding = Lambda(get_elmo, name='elmo_embedding')(input_text)
    x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.5, dropout=0.5), name='bilstm1')(embedding)
    x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.5, dropout=0.5), name='bilstm2')(x)
    x = TimeDistributed(Dense(50, activation="relu"))(x)
    crf = CRF(no_of_tags, sparse_target=True)
    out = crf(x)
    model = Model(input_text, out)
    return model, crf

model, crf = lstm_model()
model.compile(optimizer='rmsprop', loss=crf.loss_function, metrics=[crf.accuracy])
model.fit(np.array(train_x), train_y, validation_data=(np.array(valid_x), valid_y), shuffle=True, batch_size=batch_size, epochs=1, verbose=1, callbacks=[checkpoint])
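For what it's worth, here is a minimal sketch of the "dummy mask layer" suggestion applied to a model like the one above. This is only my reading of @lzfelix's tip, not a confirmed fix; the layer name and mask_value are illustrative. The idea is that a Masking layer whose mask_value never occurs keeps every timestep, but still propagates a mask tensor down to the CRF.

from keras.layers import Masking

# inside lstm_model(), right after the embedding:
# no timestep will have all features equal to -1e9, so nothing is actually
# masked out, but a mask is now attached to the tensor and passed to the CRF.
masked = Masking(mask_value=-1e9, name='dummy_mask')(embedding)
x = Bidirectional(LSTM(units=100, return_sequences=True,
                       recurrent_dropout=0.5, dropout=0.5), name='bilstm1')(masked)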
I have the same issue when training a BERT + Bi-LSTM + CRF network for named entity recognition on the Chinese People's Daily dataset. How can I solve it? Some logs follow:
20800/20864 [============================>.] - ETA: 0s - loss: -0.6666 - crf_accuracy: 0.9548
20816/20864 [============================>.] - ETA: 0s - loss: -0.6667 - crf_accuracy: 0.9548
20832/20864 [============================>.] - ETA: 0s - loss: -0.6667 - crf_accuracy: 0.9548
20848/20864 [============================>.] - ETA: 0s - loss: -0.6667 - crf_accuracy: 0.9548
20864/20864 [==============================] - 107s 5ms/step - loss: -0.6667 - crf_accuracy: 0.9548 - val_loss: -0.6612 - val_crf_accuracy: 0.9498
I am experiencing the same issue with the CRF layer.
@EvilPort2 Did you figure it out while using ELMo?
Any solutions for this? The literature suggests that a CRF output layer performs better than a softmax layer for sequence labelling, but for named entity recognition I have seen worse results with the CRF. I also see the negative loss. Has anybody found a solution?
@EvilPort2 you should only use sparse_target=True if your labels are NOT one-hot encoded.
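For concreteness, the two target formats look roughly like this, as far as I understand the keras-contrib CRF (the 3-tag example values are made up):

import numpy as np

# sparse_target=True: integer tag indices with a trailing singleton axis,
# shape (batch, seq_len, 1)
y_sparse = np.array([[[0], [2], [1]]])

# sparse_target=False (the default): one-hot encoded tags,
# shape (batch, seq_len, no_of_tags)
y_onehot = np.eye(3)[[0, 2, 1]][None, ...]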
@deanhoperobertson I had figured it out way earlier. But thanks for your reply anyway.
@tapos12 Yes, I figured it out while I was using ELMo.
@yongzhuo Hello, have you solved this negative-loss problem yet?
I have the same issue. The loss is limited to a small negative value. This loss should never go negative (https://arxiv.org/pdf/1603.01360.pdf). I think this is because of the logsumexp approximation (taking exp and summing directly can be numerically unstable). Anyway, this doesn't matter much, because following the gradient can only drive the best-path score (emission + transition) towards the sum over all paths, so the loss ends up only slightly negative due to the approximation.
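For anyone unfamiliar with it, this is the standard log-sum-exp trick being referred to (a generic NumPy sketch, not the keras-contrib implementation):

import numpy as np

def logsumexp(scores):
    # Naive np.log(np.sum(np.exp(scores))) overflows for large scores.
    # Subtracting the max first is algebraically identical but numerically stable.
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))

scores = np.array([1000.0, 999.0, 998.0])
print(np.log(np.sum(np.exp(scores))))  # inf (overflow)
print(logsumexp(scores))               # ~1000.41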
I think it is because of the activation used in the CRF layer when calculating the negative log-likelihood scalar. I added Keras' ReLU activation where the loss is calculated, and this solved the problem for me without any change in prediction performance:
def loss(y_true, y_pred):
    X = crf.input
    mask = crf.input_mask
    # Negative log-likelihood from the CRF layer, clamped at zero with ReLU
    # so the reported loss can no longer go negative.
    nloglik = crf.get_negative_log_likelihood(y_true, X, mask)
    return tf.keras.activations.relu(nloglik)
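If I understand the suggestion correctly, you would then compile against this wrapper instead of crf.loss_function, something like:

model.compile(optimizer='rmsprop', loss=loss, metrics=[crf.accuracy])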
I've found a solution to this issue. Please refer to my comment.
I have the same issue, how can I solve it?
Please refer to this link
Hi,
I am training an LSTM-CRF network for named entity recognition. When using crf.loss_function, I get negative losses after a few epochs. Before I go into more detail about my code: is this even possible with crf.loss_function? I tried looking at the code, but because of its recursive nature I'm having trouble identifying where the negative values could come from.
Moreover, when training without the CRF layer, using a time-distributed dense (softmax) layer with categorical cross-entropy instead, the losses stay positive and training works correctly.
What could I investigate to identify where the problem is?
Thank you very much for your help!
Cheers
Martin
Here is the network:
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 170)                0
embedding_1 (Embedding)      (None, 170, 100)           22203800
bidirectional_1 (Bidirection (None, 170, 128)           63360
time_distributed_1 (TimeDist (None, 170, 64)            8256
crf_1 (CRF)                  (None, 170, 3)             210
=================================================================
Total params: 22,275,626
Trainable params: 71,826
Non-trainable params: 22,203,800