keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Feature Request: Linear Chain Conditional Random Field #4090

Closed phipleg closed 4 years ago

phipleg commented 7 years ago

I implemented a Linear Chain CRF layer for sequence tagging tasks inspired by the paper:

Lample et al., Neural Architectures for Named Entity Recognition

The layer is the identity function during training and applies the forward-backward algorithm during inference. For that, it holds a set of trainable parameters which are accessed by a specific loss function.
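
In code, the intended usage pattern looks roughly like this (a minimal sketch based on the API in the gist linked below; the exact names may differ in the final pull request):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense, ChainCRF

n_words, maxlen, n_classes = 10000, 32, 45

model = Sequential()
model.add(Embedding(n_words, 128, input_length=maxlen))
model.add(LSTM(64, return_sequences=True))
model.add(TimeDistributed(Dense(n_classes)))  # per-timestep class scores
crf = ChainCRF()
model.add(crf)  # identity at training time; forward-backward at inference
model.compile(loss=crf.loss, optimizer='rmsprop')  # the loss reads the CRF's transition parameters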

You can see the API in the short gist for POS tagging on the Penn Treebank (provided by NLTK): https://gist.github.com/phipleg/adfccb0ad96b777eecc9bb0f16ab54fc

Currently, it is only implemented in Theano and supports fixed length sequences (no masking).

Is anybody interested in seeing the layer in Keras? The need was raised in issue #824, but that issue is closed.

I could refactor my code and make a pull request in a few days. For that I would need to add a few functions to the Theano backend because I make use of Theano's scan function. I could also provide an example.

tmills commented 7 years ago

I would find this useful.

phipleg commented 7 years ago

Ok, great! Before I make the pull request, I will try to figure out the TensorFlow implementation so that the ChainCRF works with both backends.

kaya27 commented 7 years ago

I'm interested; I would find this very useful.

efosler commented 7 years ago

Following as well. I was about to embark on an implementation myself. (I may still do so just for the exercise, but if you have something, I would be interested.)

matteotosi commented 7 years ago

Interested as well! ;)

c4n commented 7 years ago

This is awesome!

sonalgupta commented 7 years ago

This will be great!

phipleg commented 7 years ago

Thanks for your support! I am almost done, except for a bug in my tensorflow implementation. I hope to resolve this issue in a few days.

danche354 commented 7 years ago

It will be an awesome function!

xiangyanchao commented 7 years ago

Great, waiting for update!

phipleg commented 7 years ago

Sorry for the delay. I am still working on it, but only in my spare time. The layer is complete, but the example is not finished yet.

nreimers commented 7 years ago

I'm interested as well. I have the problem that RNNs are not able to capture e.g. BIO encoding correctly and produce ill-formed BIO tags (e.g. starting an I-tag without a previous B-tag).

Thanks for contributing and looking forward to your implementation.

sunbohit commented 7 years ago

python3 conll2000_bi_lstm_crf.py
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Traceback (most recent call last):
  File "conll2000_bi_lstm_crf.py", line 16, in <module>
    from keras.layers import Dense, Embedding, ChainCRF, LSTM, Bidirectional, Dropout
ImportError: cannot import name 'ChainCRF'

I have run the setup.py in https://github.com/fchollet/keras/tree/bba6b521abc462261dd65883be59c94e1467b7cf and I can see crf.py in /keras/layers/, but when I ran /examples/conll2000_bi_lstm_crf.py, I got the ImportError above.

What is the right way to run this file?

phipleg commented 7 years ago

This should work if the library is properly installed. I guess you had a previous Keras version in your conda environment, so your install didn't update existing files but only added the new ones, for example keras/layers/__init__.py, which is likely the source of the error.

Try again:

python setup.py install --force

You can check the installation by running

python3 -c "from keras.layers import ChainCRF"

If this doesn't throw an ImportError, the example should work.

sunbohit commented 7 years ago

Thanks. After uninstalling my previous Keras version, I successfully imported the package.

lemmonation commented 7 years ago

Hi, I'm running a BLSTM-CRF model, but before the training begins I get the following error:

Train on 1860255 samples, validate on 206696 samples
Epoch 1/5
Traceback (most recent call last):
  File "train_keras_model.py", line 125, in <module>
    args.batchsize, args.maxlen, args.maxepochs, args.hiddenunits, args.dropout)
  File "train_keras_model.py", line 96, in train
    nb_epoch = maxepochs, validation_data = (test_X,Y_test))
  File "/home/junliangguo/keras-b/keras/models.py", line 652, in fit
    sample_weight=sample_weight)
  File "/home/junliangguo/keras-b/keras/engine/training.py", line 1111, in fit
    initial_epoch=initial_epoch)
  File "/home/junliangguo/keras-b/keras/engine/training.py", line 826, in _fit_loop
    outs = f(ins_batch)
  File "/home/junliangguo/keras-b/keras/backend/tensorflow_backend.py", line 1096, in __call__
    updated = session.run(self.outputs + [self.updates_op], feed_dict=feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 717, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 894, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (20, 4) for Tensor u'chaincrf_1_target:0', which has shape '(?, ?, ?)'

and my model is as follows:

model = Sequential()
model.add(Embedding(output_dim = embeddingDim, input_dim = vocabSize + 1, input_length = maxlen, mask_zero = False, weights = [embeddingWeights]))
model.add(Bidirectional(LSTM(output_dim = hiddenDims, return_sequences = True), merge_mode = 'concat'))
model.add(Dropout(dropout))
model.add(TimeDistributed(Dense(outputDims)))
crf = ChainCRF()
model.add(crf)
model.compile(loss = crf.loss, optimizer = 'adam', metrics = ["accuracy"])

Before that I added a TimeDistributed wrapper to make the input dimension of the CRF correct, but I don't know what this error means. Could somebody help me?

phipleg commented 7 years ago

In your setting, the targets must be one-hot encoded and hence of dimension 3 (and not 2), i.e:

Y_test.shape = (nb_samples, timesteps, nb_classes)
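
If your labels start as integer tag sequences of shape (nb_samples, timesteps), a minimal NumPy sketch of the conversion (assuming per-timestep integer tags; this helper is illustrative, not part of the ChainCRF API):

import numpy as np

def one_hot_targets(y, nb_classes):
    # y: integer tags of shape (nb_samples, timesteps)
    out = np.zeros(y.shape + (nb_classes,), dtype='float32')
    for i in range(y.shape[0]):
        for t in range(y.shape[1]):
            out[i, t, y[i, t]] = 1.0
    return out  # shape (nb_samples, timesteps, nb_classes)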
lemmonation commented 7 years ago

Thanks for the reply but I'm not very clear about the shape.

After my preprocessing I use Y_test = np_utils.to_categorical(test_y, LabelDims), so one batch of Y has the shape (batchsize, LabelDims), i.e. (nb_samples, nb_classes).

So how can I make the dimension transition from (nb_samples, nb_classes) to (nb_samples, timesteps, nb_classes)? Is there a function for this, or do I need to change my preprocessing step?

Thanks a lot.

phipleg commented 7 years ago

You cannot make the desired dimension transition. The model works only for temporal data, and your preprocessing shows that this is not the case for your data. Why are you trying to use a ChainCRF?

lemmonation commented 7 years ago

I use an LSTM and CRF for Chinese sequence segmentation. In my preprocessing I slide a window over the sentence, so the training data has X.size = (total_count, window_size) and Y.size = (total_count,).

It seems that I should set a sequence length as timesteps, so that the data shapes become X.size = (batch_size, seq_len, window_size) and Y.size = (batch_size, seq_len).

But I have used an embedding layer, which makes the output dimension going into the LSTM layer (batch_size, window_size, embedding_dim). The network works well and gets good results this way because it takes window_size as the time steps, with return_sequences = False in the last LSTM layer, which makes the dimensions match between the network's output and Y.

I also notice that the ChainCRF layer doesn't support the mask_zero argument yet. So does this mean I should discard the embedding layer and reshape the data to X.size = (batch_size, seq_len, window_size) and Y.size = (batch_size, seq_len, n_classes) (after a to_categorical)?

lemmonation commented 7 years ago

Thanks for the advice. I have already fixed the problem by discarding the embedding layer and reshaping the data dimensions.

LopezGG commented 7 years ago

Can I use this ChainCRF to implement a BiLSTM with CRF for NER tagging, as shown in the code here: https://github.com/glample/tagger?

SamihYounes commented 7 years ago

I got this error, any help?

Loading data...
Unique words: 17260
Unique pos_tags: 45
Unique chunk tags: 24
X_words_train shape: (8936, 80)
X_words_test shape: (2012, 80)
y_train shape: (8936, 80, 1)
y_test shape: (2012, 80, 1)
Build model...
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 569, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 632, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 164, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/crf.py", line 122, in call
    y_pred = K.crf_inference(x, self.U, self.b)
AttributeError: 'module' object has no attribute 'crf_inference'

phipleg commented 7 years ago

Hi @SamihYounes,

Please update to the latest version of the pull request #4621.

phipleg commented 7 years ago

Hi @LopezGG,

Yes, this will be possible. You can start with the chunking example. When the CRF layer is finally merged, I will provide an example.

NianzuMa commented 7 years ago

I have a question about the example code

# Imports assumed from the example script; load_treebank is defined in the gist itself
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, TimeDistributed, Dense, ChainCRF

n_words = 10000
maxlen = 32
(X_train, y_train), (X_test, y_test) = load_treebank(nb_words=n_words, maxlen=maxlen)

n_samples, n_steps, n_classes = y_train.shape

model = Sequential()
model.add(Embedding(n_words, 128, input_length=maxlen, dropout=0.2))
model.add(LSTM(64, return_sequences=True))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(n_classes)))
model.add(Dropout(0.2))
crf = ChainCRF()
model.add(crf)
model.compile(loss=crf.loss, optimizer='rmsprop', metrics=['accuracy'])

After the LSTM layer, the Dense layer converts each timestep of the LSTM output into n_classes dimensions. For instance, the output of the LSTM is (batch_size, timestep, lstm_output), e.g. (64, 3, 100). Why should this output be converted to (64, 3, nb_class) using a Dense layer? Couldn't data of shape (64, 3, 100) be the direct input to the CRF layer, with the CRF layer's output then being (64, 3, nb_class)?

The CRF could take features of size 100 for each timestep and then output tags of size 8, right?

phipleg commented 7 years ago

Dear @JacobIsrael123,

Of course, we could integrate a dense layer for input dimension conversion, but this is not always necessary (for example, if the preceding layer is a recurrent layer with output dimension nb_classes). While designing the ChainCRF layer, I decided to keep it as simple as possible.
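
For illustration, a sketch of that variant (assuming the same n_words, maxlen, and n_classes as in the example above; the recurrent layer itself emits n_classes scores per timestep, so no Dense conversion is needed):

model = Sequential()
model.add(Embedding(n_words, 128, input_length=maxlen))
model.add(LSTM(n_classes, return_sequences=True))  # output dimension already equals n_classes
crf = ChainCRF()
model.add(crf)
model.compile(loss=crf.loss, optimizer='rmsprop')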

williamFalcon commented 7 years ago

Where does load_treebank come from? What is the shape of the data? @JacobIsrael123

Ethan1214 commented 7 years ago

Hello @phipleg: I want to try @JacobIsrael123's idea of letting the output of every timestep be the direct input to the CRF layer. Can I add the CRF layer like this: model.add(TimeDistributed(ChainCRF()))?
In that case, what is the shape of the CRF layer's output at every timestep? A 4-dimensional vector (if my data has 4 labels, like 'B I E O')?

My view is that the CRF layer takes the place of a dense layer like this: model.add(TimeDistributed(Dense(Y_train.shape[2], activation='softmax'))) (using a softmax activation to handle simple sequence-labeling tasks). Am I right?

phipleg commented 7 years ago

Hi @Ethan1214, The ChainCRF is not an activation function that you can apply manually on each timestep. The time dimension must be part of the input. In fact, the input and output shape of the ChainCRF are identical and equal to (batch_size, maxlen, n_classes).

For example:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, ChainCRF

vocab_size = 20
n_classes = 11
model = Sequential()
model.add(Embedding(vocab_size, n_classes))
layer = ChainCRF()
model.add(layer)
model.compile(loss=layer.sparse_loss, optimizer='sgd')

batch_size, maxlen = 2, 5
x = np.random.randint(1, vocab_size, size=(batch_size, maxlen))
y = np.random.randint(n_classes, size=(batch_size, maxlen))
y = np.expand_dims(y, 2)
model.train_on_batch(x, y)

monod91 commented 7 years ago

Hi @phipleg, I find your implementation of ChainCRF great! I am having trouble getting my tagging model to work. Forgive me if my question is trivial; I am a newbie in Keras :) Basically, I would like to compare a very simple tagging model (with layers Embedding - LSTM - TimeDistributed Dense) with and without a CRF layer on top. Something like:

sentences = Input((maxlen,))
word_embeddings = Embedding(input_dim = (vocab_size + 1), output_dim = embedding_size, input_length = max_length)(sentences)
lstm_output = LSTM(output_dim = hidden_size, return_sequences = True)(word_embeddings)

crf_scores = TimeDistributed(Dense(n_classes))(lstm_output)
crf = ChainCRF()
probabilities = crf(crf_scores)
loss = crf.loss

model = Model(input = [sentences], output = probabilities)
model.compile(metrics = [categorical_accuracy], sample_weight_mode = "temporal", loss = loss, optimizer = "sgd")

My model works fine when I do not use any CRF layer on top (i.e. with a TimeDistributed Dense layer and categorical cross-entropy as the loss). However, when trying to add the CRF layer I get the following error: ValueError: Input dimension mis-match. (input[0].shape[1] = BATCH_SIZE, input[1].shape[1] = MAX_LEN)

Did you also have the same problem in the past?

phipleg commented 7 years ago

Hi @monod91,

unfortunately, sample_weight_mode = "temporal" is not supported for the ChainCRF. This might cause your problem.
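
A possible workaround is to compile without the temporal weighting, e.g. (a sketch of the compile call above with only the unsupported argument removed):

model.compile(metrics = [categorical_accuracy], loss = loss, optimizer = "sgd")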

monod91 commented 7 years ago

Hi @phipleg , thank you for your answer.

phipleg commented 7 years ago
  1. I moved my work on the CRF to the crf branch in my fork (https://github.com/phipleg/keras/tree/crf), but it is not fully Keras 2 compliant yet (persistence is not supported and mini-batches of size 1 are problematic).

  2. Correct, I either use a mask or ignore the variable-length issue. If I am not mistaken, a loss function in Keras returns a tensor of shape (batch_size, timesteps), i.e. a loss value per sample and time step. This is mapped to a tensor of shape (batch_size,) by multiplying by the sample_weight array and then averaging along the time axis. In contrast, the loss functions of the ChainCRF work differently and return a tensor of shape (batch_size,) directly.
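
In pseudo-NumPy, the shape contract in point 2 can be illustrated like this (random stand-in values; not the actual Keras internals):

import numpy as np

batch_size, timesteps = 4, 7
per_timestep = np.random.rand(batch_size, timesteps)      # standard loss: one value per sample and timestep
sample_weight = np.ones((batch_size, timesteps))
per_sample = (per_timestep * sample_weight).mean(axis=1)  # reduced to shape (batch_size,)
# a ChainCRF loss, by contrast, returns shape (batch_size,) directly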

sujanucsc commented 7 years ago

Hi, I was using the previous CRF layer implementation and it was working fine. Then I switched to Keras 2.0.0 and the new CRF implementation in https://github.com/phipleg/keras/tree/crf. The new implementation produced unacceptable results. I then ran the conll2000_bi_lstm_crf.py example and made the same observation there. There seems to be an issue in the loss functions: they produce 'loss: nan' for each epoch, e.g., 8936/8936 [==============================] - 157s - loss: nan - sparse_categorical_accuracy: 0.7038

Any ideas? Am I doing something wrong?

rvadaga commented 7 years ago

@phipleg Can you share the document/tutorial you used for implementing ChainCRF module? Thanks.

sujanucsc commented 7 years ago

The above issue with the loss value is only observed with the Theano backend. TensorFlow works fine.

liyzhang commented 6 years ago

I'm new to GitHub; how can I run the setup in https://github.com/fchollet/keras/tree/bba6b521abc462261dd65883be59c94e1467b7cf?

If I use CNTK, can I still use the CRF layer?

FredRodrigues commented 6 years ago

@liyzhang

python setup.py install --force

dfalci commented 6 years ago

I am working on a sequence labeling task based on a bidirectional LSTM architecture with variable sequence lengths (I'm not padding sentences). Thus, during training, I have a lot of mini-batches, including some of size 1. @phipleg said in a previous post that "mini batches of size 1 are problematic". Does this mean that this implementation won't work in such a situation?

nreimers commented 6 years ago

@dfalci This was fixed in a later commit, now the CRF implementation works fine for mini-batches of size 1.

A hint on speeding up your approach: what I do is group sentences by length and then create mini-batches of sentences with the same length, as sketched below. If your training data is large enough and the sentences are of approximately similar lengths, you will only have a few mini-batches containing a single sentence.
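
A minimal sketch of that bucketing idea (plain Python, nothing Keras-specific assumed):

from collections import defaultdict
import random

def length_batches(sentences, batch_size):
    # group sentences by length, then cut each group into mini-batches
    buckets = defaultdict(list)
    for s in sentences:
        buckets[len(s)].append(s)
    batches = []
    for same_length in buckets.values():
        random.shuffle(same_length)
        for i in range(0, len(same_length), batch_size):
            batches.append(same_length[i:i + batch_size])
    random.shuffle(batches)  # avoid presenting lengths in a fixed order
    return batches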

enewe101 commented 6 years ago

@phipleg I'm interested in using your implementation, but am wondering if you could elaborate on what this means:

The layer is the identity function during training and applies the forward-backward algorithm during inference. For that it holds a set of trainable parameters which is accessed by a specific loss function.

What I think this means is that the second-from-last layer, the one just before the CRF, is actually trying to predict the target directly, and learns to do so based on a loss function that is distinct from the CRF's (e.g. cross-entropy). Meanwhile, the CRF learns a set of transition probabilities based on its own loss function: the log-likelihood calculated via the forward-backward algorithm.

So, training of the CRF could be decoupled from training of the rest of the network, since the CRF's parameters do not affect the loss function as seen by the rest of the network ("The layer is the identity function during training").

Have I understood correctly?

While this seems reasonable, my reading of Bidirectional LSTM-CRF Models for Sequence Tagging (Huang et al. 2015) and Neural Architectures for Named Entity Recognition (Lample et al. 2016) is that their BiLSTM-CRF implementations are trained by back-propagating the CRF's log-likelihood loss through the entire network. (I could be mistaken, of course.)

nreimers commented 6 years ago

@enewe101 I can maybe answer that.

The CRF layer is updated using back-propagation during training to learn transition probabilities. However, during training we already know the correct labels, so training a CRF or Hidden Markov Model inside a neural network is distinct from running those layers at inference time. The error function of the CRF layer is used for training, and the transitions are updated with each epoch.

You could of course also decouple the training of the network from the training of the CRF / HMM, but this is seldom done as it introduces further complexity.

The paper by Collobert et al., 'NLP (Almost) from Scratch', explains well how to add an HMM to a network and how training and inference must be modified.

In my implementation (https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf) I achieve on-par results with Huang et al., Lample et al., and Ma & Hovy for various tasks using this CRF implementation, so it appears that this CRF implementation works well.

phipleg commented 6 years ago

Hi @enewe101,

as @nreimers pointed out, the CRF layer only applies the costly inference at prediction time, and not at training time, because the target labels are already known. During training it acts as the identity but at the same time holds the parameters for the CRF loss. You need to use this loss in your model (and not some cross-entropy, as you said). Otherwise, when taking gradients, the CRF parameters won't get any updates.
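
For intuition, here is a minimal NumPy sketch of the quantity such a CRF loss computes for a single sequence (simplified: no start/end transitions, no masking; not the layer's actual backend code):

import numpy as np

def logsumexp(x, axis=0):
    m = x.max(axis=axis)
    return m + np.log(np.exp(x - m).sum(axis=axis))

def crf_neg_log_likelihood(emissions, transitions, tags):
    # emissions: (timesteps, n_classes) per-timestep scores from the network
    # transitions: (n_classes, n_classes) learned transition weights
    # tags: (timesteps,) gold label indices
    # score of the gold path: emission plus transition scores along the path
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    # forward algorithm in log space yields the partition function log Z
    alpha = emissions[0].copy()
    for t in range(1, emissions.shape[0]):
        alpha = emissions[t] + logsumexp(alpha[:, None] + transitions, axis=0)
    return logsumexp(alpha) - score  # = -log p(tags | inputs)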

enewe101 commented 6 years ago

thanks @nreimers and @phipleg !

chaxor commented 6 years ago

@phipleg I see that typically the CRF layer loss is applied to a single output layer. Is it possible to have two outputs, as in the functional API demo, while using the CRF loss from two different CRFs?

phipleg commented 6 years ago

Hi @chaxor,

have you already tried something like this?

input_for_crf1 = ...
input_for_crf2 = ...
crf1 = ChainCRF(params_for_crf1)
crf2 = ChainCRF(params_for_crf2)
out1 = crf1(input_for_crf1)
out2 = crf2(input_for_crf2)

model = Model(inp, [out1, out2])
model.compile(optimizer = ...., loss = [crf1.loss, crf2.loss])

chaxor commented 6 years ago

@phipleg Well, that was a simple fix. Thank you so much for your help!

Peydon commented 6 years ago

Hey, I just used the newest ChainCRF layer but the result is strange. The accuracy on the training set first increases and then decreases, while the accuracy on the validation set increases continuously. I am not clear why; can you explain it? @phipleg Here is the training process:

[screenshot of training log]

phipleg commented 6 years ago

Dear @KARABAER,

it is hard to say without knowing your complete model, the data, and the training code. Please give more details.