mxhofer / Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL

Keras implementation of "Few-shot Learning for Named Entity Recognition in Medical Text"
https://arxiv.org/abs/1811.05468
MIT License

One question regarding padding #4

Open Chandrak1907 opened 5 years ago

Chandrak1907 commented 5 years ago

Hi,

I see that you pad the inputs to an equal length of 52. However, it seems that padding is applied only to the character inputs, not to the words.

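 # Zero-pads the character indices of every word to a fixed length of 52
 def padding(Sentences):
     # maxlen tracks the longest word but is not used below; 52 is hard-coded
     maxlen = 52
     for sentence in Sentences:
         char = sentence[2]  # character indices, one list per word
         for x in char:
             maxlen = max(maxlen, len(x))
     for i, sentence in enumerate(Sentences):
         # padding='post' appends the zeros after the real characters
         Sentences[i][2] = pad_sequences(Sentences[i][2], 52, padding='post')
     return Sentences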

Each entry of Sentences is built as follows:

         dataset.append([wordIndices, caseIndices, charIndices, labelIndices]) 
     return dataset

I also see that you have made batches of inputs with words of equal length. Is this the correct approach? Could you please let me know?

mxhofer commented 5 years ago

Hello! That's correct. The convolutional neural network (CNN) processes the padded character vectors, which must all have the same length. Alternatively, one could split the CNN input into equal-size batches (e.g. here). Each input batch to the bi-directional LSTM has the same length, which depends on how many words there are in a document.
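The two mechanisms described in this reply can be sketched in plain Python. This is only an illustration under assumptions, not the repository's code: `pad_chars` is a hypothetical stand-in for the `pad_sequences(..., 52, padding='post')` call quoted in the question, `group_by_word_count` is a hypothetical helper that groups sentences by their number of words, and the toy data is invented.

```python
from collections import defaultdict


def pad_chars(char_indices, maxlen=52, pad_value=0):
    """Pad each word's character indices at the end so every word has maxlen entries.

    Hypothetical pure-Python stand-in for pad_sequences(..., maxlen, padding='post').
    Words longer than maxlen keep only their last maxlen characters.
    """
    padded = []
    for word in char_indices:
        kept = word[-maxlen:]
        padded.append(kept + [pad_value] * (maxlen - len(kept)))
    return padded


def group_by_word_count(sentences):
    """Group sentences by number of words, so each group needs no word-level padding."""
    groups = defaultdict(list)
    for sentence in sentences:
        word_indices = sentence[0]
        groups[len(word_indices)].append(sentence)
    return dict(groups)


# Invented toy data in the [wordIndices, caseIndices, charIndices, labelIndices] layout.
sentences = [
    [[4, 7], [1, 0], [[5, 6, 7], [8]], [0, 2]],
    [[9, 3], [0, 0], [[2, 2], [3, 4, 5, 6]], [1, 0]],
    [[6], [1], [[9, 9, 9, 9, 9]], [0]],
]

# Character level: every word becomes a vector of 52 indices for the CNN.
for sentence in sentences:
    sentence[2] = pad_chars(sentence[2])

# Word level: sentences are not padded, only grouped by their word count.
groups = group_by_word_count(sentences)
for word_count, members in sorted(groups.items()):
    print(word_count, "word(s):", len(members), "sentence(s)")
```

Grouping by word count is one way to give the LSTM equal-length batches without padding the word, case, or label sequences; padding them to a common length would be the usual alternative.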

Chandrak1907 commented 5 years ago

Thank you for responding. One follow-up question: what was the rationale behind padding characters to a maximum length of 52? There are 26 upper-case letters, 26 lower-case letters, and other punctuation characters. Could you please let me know?

mxhofer commented 5 years ago

The maximum length was chosen after analyzing word lengths in the documents, such that no words are cut off.
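Read this way, 52 is a sequence length (characters per word), not the size of the character vocabulary. A minimal sketch of such a word-length analysis follows; it assumes the unpadded `[wordIndices, caseIndices, charIndices, labelIndices]` layout quoted earlier in the thread, and both the helper name `longest_word_length` and the toy data are hypothetical.

```python
def longest_word_length(sentences):
    """Return the largest number of characters found in any word of the data set."""
    return max(
        (len(chars) for sentence in sentences for chars in sentence[2]),
        default=0,
    )


# Invented, not yet padded toy data: sentence[2] holds one index list per word.
sentences = [
    [[4, 7], [1, 0], [[5, 6, 7], [8]], [0, 2]],
    [[9], [0], [[2, 3, 4, 5, 6, 7]], [1]],
]

needed = longest_word_length(sentences)
chosen_maxlen = 52  # the fixed length used in the thread

# If needed <= chosen_maxlen, post-padding to chosen_maxlen cuts off no word.
print("longest word:", needed, "characters; fits:", needed <= chosen_maxlen)
```

The check has to run before padding, since afterwards every word has the same length.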

Chandrak1907 commented 5 years ago

There is some confusion. In my understanding, padding is applied only to characters, not to words.

mxhofer commented 5 years ago

Padding is indeed applied to characters. For example, the padded character-level input below is for the word "RECORD". The output of the padding(Sentences) function is a list of documents, each of which is a list of words, cases, characters and labels (see the output of the createMatrices(sentences, word2Idx, label2Idx, case2Idx, char2Idx) function).

[Screenshot: padded character-level input for the word "RECORD"]
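For readers without the screenshot, the shape of such an input can be reproduced with a small sketch. The `char2idx` mapping below is hypothetical (the repository's real `char2Idx` may assign different indices and cover more characters), and `encode_word` is an illustrative helper rather than a function from the code base.

```python
import string

# Hypothetical character vocabulary: index 0 is reserved for padding,
# index 1 for characters that are not in the vocabulary.
char2idx = {"PADDING": 0, "UNKNOWN": 1}
for ch in string.ascii_letters + string.digits:
    char2idx[ch] = len(char2idx)


def encode_word(word, maxlen=52):
    """Map a word to character indices and post-pad with zeros up to maxlen.

    Simplification: words longer than maxlen are cut at the end.
    """
    indices = [char2idx.get(ch, char2idx["UNKNOWN"]) for ch in word][:maxlen]
    return indices + [char2idx["PADDING"]] * (maxlen - len(indices))


padded = encode_word("RECORD")
print(padded[:8])   # six character indices followed by the first zeros
print(len(padded))  # total length after padding
```

Only the six leading positions carry information about "RECORD"; the remaining positions are zeros that bring the word up to the fixed length the CNN expects.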