Open Chandrak1907 opened 5 years ago

Hi,

I see that you are padding the inputs to an equal length of 52, but it seems the padding is applied only to the character inputs, not to the words. `Sentences` contains the following:

I also see that you have made batches of inputs with words of equal length. Is this the correct approach? Could you please let me know?
Hello! That's correct. This is because the convolutional neural network (CNN) processes padded character vectors of equal length. Alternatively, one could split the CNN input into equal-size batches (e.g. here). Each batch fed to the bi-directional LSTM has a uniform length, which depends on how many words the documents in that batch contain.
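For illustration, the two mechanisms could be sketched as follows (a minimal sketch with hypothetical helper names, not the repo's actual code): characters are padded to one fixed length so the CNN always sees equal-size inputs, while sentences are grouped by word count so each LSTM batch has one uniform sequence length.

```python
MAX_CHAR_LEN = 52  # fixed character length per word (see the discussion below)

def pad_chars(char_ids, max_len=MAX_CHAR_LEN):
    """Right-pad (or truncate) a word's character-id list to max_len,
    using 0 as the padding index."""
    return char_ids[:max_len] + [0] * (max_len - len(char_ids))

def batches_by_length(sentences):
    """Group sentences by word count so every batch fed to the LSTM
    has one uniform sequence length and needs no word-level padding."""
    buckets = {}
    for sent in sentences:
        buckets.setdefault(len(sent), []).append(sent)
    for length, group in sorted(buckets.items()):
        yield length, group
```

Because batching is done by sentence length, word-level padding never becomes necessary, which matches the behaviour asked about above.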
Thank you for responding. One follow-up question: what was the rationale behind padding characters to a maximum length of 52? There are 26 upper-case letters, 26 lower-case letters, and various punctuation characters. Could you please let me know?
The maximum length was chosen after analyzing the word lengths in the documents, such that no word is cut off. Note that 52 is the maximum number of characters in a single word, not the size of the character vocabulary.
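For illustration, finding such a maximum might look like this (a sketch with toy data; `sentences` stands in for the repo's list-of-[token, label] sentence structure):

```python
# Scan every token in the training documents and keep the longest,
# so no word needs to be truncated by the character padding.
sentences = [
    [["RECORD", "O"], ["pneumonoultramicroscopicsilicovolcanoconiosis", "O"]],
    [["hello", "O"], ["world", "O"]],
]
max_word_len = max(len(token) for sent in sentences for token, _ in sent)
print(max_word_len)  # -> 45 for this toy data; 52 for the corpus discussed here
```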
There is some confusion: in my understanding, padding is applied only to the characters, not to the words.

Padding is indeed applied to the characters. For example, the padded character-level input below is for the word "RECORD". The output of the `padding(Sentences)` function is a list of documents, each a list of words with their cases, characters and labels (see the output of the `createMatrices(sentences, word2Idx, label2Idx, case2Idx, char2Idx)` function).
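In case it helps, here is an illustrative sketch of what that padding does for "RECORD" (the `char2Idx` below is a toy mapping built for the demo; the repo constructs its own):

```python
import string

# Toy character vocabulary; index 0 is the PADDING symbol, as the
# char2Idx argument of createMatrices suggests (an assumption here,
# not verbatim from the repo).
char2Idx = {"PADDING": 0, "UNKNOWN": 1}
for c in string.ascii_letters + string.digits + string.punctuation + " ":
    char2Idx[c] = len(char2Idx)

word = "RECORD"
char_ids = [char2Idx.get(c, char2Idx["UNKNOWN"]) for c in word]  # 6 ids
padded = char_ids + [char2Idx["PADDING"]] * (52 - len(char_ids))
assert len(padded) == 52  # every word's character input now has length 52
```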