Thank you for your interest in my work.
To be completely honest, many of the choices here were imported directly from the code used in ELMo (from which the CharacterCNN module is basically taken), so there may be some weird or suboptimal choices here and there.
Also, it's been a while so I'm not sure anymore, but I think the +1 is there to reserve the all-zeros vector for the padding character.
In any case, I plan to work on a more recent version of this codebase soon, which will hopefully be clearer.
Maybe this will be more useful, @IstiaqAnsari:
A word is converted into a sequence of bytes. These bytes go from 0 to 255 in UTF-8. However, we also want to be able to pad shorter words up to a maximum word length. For that, we use character index 260 (since 256-259 are reserved). To avoid confusion between an empty token (pad) and a token whose single character has index 0, we also choose to reserve the all-zeros vector for empty padding tokens. As a result, the character with index 0 cannot actually use the value 0, so we shift everything by +1.
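Here's a rough sketch of that logic (in the spirit of the ELMo mapper; the constants and names are illustrative, not the exact code in character_cnn.py):

```python
from typing import List

MAX_WORD_LENGTH = 50
BOW_CHARACTER = 258  # beginning-of-word marker (one of the reserved 256-259)
EOW_CHARACTER = 259  # end-of-word marker
PAD_CHARACTER = 260  # intra-word padding for words shorter than the maximum

def convert_word_to_char_ids(word: str) -> List[int]:
    # Fill the whole slot with the intra-word padding character first.
    char_ids = [PAD_CHARACTER] * MAX_WORD_LENGTH
    char_ids[0] = BOW_CHARACTER
    # UTF-8 bytes are in 0..255; truncate to leave room for BOW/EOW.
    word_bytes = word.encode("utf-8", "ignore")[: MAX_WORD_LENGTH - 2]
    for k, byte_id in enumerate(word_bytes, start=1):
        char_ids[k] = byte_id
    char_ids[len(word_bytes) + 1] = EOW_CHARACTER
    # Shift by +1 so id 0 is never produced for a real word:
    # the all-zeros vector stays reserved for padding *words*.
    return [c + 1 for c in char_ids]
```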
I think it would have been possible to just have a "normal" token starting with BOW/EOW and a single special character index to designate the padding token, same as what is done with the MASK/CLS/SEP tokens.
It's a little bit convoluted, but I based this on the implementation in ELMo at the time :) Hope this clears things up.
Also, I guess there is a benefit to having a zero vector for the word padding, as it can be mapped to an all-zeros embedding in the character embedding matrix :)
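For instance, this is directly supported by PyTorch (the sizes below are just an example: 256 byte values + 4 reserved + 1 pad character, all shifted by +1):

```python
import torch.nn as nn

NUM_CHARACTER_IDS = 262  # ids 0..261 after the +1 shift; 0 is the word-padding id
CHAR_EMBEDDING_DIM = 16  # illustrative dimension

# padding_idx=0 initializes the embedding for id 0 to all zeros and excludes it
# from gradient updates, so padded words contribute nothing to training.
char_embeddings = nn.Embedding(NUM_CHARACTER_IDS, CHAR_EMBEDDING_DIM, padding_idx=0)
```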
Hi @helboukkouri. The maximum number of characters in a word is set to 50, so a word with 5 characters gets padded up to 50, using the value 260 for each padding character. Then, to make every sentence in a batch the same length, we also pad with words: say, to pad a sentence of 5 words up to 8 words, three PAD tokens are added. Each of these PAD tokens is also 50 characters long, but each of its characters gets a padding value of ZERO. Why are you using two different types of padding?
Another thing: after converting each word to ids, you add 1 to each id (in the file character-bert/utils/character_cnn.py, line 125, in the function def convert_word_to_char_ids(self, word: str) -> List[int], where the comment says # +1 one for masking).
What is the reason for adding 1? Thanks in advance.
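To make the two paddings concrete, here is a toy example of what I mean (using "hello" and the ids as I understand them):

```python
# Character-level padding inside a real word (after the +1 shift):
hello_ids = [258 + 1]                                  # beginning-of-word marker
hello_ids += [b + 1 for b in "hello".encode("utf-8")]  # UTF-8 bytes, shifted
hello_ids += [259 + 1]                                 # end-of-word marker
hello_ids += [260 + 1] * (50 - len(hello_ids))         # character padding, 260 -> 261

# Word-level padding for whole PAD tokens: a plain all-zeros vector.
pad_token_ids = [0] * 50
```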