helboukkouri / character-bert

Main repository for "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters"
Apache License 2.0

Word-level padding vs Character-level padding #18

Closed IstiaqAnsari closed 1 year ago

IstiaqAnsari commented 2 years ago

Hi @helboukkouri,

The maximum number of characters in a word is set to 50, so a word with 5 characters gets padded up to 50, and the value used to pad each character position is 260. Then, to make every sentence in a batch the same length, we pad with whole words as well: say a sentence of length 5 (5 words) needs to be padded to 8, then three PAD tokens are added. Each of these PAD tokens is also 50 characters long, but here every character position gets a padding value of ZERO. Why are two different types of padding used? (See the sketch below for what I mean.)

Another thing: after converting each word to ids, you add 1 to each id (in the file character-bert/utils/character_cnn.py, line 125, in the function def convert_word_to_char_ids(self, word: str) -> List[int]:, where the comment says # +1 one for masking). What is the reason for adding 1?

Thanks in advance.
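To illustrate both points, here is a minimal sketch of what I'm seeing (using "hello" as the example word; values shown after the +1 shift):

```python
MAX_WORD_LENGTH = 50

# A real 5-letter word: [BOW(258), 5 UTF-8 bytes, EOW(259)], padded up to
# 50 positions with the padding character 260, then everything shifted by +1:
hello_ids = [258 + 1] + [b + 1 for b in b"hello"] + [259 + 1]
hello_ids += [260 + 1] * (MAX_WORD_LENGTH - len(hello_ids))

# A PAD *word* (added to make sentences in a batch the same length)
# is all zeros instead:
pad_word_ids = [0] * MAX_WORD_LENGTH
```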

helboukkouri commented 1 year ago

Thank you for your interest in my work.

To be completely honest, a lot of this was imported directly from the ELMo codebase (which is essentially where the CharacterCNN module comes from), so there may be some weird or suboptimal choices here and there.

Also, it's been a while so I'm not sure anymore, but I think the +1 is basically there to reserve the all-zeros vector for padding.

In any case, I plan to work on a newer version of this soon, which will hopefully be clearer.

helboukkouri commented 1 year ago

Maybe this will be more useful @IstiaqAnsari

A word is converted into a sequence of bytes, which go from 0 to 255 in UTF-8. However, we also want to be able to pad shorter words up to a maximum word length. For that, we use character index 260 (indices 256-259 are already reserved for special characters such as BOW/EOW). To avoid confusion between an empty (pad) token and a token consisting of the single character with index 0, we also choose to reserve the all-zeros vector for empty padding tokens. As a result, the character with index 0 cannot actually use the value 0, so we shift everything by +1.
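In code, the conversion looks roughly like this (a minimal sketch of the logic in convert_word_to_char_ids; the constant names are mine):

```python
MAX_WORD_LENGTH = 50
BEGINNING_OF_WORD = 258   # 256/257 are reserved for beginning/end of sentence
END_OF_WORD = 259
PADDING_CHARACTER = 260   # pads short words up to MAX_WORD_LENGTH

def convert_word_to_char_ids(word: str) -> list:
    """UTF-8 bytes wrapped in BOW/EOW, padded to a fixed length, shifted by +1."""
    word_bytes = word.encode("utf-8", "ignore")[: MAX_WORD_LENGTH - 2]
    char_ids = [PADDING_CHARACTER] * MAX_WORD_LENGTH
    char_ids[0] = BEGINNING_OF_WORD
    for k, byte in enumerate(word_bytes, start=1):
        char_ids[k] = byte
    char_ids[len(word_bytes) + 1] = END_OF_WORD
    # The +1 shift: no real character ever maps to 0, so the all-zeros
    # vector unambiguously means "padding word".
    return [c + 1 for c in char_ids]
```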

I think it would have been possible to just have a "normal" token, starting with BOW/EOW, that uses a single special character index to designate the padding token, the same as what is done for the MASK/CLS/SEP tokens.

It's a little bit convoluted, but I based this on the ELMo implementation at the time :) Hope this clears things up.

helboukkouri commented 1 year ago

Also, I guess there is a benefit to having a zero vector for the word padding, as it can be initialized as an all-zeros row in the character embedding matrix :)
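For instance (a minimal PyTorch sketch, not the exact code from the repo; the sizes are my assumption based on the ELMo convention):

```python
import torch
import torch.nn as nn

# Assumed sizes following the ELMo convention: ids 1-261 cover the 256 byte
# values plus 5 special indices (256-260) after the +1 shift, and id 0 is
# kept for padding. padding_idx=0 initializes row 0 to all zeros and keeps
# its gradient at zero, so padding words contribute nothing to training.
char_embedding = nn.Embedding(num_embeddings=262, embedding_dim=16, padding_idx=0)

pad_word = torch.zeros(50, dtype=torch.long)          # a padding word
print(char_embedding(pad_word).abs().sum().item())    # 0.0
```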