helboukkouri / character-bert

Main repository for "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters"

Which layer should I use if I only want to embed char #5

Closed zmddzf closed 3 years ago

zmddzf commented 3 years ago

Hi, I want to use your CharacterBERT in my research, but I'm not sure which layer I should choose if I only want to do character-level embedding. Should I just use the CharacterCNN? Thank you!

helboukkouri commented 3 years ago

Hi @zmddzf and sorry for the delay.

It depends on what you mean by character-level embeddings. If you mean token/word embeddings that are constructed using character-level information, then you can use the final output layer directly. If you need these token/word embeddings to be context-independent, then you can use the output of the CharacterCNN (before adding positional and segment embeddings).
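To make that concrete, here is a minimal sketch along the lines of the repo's README usage. The checkpoint paths are just examples, and the `model.embeddings.word_embeddings` attribute in option 2 is an assumption about where the CharacterCNN module lives, so please double-check it against `modeling/character_bert.py`:

```python
from transformers import BertTokenizer
from modeling.character_bert import CharacterBertModel
from utils.character_cnn import CharacterIndexer

# Word-level tokenization, as in the repo's README example
tokenizer = BertTokenizer.from_pretrained('./pretrained-models/bert-base-uncased/')
tokens = ['[CLS]', *tokenizer.basic_tokenizer.tokenize("Hello World!"), '[SEP]']

# Convert each token into a padded sequence of character ids
indexer = CharacterIndexer()
batch_ids = indexer.as_padded_tensor([tokens])  # (1, seq_len, max_chars_per_token)

# Load a pre-trained CharacterBERT checkpoint (example path)
model = CharacterBertModel.from_pretrained('./pretrained-models/general_character_bert/')

# 1) Contextual token embeddings: final output layer
sequence_output, _ = model(batch_ids)           # (1, seq_len, hidden_size)

# 2) Context-independent token embeddings: CharacterCNN output only,
#    i.e. before positional/segment embeddings and the Transformer layers.
#    Assumed attribute path -- the CharacterCNN replaces BERT's wordpiece
#    embedding matrix inside the embeddings module.
static_output = model.embeddings.word_embeddings(batch_ids)
```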

If you want the sequence of character embeddings for each token/word, then you will need to dig a little deeper inside the CharacterCNN.
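For per-character vectors, a rough (and assumption-heavy) sketch of what that digging might look like is below. Both the `model.embeddings.word_embeddings` path and the `_char_embedding_weights` name are guesses based on the ELMo-style character encoder, so check the repo's CharacterCNN code for the actual attribute names:

```python
import torch
from modeling.character_bert import CharacterBertModel
from utils.character_cnn import CharacterIndexer

indexer = CharacterIndexer()
model = CharacterBertModel.from_pretrained('./pretrained-models/general_character_bert/')  # example path

# Character ids for the single token "ann": shape (max_chars_per_token,)
char_ids = indexer.as_padded_tensor([["ann"]])[0, 0]

# Assumed: the CharacterCNN sits where BERT's wordpiece embedding matrix used to be,
# and exposes an ELMo-style character embedding weight. Verify both names in the repo.
character_cnn = model.embeddings.word_embeddings
with torch.no_grad():
    per_char_vectors = torch.nn.functional.embedding(
        char_ids, character_cnn._char_embedding_weights
    )  # (max_chars_per_token, 16) -- static, context-independent character vectors
```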

Let me know if this answers your question. 😊

Cheers!

zmddzf commented 3 years ago

Thanks for your kind reply! My research project needs character vectors rather than word vectors. For example, "ann" should be represented as [V_a, V_n, V_n], where each vector corresponds to a single character. It seems the CharacterCNN can only represent a word as a single vector. If I only take the embedding layer of the CharacterCNN, the character embeddings are static, and a 16-dim embedding is not very powerful. I'm also not sure whether, if I treat each character as a token and use the CharacterCNN, the output vectors can reflect character combination patterns (like "ea", "ing").

Thank you!

helboukkouri commented 3 years ago

> I'm also not sure whether, if I treat each character as a token and use the CharacterCNN, the output vectors can reflect character combination patterns (like "ea", "ing").

What you could do is tokenize your text at the character level (e.g. "Hello!" becomes [h, e, l, l, o, !]) and then use CharacterBERT as is. By doing that, it will produce contextualized character-level representations, but the issue will probably be that the model wasn't pre-trained that way. Therefore, the quality of the output representations might not be satisfactory.
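Roughly, that would look like the sketch below (same caveat: the model wasn't pre-trained on single-character tokens, and the checkpoint path is just an example):

```python
from modeling.character_bert import CharacterBertModel
from utils.character_cnn import CharacterIndexer

indexer = CharacterIndexer()
model = CharacterBertModel.from_pretrained('./pretrained-models/general_character_bert/')  # example path

# Character-level "tokenization": each character becomes its own token
text = "Hello!"
tokens = ['[CLS]', *list(text.lower()), '[SEP]']  # ['[CLS]', 'h', 'e', 'l', 'l', 'o', '!', '[SEP]']

# Feed the single-character tokens to CharacterBERT exactly like word-level tokens
batch_ids = indexer.as_padded_tensor([tokens])
contextual_char_embeddings, _ = model(batch_ids)  # (1, len(tokens), hidden_size)
```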

But then, to be honest, you could do the same with BERT, as probably all characters appear in the model's WordPiece vocabulary 😊