jina-ai / clip-as-service

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
https://clip-as-service.jina.ai

How to get single word embeddings? #343

Closed. Janinanu closed this issue 5 years ago.

Janinanu commented 5 years ago

I have a list of individual tokens (like a vocabulary) for which I want to extract BERT embeddings. I used this command to start the BERT service:

bert-serving-start -model_dir multi_cased_L-12_H-768_A-12/ -pooling_strategy=NONE -max_seq_len=4 -num_worker=3

The crucial part, I assume, is max_seq_len. If I set max_seq_len=1, it tells me that 1 is an invalid int value, must be >3 (account for maximum three special symbols in BERT model) or NONE. However, if I set max_seq_len=None, the service does not even start, and if I set max_seq_len=4, I get an output of dimension vocab_size x 4 x embedding_dimension, even though I simply want an output of dimension vocab_size x embedding_dimension.

I am wondering: what is meant by the "three special symbols in BERT model"? Does it refer to the distinction between token embeddings, segment embeddings and position embeddings in the BERT model? In the output I get, the first 3 vectors for any token are non-zero and the 4th vector is all zeros. Would the actual word embedding therefore simply be the sum over these three, as described in the original BERT paper?
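For reference, here is a minimal client-side sketch of what I am doing, using the documented bert_serving.client API; the shapes assume the server flags above, and the example words are arbitrary:

```python
from bert_serving.client import BertClient

# Server assumed started with -pooling_strategy=NONE -max_seq_len=4,
# as in the command above.
bc = BertClient()

vocab = ['apple', 'banana']   # each single word treated as one "sentence"
vecs = bc.encode(vocab)

# With pooling disabled, the server returns one vector per sequence position:
# vecs.shape == (len(vocab), 4, 768), i.e. vocab_size x 4 x embedding_dimension.
print(vecs.shape)
```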

PeterisP commented 5 years ago

For token-level embeddings, see https://github.com/hanxiao/bert-as-service#getting-elmo-like-contextual-word-embedding instead of attempting to use 1-word "sentences".
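The linked README section demonstrates roughly the following pattern (a sketch; the call and shapes follow that section's example, which assumes a server started with -pooling_strategy NONE and -max_seq_len 25):

```python
from bert_serving.client import BertClient

# Server assumed started with:
# bert-serving-start -model_dir ... -pooling_strategy NONE -max_seq_len 25
bc = BertClient()

vec = bc.encode(['hey you', 'whats up?'])   # shape: (2, 25, 768)

# Per the README: vec[0][0] is the [CLS] vector, vec[0][1] the contextual
# embedding of "hey", vec[0][2] that of "you"; trailing positions are padding.
hey_embedding = vec[0][1]
```

This also explains the observation in the original question: with a single-word input and max_seq_len=4, the three non-zero positions are [CLS], the word itself, and [SEP] (the special symbols the error message refers to), and the fourth position is zero padding.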

Janinanu commented 5 years ago

Okay, thanks. I now understand that it probably does not make much sense to extract single-word embeddings from the BERT model.