jina-ai / clip-as-service

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
https://clip-as-service.jina.ai

[Enhancement] - Return the exact tokens from the vocabulary the sentence is parsed into #169

rexdouglass closed this issue 5 years ago

rexdouglass commented 5 years ago

Two quick, related enhancement ideas:

1) Retrieve the actual text of the tokens alongside their embeddings. It is currently a little hard to debug why a string was not classified correctly when I cannot easily tell how it was tokenized. Apologies if that is already possible; the documentation under "Getting ELMo-like contextual word embedding" suggests it isn't, and that you just have to know beforehand how your sentence will be tokenized.

2) Ability to also zero-mask [CLS] and [SEP]. The recent ability to zero-mask null tokens was much appreciated, and for some applications we also want to throw away these end tokens and keep just the word tokens. Currently I have to do that client side, which requires awkwardly counting the non-zero embeddings to get the sequence length (tokens + 2) and then masking the first and last of those rows (a sketch of that workaround is shown below).

Thanks for a fantastic package.
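
A minimal sketch of the client-side workaround from point 2, assuming the server is started with -pooling_strategy NONE so that encode() returns per-token embeddings; the sentences are only illustrative.

```python
import numpy as np
from bert_serving.client import BertClient

# Assumption: the server runs with -pooling_strategy NONE, so encode()
# returns per-token embeddings of shape [num_sentences, max_seq_len, dim].
bc = BertClient()
vecs = bc.encode(['hey you', 'whats up?'])

masked = vecs.copy()
for i, sent_vec in enumerate(vecs):
    # Non-zero rows are the real tokens plus [CLS] and [SEP];
    # padding rows are already zero.
    seq_len = int((np.abs(sent_vec).sum(axis=1) > 0).sum())
    # Zero the first ([CLS]) and last ([SEP]) non-padding rows,
    # keeping only the word-token embeddings.
    masked[i, 0, :] = 0
    masked[i, seq_len - 1, :] = 0
```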


hanxiao commented 5 years ago

Both are very good points and I will implement them in the next version. Thanks!

hanxiao commented 5 years ago

fyi, these two issues are fixed in #171 and the new features are available since 1.6.6. Please run the following to upgrade:

pip install -U bert-serving-server bert-serving-client

1. To get the tokenization information, you can add -verbose when starting the server. This enables logging of the tokenization, masking, etc. For now, this information is not available on the client side.
2. You can add -mask_cls_sep when starting the server. This sets the embeddings of [CLS] and [SEP] to zero before pooling. If your -pooling_strategy is one of {CLS_TOKEN, FIRST_TOKEN, SEP_TOKEN, LAST_TOKEN}, then the embeddings of [CLS] and [SEP] are preserved. A minimal sketch of starting the server with both flags is shown below.
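
A minimal server-side sketch showing both flags together, assuming the get_args_parser / BertServer Python entry point from the project README; the model path is a placeholder.

```python
from bert_serving.server import BertServer
from bert_serving.server.helper import get_args_parser

# Placeholder model path: point this at your own downloaded BERT checkpoint.
args = get_args_parser().parse_args([
    '-model_dir', '/tmp/english_L-12_H-768_A-12/',
    '-verbose',       # log tokenization, masking, etc. on the server side
    '-mask_cls_sep',  # zero the [CLS]/[SEP] embeddings before pooling
])
server = BertServer(args)
server.start()
```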
hanxiao commented 5 years ago

fyi, this feature (retrieving the tokenization from the server) is implemented in #226 and is available since 1.8.0. Please run

pip install -U bert-serving-server bert-serving-client

to update. Usage is covered in the tutorial https://github.com/hanxiao/bert-as-service#using-your-own-tokenizer and in the documentation: https://bert-as-service.readthedocs.io/
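
A minimal client-side sketch of retrieving the tokenization, assuming the show_tokens argument on BertClient.encode() that the linked tutorial describes; the sentences and the printed tokens are only illustrative.

```python
from bert_serving.client import BertClient

bc = BertClient()
# Assumption (per the linked tutorial): since 1.8.0, show_tokens=True returns
# the token strings the server actually used, alongside the embeddings.
vecs, tokens = bc.encode(['hey you', 'whats up?'], show_tokens=True)

print(tokens)
# e.g. [['[CLS]', 'hey', 'you', '[SEP]'],
#       ['[CLS]', 'what', '##s', 'up', '?', '[SEP]']]
```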