Closed — rexdouglass closed this issue 5 years ago
Both are very good points and I will implement them in the next version. Thanks!
fyi, these two issues are fixed in #171 and the new feature is available since 1.6.6. Please do the following for the upgrade:
pip install -U bert-serving-server bert-serving-client
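For example, a server start that combines the two new flags might look like the following sketch; the model directory and worker count are placeholders for your own setup:

```shell
# Start the server with verbose logging and [CLS]/[SEP] masking enabled.
# /path/to/uncased_L-12_H-768_A-12 is a placeholder for your BERT checkpoint.
bert-serving-start -model_dir /path/to/uncased_L-12_H-768_A-12 -num_worker 2 -verbose -mask_cls_sep
```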
Pass -verbose when starting the server. This enables logging for tokenization, masking, etc. For now, such information is not available on the client side.

Pass -mask_cls_sep when starting the server. This will set the embeddings of [CLS] and [SEP] to zero before pooling. If your -pooling_strategy is one of {CLS_TOKEN, FIRST_TOKEN, SEP_TOKEN, LAST_TOKEN}
, then the embeddings of [CLS] and [SEP] are preserved.

fyi, this feature (retrieving the tokenization from the server) is implemented in #226 and has been available since 1.8.0. Please do
pip install -U bert-serving-server bert-serving-client
for the update. Usage can be found in the tutorial https://github.com/hanxiao/bert-as-service#using-your-own-tokenizer or documentation: https://bert-as-service.readthedocs.io/
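A minimal usage sketch of the tokenization-retrieval feature, assuming a bert-serving-server (>= 1.8.0) is already running locally; the import guard is only so the snippet degrades gracefully when the client library is not installed:

```python
# Sketch: fetch embeddings together with the server-side tokenization,
# so you can see exactly how each string was split into word pieces.
try:
    from bert_serving.client import BertClient
except ImportError:
    BertClient = None  # bert-serving-client not installed

if BertClient is not None:
    bc = BertClient()  # assumes a bert-serving-server running on localhost
    # show_tokens=True returns (embeddings, tokens) instead of embeddings alone
    vecs, tokens = bc.encode(['hello world!'], show_tokens=True)
    print(tokens[0])  # e.g. ['[CLS]', 'hello', 'world', '!', '[SEP]']
```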
Two quick related enhancement ideas:

1) Retrieve the actual text of the tokens alongside their embeddings. It's currently a little hard to debug why a string didn't get classified correctly when I can't easily tell how it got tokenized. Apologies if that's already possible; the documentation under "Getting ELMo-like contextual word embedding" suggests it isn't, and that you just have to know beforehand how your sentence will be tokenized.
2) Ability to also zero-mask the [CLS] and [SEP] tokens. The recent ability to zero-mask null tokens was much appreciated, and for some applications we also want to throw away these end tokens and keep just the word tokens. Currently we have to do that client-side, which requires awkwardly counting off the non-zero embeddings to get a count of tokens + 2, and then masking those rows.

Thanks for a fantastic package.
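The client-side workaround described above (counting non-zero rows to locate [CLS] and [SEP], then zeroing them) can be sketched with NumPy. The function name and toy shapes below are mine, and the sketch assumes padding positions come back as all-zero rows, as in the server's ELMo-like per-token output:

```python
import numpy as np

def mask_cls_sep(emb):
    """Zero out the [CLS] (first) and [SEP] (last non-padding) rows of one
    sequence's token embeddings.

    emb has shape (max_seq_len, dim); padding rows are assumed all-zero.
    """
    emb = emb.copy()
    # number of non-padding rows = word tokens + 2 ([CLS] and [SEP])
    n_nonzero = int((np.abs(emb).sum(axis=1) > 0).sum())
    emb[0] = 0.0              # [CLS]
    emb[n_nonzero - 1] = 0.0  # [SEP]
    return emb

# Toy example: 3 word tokens plus [CLS]/[SEP], padded to length 8.
x = np.ones((8, 4))
x[5:] = 0.0  # padding rows
y = mask_cls_sep(x)
```

After masking, only the three word-token rows (indices 1 to 3) remain non-zero.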