jina-ai / clip-as-service

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
https://clip-as-service.jina.ai

Feature request - return tokens and multiple strategies #225

Open ironflood opened 5 years ago

ironflood commented 5 years ago

Hi,

First, I wanted to thank you for sharing this great tool; it has been amazingly reliable so far and we have already embedded ~1 billion sentences with it.

There are a few things that could really be helpful (in order of importance):

1) Allow the server (via a start parameter?) to return the tokens along with their contextualized embeddings when the pooling strategy is NONE. The NONE (ELMo-like) strategy has a pretty limited use case if we don't get the tokens back, and keeping a vocab file + tokenizer in sync on the client side isn't ideal. A response shaped like the following after encode() would be more than useful:

{
  'tokens': [
      [ "[CLS]", "The", "horse", "jump", "##ed", "over", "the", "fe", "##nce", ".", "[SEP]" ],
      [ "[CLS]", "I", "destroyed", "there", ".", "[SEP]" ] ],
  'emb': ndarray(...)
}

2) Be able to launch a server with several pooling strategies. At the moment, if we want to serve different strategies we need to launch several servers; while that's not a problem for large-scale deployments, it can be tricky when resources are limited. I was wondering how difficult it would be to allow the server to return embeddings through several predefined strategies, for example [ NONE -1 -2 -3 -4, REDUCE_MEAN -2, REDUCE_MAX -1 -2 ]?

Looking forward to your thoughts on these suggestions.

hanxiao commented 5 years ago

Both are valid and reasonable requests.

Regarding the first one, the major challenge would be the API redesign. Let me see what I can do.

For the second, are you suggesting moving/adding pooling_strategy to the client-side API? Note that the number of possible pooling-strategy combinations can be quite large, and they would all need to be frozen into one multi-output graph (each output emitting one pooling strategy). The client could then select which output to actually use when calling encode(). I expect the API redesign (both server and client) and the frozen-graph optimization would be the major work.
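
Just to make the question concrete, client-side selection could look something like this. This is purely a sketch with a hypothetical pooling keyword; nothing like it exists in the current API:

from bert_serving.client import BertClient

bc = BertClient()

# Hypothetical: the server freezes one multi-output graph and the client
# names the output (i.e. the pooling strategy) it wants per request.
vec_mean = bc.encode(['hello world'], pooling='REDUCE_MEAN_-2')    # hypothetical kwarg
vec_toks = bc.encode(['hello world'], pooling='NONE_-1_-2_-3_-4')  # hypothetical kwarg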

hanxiao commented 5 years ago

Related #224 #144

ironflood commented 5 years ago

Thanks for the quick response!

About suggestion 1): I agree it changes the API quite a lot. Of course there are different ways to do it. Maybe the easiest way to avoid breaking legacy code and keep the same number of methods would be to add a server start parameter that enables returning an object with both embeddings & tokens?

About suggestion 2): no, I wasn't suggesting moving the pooling strategy to the client side; the client side would stay the same. I was only suggesting being able to start the server with mixed pooling strategies defined at server start. The server would then always return the same sequence of defined pooling strategies instead of only one type of embedding (as it does right now), and the response object could contain a list of embeddings in the order they were declared when starting the server. So if we combine 1) and 2), a possible message returned from a server started with 4 different pooling strategies would be:

{
  'tokens': [
      [ "[CLS]", "The", "horse", "jump", "##ed", "over", "the", "fe", "##nce", ".", "[SEP]" ],
      [ "[CLS]", "I", "destroyed", "there", ".", "[SEP]" ] ],
  'embs': [
      ndarray(...),         # shape (batch, sent_num, tok_emb)
      ndarray(...),         # shape (batch, sent_emb)
      ndarray(...),         # shape (batch, sent_emb)
      ndarray(...)          # shape (batch, sent_emb)
  ]
}
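
Consuming such a response on the client could then look roughly like this (a self-contained sketch with dummy data standing in for the hypothetical response above):

import numpy as np

# Dummy stand-in for the hypothetical combined response sketched above.
response = {
    'tokens': [['[CLS]', 'hello', 'world', '!', '[SEP]']],
    'embs': [
        np.random.rand(1, 5, 768),   # NONE: (batch, seq_len, emb_dim)
        np.random.rand(1, 768),      # REDUCE_MEAN: (batch, emb_dim)
    ],
}

# Token-level embeddings line up one-to-one with the returned tokens.
for sent_tokens, sent_mat in zip(response['tokens'], response['embs'][0]):
    for tok, vec in zip(sent_tokens, sent_mat):
        print(tok, vec[:3])          # each token with the first 3 dims of its vector
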
hanxiao commented 5 years ago

part 1 done in #226

hanxiao commented 5 years ago

FYI, this feature (retrieving the tokenization from the server) is implemented in #226 and available since 1.8.0. Please run

pip install -U bert-serving-server bert-serving-client

to update. Usage can be found in the tutorial https://github.com/hanxiao/bert-as-service#using-your-own-tokenizer or in the documentation: https://bert-as-service.readthedocs.io/
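
In short, it looks roughly like this (see the tutorial above for the exact, up-to-date usage):

from bert_serving.client import BertClient

# The server must be started with -show_tokens_to_client.
bc = BertClient()
vecs, tokens = bc.encode(['hello world!', 'this is it'], show_tokens=True)
print(tokens[0])    # server-side tokenization, e.g. ['[CLS]', 'hello', 'world', '!', '[SEP]']
print(vecs.shape)   # embeddings, same as before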

ironflood commented 5 years ago

Wow awesome! Testing now.

hanxiao commented 5 years ago

As a follow-up on #226, it turns out the biggest challenge wasn't the API but the communication and memory overhead of adding extra information (e.g. tokenization, multiple embeddings), which hurts overall scalability. This becomes very serious when max_seq_len and the client batch size are high. Hence some of the effort in #226 went into refactoring BertSink to make it more efficient.
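
To give a rough sense of the numbers, here is a back-of-the-envelope calculation (assuming float32 and that the four pooling layers are concatenated into one 768*4-dim token vector):

# Rough payload size for NONE pooling with 4 concatenated layers, float32.
batch_size  = 256
max_seq_len = 40
emb_dim     = 768 * 4      # 4 layers concatenated
bytes_per_f = 4            # float32

payload = batch_size * max_seq_len * emb_dim * bytes_per_f
print('%.0f MiB per batch' % (payload / 2**20))   # ~120 MiB, before adding tokens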

ironflood commented 5 years ago

Tokenization is very lightweight extra information, right? If that's the only extra info it shouldn't impact the sink much, or am I mistaken?

However, when I try to embed sentences with show_tokens=True, the encoding takes forever compared to before. Is that normal?

Also, a new issue (should I open a separate one?) since I moved to 1.8 on both server and client: when starting the server without the -show_tokens_to_client parameter, I get a warning about show_tokens=True even though I don't ask for it when encoding, as shown here from the Python console:

Server side (2x P100 GPUs):

bert-serving-start -model_dir ~/um/uncased_L-12_H-768_A-12/ -pooling_layer -1 -2 -3 -4 -pooling_strategy NONE -num_worker=2 -max_seq_len=40 -mask_cls_sep -max_batch_size=256 -gpu_memory_fraction=1.0

Client side:

>>> from bert_serving.client import BertClient
>>> bc = BertClient(ip='10.164.0.5')
>>> encoded = bc.encode(['hello world!', 'thisis it'])
/home/ubuntu/p36-torch100/lib/python3.6/site-packages/bert_serving/client/__init__.py:285: UserWarning: "show_tokens=True", but the server does not support showing tokenization info to clients.
here is what you can do:
- start a new server with "bert-serving-start -show_tokens_to_client ..."
- or, use "encode(show_tokens=False)"
  warnings.warn('"show_tokens=True", but the server does not support showing tokenization info to clients.\n'

Note: I updated both the server & client to 1.8.

hanxiao commented 5 years ago

The warning issue is fixed in #228; run pip install -U bert-serving-server bert-serving-client to get the update.

Regarding the speed: no, that's not normal. Here is a simple benchmark showing there is no significant overhead with -show_tokens_to_client. You can reproduce it yourself with example/example1.py

(benchmark screenshot: encoding speed with vs. without -show_tokens_to_client)
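
If you want a quick sanity check without the full example, something like this should do (timings will of course vary with your hardware):

import time
from bert_serving.client import BertClient

bc = BertClient()
data = ['hello world!'] * 1024

start = time.perf_counter()
bc.encode(data)
print('without tokens: %.2fs' % (time.perf_counter() - start))

start = time.perf_counter()
bc.encode(data, show_tokens=True)
print('with tokens:    %.2fs' % (time.perf_counter() - start))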

ironflood commented 5 years ago

Thanks for the blazing-fast reply. I can no longer reproduce the 10x slowdown I experienced before, now that I've moved to 1.8.1, so please disregard my comment.

The fact that we can now get the tokens back in the response opens up interesting new uses of your embedding server, such as NER (like in the paper, using the last 4 layers for feature extraction) or Q&A, to name a few applications.

I was also wondering what your point of view is on suggestion 2) (the ability to start with mixed strategies)? Right now my workaround is to start the server without pooling and reduce as needed on the client side (see the sketch below) - while this is very flexible, it puts a lot of pressure on the network. Token-level embeddings quickly saturate the bandwidth and become the bottleneck.
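
For reference, the client-side reduce is roughly this (a sketch; it assumes padded/masked positions come back as all-zero rows):

import numpy as np
from bert_serving.client import BertClient

# Server started with -pooling_strategy NONE, so we get token-level
# embeddings of shape (batch, max_seq_len, dim) and pool them client-side.
bc = BertClient()
tok_embs = bc.encode(['hello world!', 'this is it'])     # (batch, max_seq_len, dim)

# Treat all-zero rows as padding/masked positions and average the rest,
# i.e. a client-side REDUCE_MEAN.
mask = np.abs(tok_embs).sum(axis=-1) > 0                 # (batch, max_seq_len)
denom = np.maximum(mask.sum(axis=1, keepdims=True), 1)
sent_embs = (tok_embs * mask[..., None]).sum(axis=1) / denom
print(sent_embs.shape)                                   # (batch, dim)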

Also, since moving away from 1.7.9 I'm facing multiprocessing issues with the PyTorch DataLoader when using several workers, but I'd better open a new issue about that.

Thanks again!