jina-ai / clip-as-service

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
https://clip-as-service.jina.ai

Client hangs .. python server #300

Open nstfk opened 5 years ago

nstfk commented 5 years ago


System information: running on Google Colab

Description

I'm starting the server from Python, as instructed in the README:

from bert_serving.server.helper import get_args_parser
from bert_serving.server import BertServer

args = get_args_parser().parse_args(['-model_dir', '/content/gdrive/My Drive/MedSentEval/models/cased_L-12_H-768_A-12/',
                                     '-port', '5555',
                                     '-port_out', '5556',
                                     '-max_seq_len', 'NONE',
                                     '-mask_cls_sep',
                                     '-num_worker', '1'])  # the num_worker value was cut off in the original post; '1' is a placeholder

server = BertServer(args)
server.start()

and this is the response I get:

I:VENTILATOR:[__i:__i: 66]:freeze, optimize and export graph, could take a while...
I:GRAPHOPT:[gra:opt: 52]:model config: /content/gdrive/My Drive/MedSentEval/models/cased_L-12_H-768_A-12/bert_config.json
I:GRAPHOPT:[gra:opt: 55]:checkpoint: /content/gdrive/My Drive/MedSentEval/models/cased_L-12_H-768_A-12/bert_model.ckpt
I:GRAPHOPT:[gra:opt: 59]:build graph...
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/_multiprocessing_helpers.py:38: UserWarning: [Errno 10] No child processes.  joblib will operate in serial mode
  warnings.warn('%s.  joblib will operate in serial mode' % (e,))

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

I:GRAPHOPT:[gra:opt:128]:load parameters from checkpoint...
I:GRAPHOPT:[gra:opt:132]:optimize...
I:GRAPHOPT:[gra:opt:140]:freeze...
I:GRAPHOPT:[gra:opt:145]:write graph to a tmp file: /tmp/tmp3xjwfob5
I:VENTILATOR:[__i:__i: 74]:optimized graph is stored at: /tmp/tmp3xjwfob5
I:VENTILATOR:[__i:_ru:128]:bind all sockets
I:VENTILATOR:[__i:_ru:132]:open 8 ventilator-worker sockets
I:VENTILATOR:[__i:_ru:135]:start the sink
I:SINK:[__i:_ru:303]:ready 


and I'm calling the server via:

from bert_serving.client import BertClient
bc = BertClient(port=18888, port_out=18889, timeout=10000)
bc.encode(['First do it', 'then do it right', 'then do it better'])

Nothing happens, it hangs there forever!
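
Note that the server snippet above binds ports 5555/5556 while the client connects to 18888/18889; unless those ports are being forwarded somewhere, a minimal sanity check is to point the client at the same ports the server was started with, for example:

from bert_serving.client import BertClient

# Sketch: connect to the same ports the server binds (5555 / 5556 above);
# the timeout (in milliseconds) makes a stuck call raise instead of blocking forever.
bc = BertClient(port=5555, port_out=5556, timeout=10000)
print(bc.encode(['First do it', 'then do it right', 'then do it better']).shape)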

wliu-sift commented 5 years ago

I'm observing a similar issue on an AWS p3 instance. The service freezes after receiving a request; apparently it hangs on some threading issue.

I:VENTILATOR:[__i:_ru:216]:terminated!
Traceback (most recent call last):
  File "/home/hadoop/venv/bin/bert-serving-start", line 10, in <module>
    sys.exit(main())
  File "/home/hadoop/venv/local/lib/python3.6/dist-packages/bert_serving/server/cli/__init__.py", line 5, in main
    server.join()
  File "/usr/lib64/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/usr/lib64/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

Still haven't figured out why, but apparently this problem is transient and only happens at start-up time. If the first encoding task goes through, then everything else goes through.

FredericoCoelhoNunes commented 5 years ago

I'm having the same problem, but I can't make the first encoding task go through. I'm looking to use Colaboratory as a one-time thing for this project, just to generate a large number of sentence embeddings, since I don't have access to a GPU.

I have tried setting the ignore_all_checks=True parameter when starting the client, but that doesn't work either (it gets stuck when I call .encode()).
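
For reference, a minimal version of that client setup (the timeout value here is an arbitrary example; with a finite timeout, a stuck encode() should at least raise rather than block forever):

from bert_serving.client import BertClient

# ignore_all_checks=True skips the initial handshake checks with the server;
# timeout is in milliseconds, so a hung call eventually raises a TimeoutError
# instead of blocking indefinitely.
bc = BertClient(ignore_all_checks=True, timeout=10000)
vec = bc.encode(['some test sentence'])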

Any help would be really appreciated! Thanks.

iislucas commented 5 years ago

I've noticed the same issue; it was quite easy to recreate on a Google Cloud instance. Here are the details I was using, which may help reproduce it:

System information

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      0MiB / 16130MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

After I start the instance and send it a few queries, all further client requests hang. I suspect a race condition in the server.

iislucas commented 5 years ago

I tried v1.8.1 and found the same issue: after a few queries and a minute or so, the server becomes unresponsive, and neither the old client nor new clients ever get responses from the encode function.

The last logs from the server look like so:

I:SINK:[__i:_ru:312]:job register       size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:WORKER-0:[__i:gen:492]:new job        socket: 0       size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:WORKER-0:[__i:_ru:468]:job done       size: (1, 768)  client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:SINK:[__i:_ru:292]:collect b'EMBEDDINGS' b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10' (E:1/T:0/A:1)
I:SINK:[__i:_ru:301]:send back  size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#10'
I:VENTILATOR:[__i:_ru:164]:new encode request       req id: 11      size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218'
I:SINK:[__i:_ru:312]:job register       size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'
I:WORKER-0:[__i:gen:492]:new job        socket: 0       size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'

Occasionally, on the client side, instead of just hanging, I see errors like this (the client is running locally on Python 3.7, while the server is using Python 3.5):

Traceback (most recent call last):
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 2328, in __call__
    return self.wsgi_app(environ, start_response)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 2314, in wsgi_app
    response = self.handle_exception(e)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1760, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/_compat.py", line 36, in reraise
    raise value
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 2311, in wsgi_app
    response = self.full_dispatch_request()
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1834, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1737, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/_compat.py", line 36, in reraise
    raise value
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1832, in full_dispatch_request
    rv = self.dispatch_request()
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/flask/app.py", line 1818, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "Users/iislucas/index-server/index_server.py", line 42, in embedding
    embedding = bc.encode([obj['text']])
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 202, in arg_wrapper
    return func(self, *args, **kwargs)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 283, in encode
    r = self._recv_ndarray(req_id)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 166, in _recv_ndarray
    request_id, response = self._recv(wait_for_req_id)
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 160, in _recv
    raise e
  File "Users/iislucas/index-server/.pyenv/lib/python3.7/site-packages/bert_serving/client/__init__.py", line 150, in _recv
    request_id = int(response[-1])
ValueError: invalid literal for int() with base 10: b'{"shape":[1,768],"dtype":"float32","tokens":""}

And the corresponding log on the server is:

I:SINK:[__i:_ru:312]:job register       size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#12'
I:WORKER-0:[__i:gen:492]:new job        socket: 0       size: 1 client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#12'
I:WORKER-0:[__i:_ru:468]:job done       size: (1, 768)  client: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'
I:SINK:[__i:_ru:292]:collect b'EMBEDDINGS' b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11' (E:1/T:0/A:1)
I:SINK:[__i:_ru:301]:send back  size: 1 job id: b'cb057fa0-40e5-46a1-8014-4f7e5f23d218#11'

Are requests somehow getting out of sync?

4everlove commented 5 years ago

I have a similar issue. If I terminate the server, I also get the same threading exception. As for sample code to trigger this hang: if you cancel encode with Ctrl+C and rerun it, you can see that the SINK sent back the result of the first request, but the new encode hangs again.

4everlove commented 5 years ago

Could you folks check which version of libzmq your machine is using while encountering this issue?
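
A quick way to check that from Python, using pyzmq's built-in version helpers:

import zmq

# libzmq version the bindings are linked against, and the pyzmq version itself
print('libzmq:', zmq.zmq_version())
print('pyzmq :', zmq.pyzmq_version())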

asankasan commented 4 years ago

I'm having the same issue. Was anybody able to find a solution to this?

wliu-sift commented 4 years ago

My workaround is to wait for the server to fully boot up by adding a ~10 s delay. The issue never appears again.
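
One way to do this on the client side, assuming the delay goes before the first request (the 10 s value, retry count, and timeout are arbitrary, and it assumes a client with a finite timeout raises TimeoutError when the server is not yet responsive):

import time
from bert_serving.client import BertClient

time.sleep(10)  # give the server extra time after start-up

# warm-up: retry a small encode until the first request goes through
for attempt in range(5):
    try:
        bc = BertClient(timeout=10000)  # timeout in ms, so a hang raises instead of blocking
        bc.encode(['warm-up'])
        break
    except TimeoutError:
        print('warm-up attempt %d timed out, retrying...' % attempt)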

asankasan commented 4 years ago

@wliu-sift Thanks for the response. Where do you add the delay? To the client?

mjangid commented 4 years ago

Hi team, I am also getting the same issue. Do you have any update on this?

bigrig2212 commented 4 years ago

Same issue here. I'm wondering if it's keeping sockets open; it looks like the socket list grows pretty quickly and doesn't come back down. Here's the health check from right before it went down:

{"ckpt_name":"bert_model.ckpt","client":"1b8cb6cd-3b2f-4ade-80f8-e8eb01298c14","config_name":"bert_config.json","cors":"*","cpu":false,"device_map":[],"fixed_embed_length":false,"fp16":false,"gpu_memory_fraction":0.5,"graph_tmp_dir":null,"http_max_connect":10,"http_port":8080,"mask_cls_sep":false,"max_batch_size":256,"max_seq_len":25,"model_dir":"models/uncased_L-12_H-768_A-12","num_concurrent_socket":30,"num_process":17,"num_worker":15,"pooling_layer":[-2],"pooling_strategy":2,"port":5555,"port_out":5556,"prefetch_size":10,"priority_batch_size":16,"python_version":"3.6.3 (default, Jul 9 2019, 08:50:08) \n[GCC 7.3.1 20180303 (Red Hat 7.3.1-5)]","pyzmq_version":"19.0.1","server_current_time":"2020-05-18 13:32:43.888630","server_start_time":"2020-05-18 13:15:37.862617","server_version":"1.8.9","show_tokens_to_client":false,"statistic":{"avg_last_two_interval":167.3810899715,"avg_request_per_client":12.0,"avg_request_per_second":0.034329692637323224,"avg_size_per_request":2.0,"max_last_two_interval":517.0026954719999,"max_request_per_client":12,"max_request_per_second":0.10548447649600348,"max_size_per_request":3,"min_last_two_interval":9.480067903999952,"min_request_per_client":12,"min_request_per_second":0.0019342258923564133,"min_size_per_request":1,"num_active_client":0,"num_data_request":4,"num_max_last_two_interval":1,"num_max_request_per_client":1,"num_max_request_per_second":1,"num_max_size_per_request":1,"num_min_last_two_interval":1,"num_min_request_per_client":1,"num_min_request_per_second":1,"num_min_size_per_request":1,"num_sys_request":8,"num_total_client":1,"num_total_request":12,"num_total_seq":8},"status":200,"tensorflow_version":["1","11","0"],"tuned_model_dir":null,"ventilator -> worker":["ipc://tmp3fYAAK/socket","ipc://tmpG06X52/socket","ipc://tmpAZqmBl/socket","ipc://tmpYIBL6D/socket","ipc://tmp8EubCW/socket","ipc://tmpzm5B7e/socket","ipc://tmpZmr3Cx/socket","ipc://tmpAWuv8P/socket","ipc://tmpJWeYD8/socket","ipc://tmpPGVr9q/socket","ipc://tmpLelWEJ/socket","ipc://tmpBTtra2/socket","ipc://tmpfmwXFk/socket","ipc://tmpFQ0ubD/socket","ipc://tmpeKZ3GV/socket","ipc://tmp0nSDce/socket","ipc://tmpuJCeIw/socket","ipc://tmpYDhQdP/socket","ipc://tmppWMsJ7/socket","ipc://tmpQO75eq/socket","ipc://tmpLgnKKI/socket","ipc://tmpAKzpg1/socket","ipc://tmpDHC5Lj/socket","ipc://tmphWYMhC/socket","ipc://tmpPjGvNU/socket","ipc://tmpTikfjd/socket","ipc://tmpYdRZOv/socket","ipc://tmpZIfLkO/socket","ipc://tmpUovxQ6/socket","ipc://tmp8TDkmp/socket"],"ventilator <-> sink":"ipc://tmpcsCe5r/socket","verbose":false,"worker -> sink":"ipc://tmpKaNt2G/socket","xla":false,"zmq_version":"4.3.2"}

Yuiard commented 3 years ago

I had the same issue. It is perplexing that if I open two client programs, only one of them gets stuck: if I send query1 from client1, it gets stuck; then if I send query2 from client2, client1 receives the result and client2 gets stuck. Very similar to what 4everlove described in https://github.com/hanxiao/bert-as-service/issues/387.