jina-ai / clip-as-service

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
https://clip-as-service.jina.ai

clip_server lost connection after running for a while #850

Open learningpro opened 1 year ago

learningpro commented 1 year ago

It starts normally, but after running for a while it loses the connection.

DEBUG  clip_t/rep-0@6022 start listening on 0.0.0.0:54630
DEBUG  clip_t/rep-0@6019 ready and listening                                          [10/28/22 23:32:24]
────────────────────────────────────── 🎉 Flow is ready to serve! ───────────────────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│  ⛓      Protocol                   GRPC  │
│  🏠        Local          0.0.0.0:51000  │
│  🔒      Private    192.168.31.58:51000  │
╰──────────────────────────────────────────╯
DEBUG  Flow@6019 2 Deployments (i.e. 2 Pods) are running in this Flow                 [10/28/22 23:32:24]
DEBUG  clip_t/rep-0@6022 got an endpoint discovery request                                                                                [10/28/22 23:37:35]
DEBUG  clip_t/rep-0@6022 recv DataRequest at /rank with id: 644c5f98f0034283bf9334718ec4295c
UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:178.) (raised from /opt/homebrew/lib/python3.9/site-packages/torchvision/transforms/functional.py:150)
DEBUG  gateway/rep-0/GatewayRuntime@6023 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 1/3. Trying next replica, if    [10/28/22 23:37:35]
       available.
DEBUG  gateway/rep-0/GatewayRuntime@6023 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 2/3. Trying next replica, if
       available.
DEBUG  gateway/rep-0/GatewayRuntime@6023 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 3/3. Trying next replica, if
       available.
DEBUG  gateway/rep-0/GatewayRuntime@6023 GRPC call failed, retries exhausted
DEBUG  gateway/rep-0/GatewayRuntime@6023 resetting connection to 0.0.0.0:54630
ERROR  gateway/rep-0/GatewayRuntime@6023 Error while getting responses from deployments: failed to connect to all addresses; last error:
       UNKNOWN: Failed to connect to remote host: Connection refused |Gateway: Communication error with deployment clip_t at address(es)
       {'0.0.0.0:54630'}. Head or worker(s) may be down.
pip3 show clip_server
Name: clip-server
Version: 0.8.0
Summary: Embed images and sentences into fixed-length vectors via CLIP
Home-page: https://github.com/jina-ai/clip-as-service
Author: Jina AI
Author-email: hello@jina.ai
License: Apache 2.0
Location: /opt/homebrew/lib/python3.9/site-packages
Requires: ftfy, jina, open-clip-torch, prometheus-client, regex, torch, torchvision
Required-by:

Host: MacBook Pro M1

161424 commented 1 year ago

Because my Docker cluster has no internet access, I downloaded ViT-B-32.pt locally and then uploaded it to the Docker cluster. However, the container could not find the model and still tried to download ViT, even though the model was already present at the expected relative location in the cluster.

My problem is now solved. The main cause was that the program could not load the model from the model root path because the "root" environment variable had been changed.
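
For anyone hitting the same offline setup, a rough sketch of a workaround (the cache path is my assumption based on the docs, which state the torch backend downloads weights to ~/.cache/clip; the source path is illustrative): copy the pre-downloaded checkpoint into that cache folder inside the container before starting the server.

mkdir -p ~/.cache/clip
cp /path/to/ViT-B-32.pt ~/.cache/clip/ViT-B-32.pt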

numb3r3 commented 1 year ago

@learningpro How do you start the server: via the local CLI (python -m clip_server) or on k8s?

ZiniuYu commented 1 year ago

@learningpro Could you provide more details on this problem? Like the YAML file you use, steps to reproduce, etc. Thanks!

kaushikb11 commented 1 year ago

@numb3r3 @ZiniuYu I'm facing the same issue. Machine: Mac M1 Pro

Command

python -m clip_server
kaushikb11 commented 1 year ago
❯ python3 -m clip_server                                       search-app 12:27:07
────────────────────────────────────────────────────────────────── 🎉 Flow is ready to serve! ──────────────────────────────────────────────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│  ⛓      Protocol                   GRPC  │
│  🏠        Local          0.0.0.0:51000  │
│  🔒      Private     192.168.1.47:51000  │
│  🌍       Public             None:51000  │
╰──────────────────────────────────────────╯
ERROR  gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway:        [11/12/22 12:31:35]
       Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.                                              
ERROR  gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway:        [11/12/22 12:31:39]
       Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.                                              
ERROR  gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway:        [11/12/22 12:32:15]
       Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.                                              
ERROR  gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway:        [11/12/22 12:32:21]
       Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.                                              
ERROR  gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway:        [11/12/22 12:32:23]
       Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down.                                              
ERROR  gateway/rep-0/GatewayRuntime@5462 Error while getting responses from deployments: failed to connect to all addresses |Gateway:        [11/12/22 12:35:01]
       Communication error with deployment clip_t at address(es) {'0.0.0.0:59354'}. Head or worker(s) may be down. 
kaushikb11 commented 1 year ago

Client code

from clip_client import Client

client = Client('grpc://0.0.0.0:51000')

r = client.encode(['she smiled, with pain', 'https://clip-as-service.jina.ai/_static/favicon.png'])

print(r)
JoanFM commented 1 year ago

@kaushikb11, have you tried sending only text first, and then only images?

kaushikb11 commented 1 year ago

Yes, it worked the first time with

r = client.encode(['she smiled, with pain'])

but not with two strings of text

r = client.encode(['she smiled, with pain', 'what is pain?'])

It failed with a single image as well

r = client.encode(['https://clip-as-service.jina.ai/_static/favicon.png'])
jemmyshin commented 1 year ago

What's the output for r = client.encode(['she smiled, with pain']) and r = client.encode(['she smiled, with pain', 'what is pain?'])? I am wondering why they had different behaviors. @kaushikb11

ZiniuYu commented 1 year ago

Hi @kaushikb11 , What's your output of jina -vf? Can you also try export JINA_LOG_LEVEL=DEBUG and rerun your code?
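
For reference, that means setting the variable in the same shell before starting the server (and the client, if you run it from a separate shell), e.g.:

export JINA_LOG_LEVEL=DEBUG
python3 -m clip_server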

numb3r3 commented 1 year ago

Also, which PyTorch version are you using? And are you running clip_server under Rosetta (x86)?

kaushikb11 commented 1 year ago

@ZiniuYu Here you go

jina -vf                                                                                                          search-app 19:02:13
- jina 3.11.0
- docarray 0.18.1
- jcloud 0.0.36
- jina-hubble-sdk 0.22.2
- jina-proto 0.1.13
- protobuf 3.20.3
- proto-backend python
- grpcio 1.47.2
- pyyaml 6.0
- python 3.8.15
- platform Darwin
- platform-release 22.1.0
- platform-version Darwin Kernel Version 22.1.0: Sun Oct 9 20:15:09 PDT 2022; root:xnu-8792.41.9~2/RELEASE_ARM64_T6000
- architecture arm64
- processor arm
- uid 55969664184872
- session-id 94442b94-63e7-11ed-ad67-32e773f3b228
- uptime 2022-11-14T12:12:58.912197
- ci-vendor (unset)
- internal False
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_EARLY_STOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL (unset)
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD (unset)
* JINA_OPTOUT_TELEMETRY (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)
* JINA_LOCKS_ROOT (unset)
* JINA_K8S_ACCESS_MODES (unset)
* JINA_K8S_STORAGE_CLASS_NAME (unset)
* JINA_K8S_STORAGE_CAPACITY (unset)
kaushikb11 commented 1 year ago

@numb3r3 PyTorch versions

pip3 freeze | grep torch                                                                                         
open-clip-torch==2.4.1
torch==1.13.0
torchmetrics==0.10.2
torchvision==0.14.0
kaushikb11 commented 1 year ago

What's the output for r = client.encode(['she smiled, with pain']) and r = client.encode(['she smiled, with pain', 'what is pain?'])? I am wondering why they had different behaviors

@jemmyshin I have no idea. The first returned an embedding.

kaushikb11 commented 1 year ago

Let me know if I could help you with anything else. fyi: The system is Mac M1 Pro

ZiniuYu commented 1 year ago

@kaushikb11 The environment looks legit. Can you also please rerun everything with export JINA_LOG_LEVEL=DEBUG and paste the output here?

kaushikb11 commented 1 year ago

Traceback when I run the client

DEBUG  GRPCClient@9945 connected to 0.0.0.0:51000                                                                                            [11/14/22 12:50:55]
Traceback (most recent call last):
  File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/helper.py", line 47, in _arg_wrapper
    return func(*args, **kwargs)
  File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/clip_client/client.py", line 153, in _gather_result
    results[r[:, 'id']][:, attribute] = r[:, attribute]
  File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/docarray/array/mixins/getitem.py", line 102, in __getitem__
    elif isinstance(index[0], bool):
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "check.py", line 9, in <module>
    r = client.encode(["She is in pain", "what's pain"])
  File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/clip_client/client.py", line 295, in encode
    self._client.post(
  File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/mixin.py", line 271, in post
    return run_async(
  File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/helper.py", line 1334, in run_async
    return asyncio.run(func(*args, **kwargs))
  File "/opt/homebrew/Cellar/python@3.8/3.8.15/Frameworks/Python.framework/Versions/3.8/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/homebrew/Cellar/python@3.8/3.8.15/Frameworks/Python.framework/Versions/3.8/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/mixin.py", line 262, in _get_results
    async for resp in c._get_results(*args, **kwargs):
  File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/base/grpc.py", line 131, in _get_results
    callback_exec(
  File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/helper.py", line 83, in callback_exec
    _safe_callback(on_done, continue_on_error, logger)(response)
  File "/Users/kaushikbokka/apps/search-app/venv/lib/python3.8/site-packages/jina/clients/helper.py", line 49, in _arg_wrapper
    err_msg = f'uncaught exception in callback {func.__name__}(): {ex!r}'
AttributeError: 'functools.partial' object has no attribute '__name__'

Server side

python3 -m clip_server                                                                                            search-app 12:49:45
⠋ Waiting ... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/0 -:--:--DEBUG  gateway/rep-0/GatewayRuntime@9744 adding connection for deployment clip_t/heads/0 to grpc://0.0.0.0:65282                             [11/14/22 12:49:53]
DEBUG  gateway/rep-0/GatewayRuntime@9744 start server bound to 0.0.0.0:51000                                                                                    
DEBUG  gateway/rep-0@9729 ready and listening                                                                                                [11/14/22 12:49:53]
⠼ Waiting clip_t... ━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━━━━━━━━ 2/3 0:00:03DEBUG  clip_t/rep-0@9743 <clip_server.executors.clip_torch.CLIPEncoder object at 0x13f4d1dc0> is successfully loaded!                        [11/14/22 12:49:57]
DEBUG  clip_t/rep-0@9743 start listening on 0.0.0.0:65282                                                                                                       
DEBUG  clip_t/rep-0@9729 ready and listening                                                                                                 [11/14/22 12:49:57]
────────────────────────────────────────────────────────────────── 🎉 Flow is ready to serve! ──────────────────────────────────────────────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│  ⛓      Protocol                   GRPC  │
│  🏠        Local          0.0.0.0:51000  │
│  🔒      Private    10.191.62.138:51000  │
│  🌍       Public             None:51000  │
╰──────────────────────────────────────────╯
DEBUG  Flow@9729 2 Deployments (i.e. 2 Pods) are running in this Flow                                                                        [11/14/22 12:49:57]
DEBUG  clip_t/rep-0@9743 got an endpoint discovery request                                                                                   [11/14/22 12:50:22]
DEBUG  clip_t/rep-0@9743 recv DataRequest at /encode with id: 4a0fa5aa31ca493e9f316474cb5909a7                                                                  
DEBUG  clip_t/rep-0@9743 recv DataRequest at /encode with id: 1842d6550a384061bea12795770d5cf9                                               [11/14/22 12:50:25]
DEBUG  clip_t/rep-0@9743 recv DataRequest at /encode with id: 766e528658bf4b6f81b0f5ce96631d7c                                               [11/14/22 12:50:55]
DEBUG  gateway/rep-0/GatewayRuntime@9744 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 1/3. Trying next replica, if       [11/14/22 12:50:55]
       available.                                                                                                                                               
DEBUG  gateway/rep-0/GatewayRuntime@9744 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 2/3. Trying next replica, if                          
       available.                                                                                                                                               
DEBUG  gateway/rep-0/GatewayRuntime@9744 GRPC call failed with code StatusCode.UNAVAILABLE, retry attempt 3/3. Trying next replica, if                          
       available.                                                                                                                                               
DEBUG  gateway/rep-0/GatewayRuntime@9744 GRPC call failed, retries exhausted                                                                                    
DEBUG  gateway/rep-0/GatewayRuntime@9744 resetting connection to 0.0.0.0:65282                                                                                  
ERROR  gateway/rep-0/GatewayRuntime@9744 Error while getting responses from deployments: failed to connect to all addresses |Gateway:                           
       Communication error with deployment clip_t at address(es) {'0.0.0.0:65282'}. Head or worker(s) may be down.  
kaushikb11 commented 1 year ago

@ZiniuYu

numb3r3 commented 1 year ago

@kaushikb11 So far we cannot reproduce your error on our side (exactly the same environment: M1 Pro and the same jina, docarray, and PyTorch versions). We suspect this is an upstream issue related to the PyTorch installation; we just need more time to verify it. Of course, any further feedback is welcome. I believe others in our community will run into this problem as well.

kaushikb11 commented 1 year ago

@numb3r3 Noted! Thanks. Do keep me updated if you have any progress.

vincetrep commented 1 year ago

Hello,

I am running into similar issues on different setups.

I am also running clip-as-service in gRPC mode; the clip-server runs in a Docker container.

I have seen this issue in my development environment, where the communication stops at some point. This time it stopped right after a restart, which I have not seen as often.

On my other environments running on Kubernetes, every time this has happened I had to redeploy the containers to make them functional again. Do you have any clues as to the source of this problem? Could it be related to the management of sockets/communication channels? Is it possible that peppering the service with too many queued requests makes it run out of connections? Let me know how I can help with the troubleshooting.

Here is an error log upon restarting the container on Docker Desktop running on Windows:

Task exception was never retrieved
future: <Task finished name='Task-13' coro=<GatewayRequestHandler.handle_request..gather_endpoints() done, defined at /usr/local/lib/python3.9/site-packages/jina/serve/runtimes/gateway/request_handling.py:54> exception=failed to connect to all addresses |Gateway: Communication error with deployment at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/jina/serve/networking.py", line 1068, in task_wrapper
    return await connection.send_discover_endpoint(
  File "/usr/local/lib/python3.9/site-packages/jina/serve/networking.py", line 377, in send_discover_endpoint
    await self._init_stubs()
  File "/usr/local/lib/python3.9/site-packages/jina/serve/networking.py", line 353, in _init_stubs
    available_services = await GrpcConnectionPool.get_available_services(
  File "/usr/local/lib/python3.9/site-packages/jina/serve/networking.py", line 1390, in get_available_services
    async for res in response:
  File "/usr/local/lib/python3.9/site-packages/grpc/aio/_call.py", line 326, in _fetch_stream_responses
    await self._raise_for_status()
  File "/usr/local/lib/python3.9/site-packages/grpc/aio/_call.py", line 236, in _raise_for_status
    raise _create_rpc_error(await self.initial_metadata(), await
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1677095795.838017200","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1677095795.838016400","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/jina/serve/runtimes/gateway/request_handling.py", line 68, in gather_endpoints
    raise err
  File "/usr/local/lib/python3.9/site-packages/jina/serve/runtimes/gateway/request_handling.py", line 60, in gather_endpoints
    endpoints = await asyncio.gather(*tasks_to_get_endpoints)
  File "/usr/local/lib/python3.9/site-packages/jina/serve/networking.py", line 1082, in task_wrapper
    raise error
jina.excepts.InternalNetworkError: failed to connect to all addresses |Gateway: Communication error with deployment at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.

ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.    [02/22/23 19:56:35]
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.    [02/22/23 19:56:36]
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.    [02/22/23 19:56:38]
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.    [02/22/23 19:56:41]
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.    [02/22/23 19:56:43]
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.    [02/22/23 19:56:45]
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.    [02/22/23 19:56:50]
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.    [02/22/23 19:56:53]

────────────────────────── 🎉 Flow is ready to serve! ──────────────────────────
╭────────────── 🔗 Endpoint ───────────────╮
│  ⛓      Protocol                   GRPC  │
│  🏠        Local            0.0.0.0:9100  │
│  🔒      Private         172.24.0.5:9100  │
│  🌍       Public     23.233.181.148:9100  │
╰──────────────────────────────────────────╯
╭──────── 💎 Prometheus extension ─────────╮
│  🔦       clip_t               ...:9091  │
│  🔦      gateway               ...:9090  │
╰──────────────────────────────────────────╯

ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.    [02/22/23 19:57:00]
ERROR  gateway/rep-0/GatewayRuntime@22 Error while getting responses from deployments: failed to connect to all addresses |Gateway: Communication error with deployment clip_t at address(es) {'0.0.0.0:64294'}. Head or worker(s) may be down.    [02/22/23 19:57:01]
ZiniuYu commented 1 year ago

Hi @vincetrep

What's your jina version (jina -vf)? Can you set the env JINA_LOG_LEVEL=debug and see if it prints any more info?

One possible reason is that you are running out of computing resources. We recently fixed an issue in Jina Core that affects health-check latency when a Flow is stressed by load in k8s; could you please upgrade to the latest jina and try again?
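
(For reference, upgrading both packages would be something like pip install -U jina clip-server, using the PyPI package names shown earlier in this thread.)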

Could you also provide more details about how the communication stops, if possible? For example, does it stop while you are sending large requests, or when the Flow is idle? Anything that may help us debug or reproduce it is welcome.

vincetrep commented 1 year ago

Hi @ZiniuYu ,

Apologies for the delayed answer. The error occurs after the server has been running for a while; I am sending batches containing a mix of texts and images to clip-as-service.

I'm running the latest version of jina: 3.14.2.

Let's take the scenario where we're running out of computing resources. Should there be a recovery mechanism inside the container to get it out of that state, via a retry mechanism or something else?

At the moment, when this occurs, the container ends up in a state where it has lost connectivity and does not recover, even once no more resources are being consumed.

If you want to try to replicate it, set up a local instance on your machine and send a big batch of records (e.g. images) to encode, in order to push the container into a "resource exhausted" state.
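
In the meantime, a possible client-side mitigation on our end (just a sketch, not an official fix; the chunk size, retry count, and backoff are arbitrary, and the server address is the default used earlier in this thread) is to keep each request small and retry a chunk when the gateway reports a communication error:

import time

import numpy as np
from clip_client import Client

client = Client('grpc://0.0.0.0:51000')

def encode_in_chunks(items, chunk_size=8, retries=3, backoff=2.0):
    # Encode `items` in small chunks so no single request overwhelms the server,
    # retrying a chunk a few times if the gateway raises a communication error.
    parts = []
    for i in range(0, len(items), chunk_size):
        chunk = items[i:i + chunk_size]
        for attempt in range(1, retries + 1):
            try:
                parts.append(client.encode(chunk))
                break
            except Exception:
                if attempt == retries:
                    raise
                time.sleep(backoff * attempt)
    return np.vstack(parts)

embeddings = encode_in_chunks(['she smiled, with pain', 'what is pain?'])
print(embeddings.shape)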

amarbakir commented 3 weeks ago

This might be relevant to the IndexError/AttributeError errors seen above: https://github.com/jina-ai/clip-as-service/issues/879#issuecomment-2297111410