marieai / marie-ai

Integrate AI-powered Document Analysis Pipelines
MIT License

gRPC error: StatusCode.UNAVAILABLE keepalive watchdog timeout #87

Open gregbugaj opened 11 months ago

gregbugaj commented 11 months ago

After the service has been running for an extended period of time, we start receiving the following error:

ERROR  065301be-28ca-7efa-8000-9b486b83e4ea : AsyncGRPCClient@1192118 gRPC error: StatusCode.UNAVAILABLE keepalive watchdog timeout                                                             [10/18/23 12:55:14]
       None                                                                                                                                                                                                        
       The ongoing request is terminated as the server is not available or closed already.                                                                                                                         
ERROR  065301be-28ca-7efa-8000-9b486b83e4ea : MARIE@1192118 Error: gRPC error: StatusCode.UNAVAILABLE keepalive watchdog timeout                                                                [10/18/23 12:55:14]
       None                                                                                                                                                                                                        
       Traceback (most recent call last):                                                                                                                                                                          
         File "/home/gbugaj/dev/marieai/marie-ai/marie/clients/base/grpc.py", line 142, in _get_results                                                                                                            
           async for response in stream_rpc.stream_rpc_with_retry():                                                                                                                                               
         File "/home/gbugaj/dev/marieai/marie-ai/marie/clients/base/stream_rpc.py", line 51, in stream_rpc_with_retry                                                                                              
           async for resp in stub.Call(                                                                                                                                                                            
         File "/home/gbugaj/environments/pytorch2-3.10/lib/python3.10/site-packages/grpc/aio/_call.py", line 326, in _fetch_stream_responses                                                                       
           await self._raise_for_status()                                                                                                                                                                          
         File "/home/gbugaj/environments/pytorch2-3.10/lib/python3.10/site-packages/grpc/aio/_call.py", line 236, in _raise_for_status                                                                             
           raise _create_rpc_error(await self.initial_metadata(), await                                                                                                                                            
       grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:                                                                                                                                       
               status = StatusCode.UNAVAILABLE                                                                                                                                                                     
               details = "keepalive watchdog timeout"                                                                                                                                                              
               debug_error_string = "UNKNOWN:Error received from peer ipv4:0.0.0.0:52000 {created_time:"2023-10-18T12:55:14.148968042-05:00", grpc_status:14, grpc_message:"keepalive watchdog                     
       timeout"}"                                                                                                                                                                                                  
       >                                                                                                                                                                                                           

       During handling of the above exception, another exception occurred:                                                                                                                                         

       Traceback (most recent call last):                                                                                                                                                                          
         File "/home/gbugaj/dev/marieai/marie-ai/marie_server/rest_extension.py", line 304, in process_document_request                                                                                            
           async for resp in client.post(                                                                                                                                                                          
         File "/home/gbugaj/dev/marieai/marie-ai/marie/clients/mixin.py", line 497, in post                                                                                                                        
           async for result in c._get_results(                                                                                                                                                                     
         File "/home/gbugaj/dev/marieai/marie-ai/marie/clients/base/grpc.py", line 169, in _get_results                                                                                                            
           await self._handle_error_and_metadata(err)                                                                                                                                                              
         File "/home/gbugaj/dev/marieai/marie-ai/marie/clients/base/grpc.py", line 188, in _handle_error_and_metadata                                                                                              
           raise ConnectionError(msg)                                                                                                                                                                              
       ConnectionError: gRPC error: StatusCode.UNAVAILABLE keepalive watchdog timeout      
fuseraft commented 9 months ago

What is maxAttempts set to in the gRPC backoff strategy?

According to the docs here, the maxAttempts value includes the original request itself.

Judging by AsyncPostMixin, it currently defaults to 1, which means no retries after the initial attempt. Increasing this value might fix the problem.
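
For illustration, here is a minimal sketch of what a higher maxAttempts looks like at the plain gRPC level. It only builds the channel options. The helper names (build_retry_options, retries_after_first_attempt) and the backoff values are assumptions made up for this example, not marie-ai's API, and I have not checked how the marie client maps its own setting onto these options.

```python
import json


def retries_after_first_attempt(max_attempts: int) -> int:
    """maxAttempts counts the original request, so retries = maxAttempts - 1."""
    return max_attempts - 1


def build_retry_options(max_attempts: int = 3) -> list:
    """Build gRPC channel options that retry calls failing with UNAVAILABLE.

    Hypothetical helper for illustration; the backoff values are arbitrary.
    """
    # The gRPC retry policy requires maxAttempts > 1 and caps it at 5.
    if not 2 <= max_attempts <= 5:
        raise ValueError("max_attempts must be between 2 and 5")

    service_config = {
        "methodConfig": [
            {
                # An empty name entry applies the policy to every method.
                "name": [{}],
                "retryPolicy": {
                    "maxAttempts": max_attempts,
                    "initialBackoff": "0.5s",
                    "maxBackoff": "5s",
                    "backoffMultiplier": 2,
                    "retryableStatusCodes": ["UNAVAILABLE"],
                },
            }
        ]
    }
    return [
        ("grpc.enable_retries", 1),
        ("grpc.service_config", json.dumps(service_config)),
    ]


# Usage (requires grpcio; the address is taken from the log above):
#   channel = grpc.aio.insecure_channel("0.0.0.0:52000", options=build_retry_options(3))
```

With max_attempts=3 a failed call would be retried up to two more times. One caveat: as far as I understand the gRPC retry design, a streaming call stops being retried once the server has started responding, so this may not cover a stream that dies mid-flight.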