pinecone-io / pinecone-python-client

The Pinecone Python client
https://www.pinecone.io/docs
Apache License 2.0

gRPC: Allow retries of up to MAX_MSG_SIZE #347

Closed daverigby closed 3 months ago

daverigby commented 3 months ago

Problem

gRPC has a built-in retry mechanism [1], which we configure to automatically retry requests that fail with status UNAVAILABLE from Pinecone.
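For context, gRPC retries are usually configured through a service config JSON passed as a channel option. The following is a minimal sketch of that mechanism, not the client's actual configuration: the attempt count and backoff values are illustrative assumptions, and the empty `name` entry is used so that the policy applies to every method on the channel.

```python
import json

# Illustrative retry policy. The numeric values are assumptions,
# not the values the Pinecone client ships with.
RETRY_SERVICE_CONFIG = {
    "methodConfig": [
        {
            "name": [{}],  # empty entry: default config for all methods
            "retryPolicy": {
                "maxAttempts": 5,
                "initialBackoff": "0.1s",
                "maxBackoff": "1s",
                "backoffMultiplier": 2,
                "retryableStatusCodes": ["UNAVAILABLE"],
            },
        }
    ]
}


def retry_channel_options():
    """Return channel options that enable retries with the policy above."""
    return [
        ("grpc.enable_retries", 1),
        ("grpc.service_config", json.dumps(RETRY_SERVICE_CONFIG)),
    ]
```

The returned list can be passed as the `options` argument when creating a channel, for example with `grpc.secure_channel`.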

However, the VectorService/Upsert method has been observed not to be retried automatically; instead, the failure is raised to the application as an exception:

Traceback (most recent call last):
  File ".venv/lib/python3.11/site-packages/pinecone/grpc/base.py", line 150, in wrapped
    return func(
           ^^^^^
  File ".venv/lib64/python3.11/site-packages/grpc/_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib64/python3.11/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "unavailable"
    debug_error_string = "UNKNOWN:Error received from peer ipv4:34.223.120.220:443 {created_time:"2024-05-10T11:54:43.047741403+00:00", grpc_status:14, grpc_message:"unavailable"}"

Enabling gRPC's tracing [2] by setting the environment variables 'GRPC_VERBOSITY=debug GRPC_TRACE=all' (warning: this is very verbose!) showed that when we do get a StatusCode.UNAVAILABLE, a retry is not considered because the request is too large. In the trace line below, "committing" effectively means that retry attempts are disabled for the call:

0514 14:00:43.870499051 4093173 retry_filter_legacy_call_data.cc:1855] chand=0x7ff708006080 calld=0x56377b0b11e0: exceeded retry buffer size, committing
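To reproduce a trace like the one above from a Python script rather than the shell, the same two variables can be set in-process. This is a minimal sketch; the helper name `enable_grpc_tracing` is hypothetical, and it assumes the variables are set before `grpc` is first imported so that the gRPC core picks them up at initialization.

```python
import os


def enable_grpc_tracing():
    """Turn on gRPC's very verbose debug tracing for this process.

    Hypothetical helper: call it before importing grpc (or the Pinecone
    gRPC client) so the settings are in place when the core initializes.
    """
    os.environ["GRPC_VERBOSITY"] = "debug"
    os.environ["GRPC_TRACE"] = "all"


enable_grpc_tracing()
# import grpc  # import only after the variables are set
```

Searching the resulting output for "exceeded retry buffer size" is one way to confirm whether a failed call was excluded from retry for this reason.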

As per gRPC's options [3], the maximum retry buffer size is controlled via:

/** Per-RPC retry buffer size, in bytes. Default is 256 KiB. */
#define GRPC_ARG_PER_RPC_RETRY_BUFFER_SIZE "grpc.per_rpc_retry_buffer_size"

Given that Upsert messages are frequently larger than 256 KiB (it is common to batch up to the 2 MB limit), any batch larger than 256 KiB will not be retried.
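A rough back-of-the-envelope calculation shows how quickly a batch outgrows the default buffer. The sketch below assumes 1536-dimensional vectors stored as 4-byte floats and ignores IDs, metadata, and protobuf framing, so real messages would be somewhat larger; the dimension and helper names are illustrative assumptions.

```python
DEFAULT_RETRY_BUFFER_BYTES = 256 * 1024  # gRPC default: 256 KiB
BYTES_PER_FLOAT = 4                      # 32-bit float values
DIMENSION = 1536                         # assumed embedding dimension


def approx_batch_bytes(num_vectors, dimension=DIMENSION):
    """Lower-bound estimate of an upsert batch's size (vector values only)."""
    return num_vectors * dimension * BYTES_PER_FLOAT


def fits_default_retry_buffer(num_vectors, dimension=DIMENSION):
    """True if the estimated batch size is within the default retry buffer."""
    return approx_batch_bytes(num_vectors, dimension) <= DEFAULT_RETRY_BUFFER_BYTES


# Largest batch whose vector values alone fit within 256 KiB.
max_vectors = DEFAULT_RETRY_BUFFER_BYTES // (DIMENSION * BYTES_PER_FLOAT)
```

Under these assumptions only a few dozen vectors fit within the default buffer, so a batch of 100 vectors would already be committed without any retry.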

Solution

Address this by raising the retry buffer size to match the maximum message size we support (currently 128 MB, more than sufficient to retry any UpsertRequest).
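As a sketch of what this looks like at the channel level, the retry buffer can be set alongside the message size limits so that the two stay in step. The constant and function names here are illustrative, not necessarily those used in the client, and 128 MB is interpreted as 128 * 1024 * 1024 bytes, which is an assumption.

```python
# Assumed interpretation of the 128 MB maximum message size.
MAX_MSG_SIZE = 128 * 1024 * 1024


def message_size_channel_options(max_msg_size=MAX_MSG_SIZE):
    """Build channel options that tie the retry buffer to the max message size."""
    return [
        ("grpc.max_send_message_length", max_msg_size),
        ("grpc.max_receive_message_length", max_msg_size),
        # Without this, any request above the 256 KiB default is never retried.
        ("grpc.per_rpc_retry_buffer_size", max_msg_size),
    ]
```

Deriving all three options from the same value means any message the channel is allowed to send is also small enough to be buffered for a retry.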

Type of Change

Test Plan

There is no existing test infrastructure to automate testing of this (no way to do error injection). Manually verified that the previously seen (intermittent) UNAVAILABLE responses are now correctly retried.