langchain-ai / langchain-google


ResourceExhausted: 429 received metadata size exceeds soft limit #306

Open pweglik opened 1 week ago

pweglik commented 1 week ago

Environment details

We have created a simple Flask server and deployed it to GCP as a Cloud Run service. We are also using a few other dependencies:

google-api-core==2.19.0
google-cloud-aiplatform==1.49.0 # we needed to pin this version because of a deprecation warning somewhere in Vertex AI; might be worth checking in the future
google-cloud-logging==3.10.0
langchain==0.2.1
langchain_community==0.2.1
langchain-google-vertexai==1.0.4

Description

A snippet of the code:

import google.cloud.logging
from langchain_google_vertexai import VertexAI

# setup logging
client = google.cloud.logging.Client()
client.setup_logging()

# in the endpoint
llm_model = VertexAI(
    model_name="text-bison",
    max_output_tokens=256,
    temperature=1,
    top_p=0.8,
    top_k=40,
    verbose=True,
)

We don't do anything more sophisticated than that. After deployment, it ran fine for a few hours, and then we started to receive warnings:

Retrying langchain_google_vertexai.llms._completion_with_retry.<locals>._completion_with_retry_inner in 4.0 seconds as it raised ResourceExhausted: 429 received metadata size exceeds soft limit (16711 vs. 16384);  :path:90B :authority:79B :method:43B :scheme:44B content-type:60B te:42B grpc-accept-encoding:75B user-agent:100B grpc-trace-bin:103B pc-low-fwd-bin:77B x-goog-request-params:148B x-goog-api-client:12052B x-goog-api-client:62B authorization:1076B x-google-gfe-frontline-info:836B x-google-gfe-timestamp-trace:76B x-google-gfe-verified-user-ip:76B x-gfe-signed-request-headers:472B x-google-gfe-location-info:74B x-gfe-ssl:44B x-google-gfe-tls-base64urlclienthelloprotobuf:299B x-user-ip:56B x-google-service:105B x-google-gfe-service-trace:115B x-google-gfe-backend-timeout-ms:71B accept-encoding:56B x-google-peer-delegation-chain-bin:92B x-google-request-uid:138B x-google-dappertraceinfo:111B.

You can see that there are two fields named x-goog-api-client, and one is growing out of proportion. Later on it grew even bigger, and we started to receive the warning on almost every request. The server also started to time out, as it was unable to serve those requests.

Retrying langchain_google_vertexai.llms._completion_with_retry.<locals>._completion_with_retry_inner in 4.0 seconds as it raised ResourceExhausted: 429 received metadata size exceeds soft limit (27114 vs. 16384);  :path:90B :authority:79B :method:43B :scheme:44B content-type:60B te:42B grpc-accept-encoding:75B user-agent:100B grpc-trace-bin:103B pc-low-fwd-bin:77B x-goog-request-params:148B x-goog-api-client:22452B x-goog-api-client:62B authorization:1076B x-google-gfe-frontline-info:837B x-google-gfe-timestamp-trace:76B x-google-gfe-verified-user-ip:76B x-gfe-signed-request-headers:472B x-google-gfe-location-info:74B x-gfe-ssl:44B x-google-gfe-tls-base64urlclienthelloprotobuf:299B x-user-ip:56B x-google-service:105B x-google-gfe-service-trace:115B x-google-gfe-backend-timeout-ms:71B accept-encoding:56B x-google-peer-delegation-chain-bin:92B x-google-request-uid:140B x-google-dappertraceinfo:111B.

It looks like something is appended to this field and it overflows after some time. I found a place in the code of the library that could cause it: https://github.com/googleapis/google-auth-library-python/blob/main/google/auth/metrics.py#L138-L154
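
For illustration, here is a minimal sketch of the kind of accumulation that would produce this symptom. The names are hypothetical and this is not the actual google-auth code, just the suspected pattern: a helper that appends to an existing header value, combined with a headers mapping that outlives a single request.

# Hypothetical sketch of the suspected failure mode, not the real implementation.
API_CLIENT_HEADER = "x-goog-api-client"

def add_metric_header(headers: dict, value: str) -> None:
    if API_CLIENT_HEADER not in headers:
        headers[API_CLIENT_HEADER] = value
    else:
        # Appending instead of overwriting: if `headers` is reused across
        # requests, the value grows by one token per call.
        headers[API_CLIENT_HEADER] += " " + value

shared_headers: dict = {}
for _ in range(3):
    add_metric_header(shared_headers, "auth-request-type/at cred-type/sa")

# After a few thousand requests this would exceed gRPC's 16 KiB soft limit.
print(len(shared_headers[API_CLIENT_HEADER]))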

I'm looking for some guidance on what could cause such a warning and the overflow in requests.
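
In the meantime, a possible stop-gap for the timeouts (it does not address the root cause) would be to cap the retry loop; if I'm reading the langchain-google-vertexai model fields correctly, the constructor accepts a max_retries parameter:

# Untested sketch: fail fast instead of retrying the oversized request.
llm_model = VertexAI(
    model_name="text-bison",
    max_output_tokens=256,
    max_retries=1,
)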

Steps to reproduce

I'm not really sure; the error only occurred after a few hours (after serving a few thousand requests).
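
One experiment that might help narrow it down (an untested sketch, not a confirmed fix): construct the model once per process instead of inside the endpoint, and check whether the header still grows.

from functools import lru_cache

from langchain_google_vertexai import VertexAI

@lru_cache(maxsize=1)
def get_llm() -> VertexAI:
    # One client per process rather than one per request.
    return VertexAI(
        model_name="text-bison",
        max_output_tokens=256,
        temperature=1,
        top_p=0.8,
        top_k=40,
        verbose=True,
    )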

I have also created this issue in the google-auth library repo, but maybe someone here will be able to help.

Let me know if I can help you somehow or provide any additional info!

lkuligin commented 1 week ago

how exactly do you invoke the model?

on a separate note, have you considered switching to Gemini?
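
Switching would presumably be a one-line change to the snippet above (an untested sketch; other constructor arguments as before):

llm_model = VertexAI(model_name="gemini-pro")  # instead of "text-bison"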

pweglik commented 1 week ago

We use load_summarize_chain with either stuff or map_reduce. Then we call invoke() on those chains. We're happy with PaLM 2 for our use case here.
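
Roughly like this (a simplified sketch; our actual document loading and prompts are omitted):

from langchain.chains.summarize import load_summarize_chain
from langchain_core.documents import Document

docs = [Document(page_content="...")]  # produced by our own loading/splitting

chain = load_summarize_chain(llm_model, chain_type="map_reduce")  # or "stuff"
result = chain.invoke({"input_documents": docs})
summary = result["output_text"]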