maximilienroberti opened this issue 4 months ago
I encountered the same issue with `mistral-nemo`, but didn't have any problems when using `gemini-1.5-flash-001` with local JSON credentials. For the same project, I was able to run the default `mistral-nemo` example notebook in Colab Enterprise without hitting the quota error, but that was with my personal admin account credentials rather than the role account (using `google.auth.transport.requests` directly, not langchain).
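For anyone wanting to replicate that direct-auth approach, here is a minimal sketch that mints a token with `google.auth.transport.requests` and calls the Vertex AI REST endpoint directly; the project, region, key file, model, and prompt are all placeholders, not values from this thread:

```python
import requests
from google.oauth2 import service_account
import google.auth.transport.requests

# Placeholder values -- substitute your own project, region, and key file.
PROJECT = "my-project"
REGION = "us-central1"
KEY_FILE = "service-account.json"
MODEL = "gemini-1.5-flash-001"

# Load credentials from a local JSON key and mint an access token.
creds = service_account.Credentials.from_service_account_file(
    KEY_FILE, scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
creds.refresh(google.auth.transport.requests.Request())

url = (
    f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT}"
    f"/locations/{REGION}/publishers/google/models/{MODEL}:generateContent"
)
body = {"contents": [{"role": "user", "parts": [{"text": "Hello"}]}]}
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {creds.token}"},
    json=body,
)
print(resp.status_code, resp.json())
```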
langchain 0.2.5
langchain-core 0.2.29
langchain-google-vertexai 1.0.8
langchain-openai 0.1.8
langchain-text-splitters 0.2.1
@lkuligin has this been resolved in a new release?
@gawbul @jjaeggli Any workaround for that?
@netanelm-upstream nothing from my side, as of yet, other than using a different model. Was hoping to get some response on this from langchain, though 🤔
@maximilienroberti did you get a workaround for this?
@lkuligin can this be reopened, please, as we are still experiencing this issue with no resolution that I can see. We are definitely not exceeding our quota limit.
I've looked into this a bit more, and it appears that in earlier versions of langchain the `text-unicorn` model was metered against the "Online prediction requests per base model per minute per region per base_model" quota; in more recent versions of langchain, however, it is metered against the "Generate content requests per minute per project per base model per minute per region per base_model" quota instead, which defaults to 0 for everything but the Gemini models, in our account at least. I'm not sure whether it is possible to request an increase for that particular quota in the region in question, though I have opened a support request with Google to check.
Is there any change in the code that would have caused this for this particular model? Things still seem to work fine for `text-bison-32k` and other PaLM 2 models.
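For illustration, the suspected difference can be sketched with the Vertex AI SDK directly. This is an inference from the quota names above, not confirmed langchain internals, and the project and location are placeholders: the legacy `predict` surface is metered under the online-prediction quota, while `generate_content` is metered under the generate-content quota.

```python
import vertexai
from vertexai.language_models import TextGenerationModel
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholders

# Legacy PaLM code path: predict() is metered under
# "Online prediction requests per base model".
palm = TextGenerationModel.from_pretrained("text-unicorn@001")
print(palm.predict("Say hello").text)

# Newer code path: generate_content() is metered under
# "Generate content requests per minute per project per base model",
# which can default to 0 for non-Gemini models and would then raise
# the quota error described in this thread.
gm = GenerativeModel("text-unicorn@001")
print(gm.generate_content("Say hello").text)
```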
Google got back to me and stated that the PaLM models are no longer supported, so they won't adjust any quota settings for us. It's strange that `text-bison-32k` still works for us, though. I see that `text-bison` is mentioned explicitly in the code and covered as part of the `GoogleModelFamily.PALM` class. Is this treated differently from the Gemini models? Would there be any future possibility of supporting `text-unicorn` in the code base as part of that too?
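For readers following along, here is a simplified sketch of the kind of model-family routing being discussed. It is an illustrative reconstruction, not the actual `langchain-google-vertexai` source; the prefix list and classifier method are hypothetical:

```python
from enum import Enum, auto


class GoogleModelFamily(Enum):
    """Simplified stand-in for langchain-google-vertexai's model-family enum."""

    GEMINI = auto()
    PALM = auto()

    @classmethod
    def from_model_name(cls, name: str) -> "GoogleModelFamily":
        # Hypothetical routing: names recognised as PaLM 2 models go down
        # the legacy predict path; everything else is treated as Gemini
        # and routed through generate_content.
        palm_prefixes = ("text-bison", "chat-bison", "text-unicorn")
        if name.startswith(palm_prefixes):
            return cls.PALM
        return cls.GEMINI


# If "text-unicorn" were missing from the PaLM prefix list, it would fall
# through to the Gemini path and be metered under the generate-content quota.
print(GoogleModelFamily.from_model_name("text-bison-32k"))    # PALM
print(GoogleModelFamily.from_model_name("text-unicorn@001"))  # PALM once added
```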
Could you try it out and see whether this fix solves your problem, please?
@lkuligin Thanks for the reply and for making an update to the code. I posted a comment in your PR, as there is a minor typo (should be `unicorn`).
@gawbul is your problem solved?
While running the following script:
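The exact script isn't reproduced here; a hypothetical minimal version, with assumed model name, project, and region, would look like this:

```python
from langchain_google_vertexai import VertexAI

# Hypothetical reproduction -- model, project, and location are assumptions.
llm = VertexAI(
    model_name="text-unicorn@001",
    project="my-project",
    location="us-central1",
)

print(llm.invoke("Tell me a short joke."))
```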
It prints the following:
Then returns the error:
Everything works fine for the bison and gemini models, and our request rate is far below the default 60 requests per minute. This issue started appearing with langchain-google-vertexai versions >= 1.0.4.