googleapis / python-aiplatform

A Python SDK for Vertex AI, a fully managed, end-to-end platform for data science and machine learning.
Apache License 2.0
589 stars 318 forks

Vertex AI generative AI not compatible with gevent Monkey #2937

Open mautini opened 8 months ago

mautini commented 8 months ago

I am trying to use Vertex GenAI in a project that also uses gevent. The project imports the gevent library and calls patch_all at startup. After the patch_all call, the Vertex AI library is unable to load the generative model and hangs forever. I tried several combinations of gevent and Vertex AI versions, and none of them work.

Environment details

Steps to reproduce

  1. Create a project that uses the gevent Python library and google-cloud-aiplatform
  2. Apply monkey.patch_all() from gevent
  3. Try to load a Vertex AI GenAI model

Code example

pip install gevent==23.9.1
pip install google-cloud-aiplatform==1.36.1

from gevent import monkey
monkey.patch_all()

import os
import google.auth
import vertexai
from vertexai.language_models import TextGenerationModel

os.environ['GRPC_VERBOSITY'] = 'info'
os.environ['GRPC_TRACE'] = 'http'

credentials, _ = google.auth.default(quota_project_id='your-project-name')
vertexai.init(project='your-project-name', location="us-central1", credentials=credentials)
text_model = TextGenerationModel.from_pretrained("text-bison")
print(text_model.predict("What's your name?").text)

Stack trace

No stack trace is displayed; it just hangs.

GRPC trace

When calling TextGenerationModel.from_pretrained("text-bison"):

I1109 14:08:22.419223000 8508514432 chttp2_transport.cc:959]           W:0x1231b8800 CLIENT [ipv4:xx.xx.xx.xx:443] state IDLE -> WRITING [TRANSPORT_FLOW_CONTROL]
I1109 14:08:22.419236000 8508514432 chttp2_transport.cc:959]           W:0x1231b8800 CLIENT [ipv4:xx.xx.xx.xx:443] state WRITING -> WRITING+MORE [INITIAL_WRITE]
I1109 14:08:22.419241000 8508514432 chttp2_transport.cc:959]           W:0x1231b8800 CLIENT [ipv4:xx.xx.xx.xx:443] state WRITING+MORE -> WRITING [begin write in current thread]
I1109 14:08:22.419262000 8508514432 chttp2_transport.cc:959]           W:0x1231b8800 CLIENT [ipv4:xx.xx.xx.xx:443] state WRITING -> IDLE [finish writing]
I1109 14:08:22.427634000 8508514432 parsing.cc:338]                    INCOMING[0x1231b8800]: SETTINGS len:18 id:0x00000000
I1109 14:08:22.427641000 8508514432 frame_settings.cc:233]             CHTTP2:CLI:ipv4:xx.xx.xx.xx:443: got setting MAX_CONCURRENT_STREAMS = 100
I1109 14:08:22.427643000 8508514432 frame_settings.cc:226]             0x1231b8800[cli] adding 983041 for initial_window change
I1109 14:08:22.427645000 8508514432 frame_settings.cc:233]             CHTTP2:CLI:ipv4:xx.xx.xx.xx:443: got setting INITIAL_WINDOW_SIZE = 1048576
I1109 14:08:22.427647000 8508514432 frame_settings.cc:233]             CHTTP2:CLI:ipv4:xx.xx.xx.xx:443: got setting MAX_HEADER_LIST_SIZE = 65536
I1109 14:08:22.427649000 8508514432 chttp2_transport.cc:959]           W:0x1231b8800 CLIENT [ipv4:xx.xx.xx.xx:443] state IDLE -> WRITING [SETTINGS_ACK]
I1109 14:08:22.427651000 8508514432 parsing.cc:338]                    INCOMING[0x1231b8800]: WINDOW_UPDATE len:4 id:0x00000000
I1109 14:08:22.427656000 8508514432 parsing.cc:338]                    INCOMING[0x1231b8800]: SETTINGS:ACK len:0 id:0x00000000
I1109 14:08:22.427677000 8508514432 chttp2_transport.cc:2035]          perform_transport_op[t=0x1231b8800]:  START_CONNECTIVITY_WATCH:watcher=0x6000037812c0:from=READY BIND_POLLSET_SET
I1109 14:08:22.436293000 8508514432 chttp2_transport.cc:959]           W:0x1231b8800 CLIENT [ipv4:xx.xx.xx.xx:443] state WRITING -> WRITING [begin write in current thread]
I1109 14:08:22.436383000 8508514432 chttp2_transport.cc:959]           W:0x1231b8800 CLIENT [ipv4:xx.xx.xx.xx:443] state WRITING -> IDLE [finish writing]
I1109 14:08:38.954401000 8508514432 chttp2_transport.cc:1690]          perform_stream_op[s=0x110927230; op=0x600000584728]:  CANCEL:CANCELLED
I1109 14:08:38.954427000 8508514432 chttp2_transport.cc:1426]          perform_stream_op_locked[s=0x110927230; op=0x600000584728]:  CANCEL:CANCELLED; on_complete = 0x600002c98100
I1109 14:08:38.954452000 8508514432 chttp2_transport.cc:1349]          complete_closure_step: t=0x1231b8800 0x600002c98100 refs=0 flags=0x0000 desc=op->on_complete err=OK write_state=IDLE whence=(null):-1

After some time I get a timeout and a GOAWAY frame.

fran-penedo commented 8 months ago

I found a similar situation when running a CustomContainerTrainingJob in a Compute Engine VM. Interestingly, it only happens when I let it authenticate using the service account associated with the VM. If I set GOOGLE_APPLICATION_CREDENTIALS to a credentials file for the same service account it runs fine.
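
A quick way to tell which of the two authentication paths a process is on is to check whether GOOGLE_APPLICATION_CREDENTIALS points at a readable file. The sketch below is only a diagnostic aid; `adc_uses_key_file` is a hypothetical helper name, not part of any Google library. It relies on the fact that Application Default Credentials consult this environment variable first.

```python
import os


def adc_uses_key_file(environ=None) -> bool:
    """Return True if GOOGLE_APPLICATION_CREDENTIALS names an existing file.

    Hypothetical diagnostic helper. When this returns False,
    google.auth.default() falls back to other credential sources, which on
    a Compute Engine VM typically means the attached service account.
    """
    env = os.environ if environ is None else environ
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    return bool(path) and os.path.isfile(path)
```

Logging the result of `adc_uses_key_file()` at startup makes it easier to correlate the hang with the credential source in use.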

snayan06 commented 6 months ago

https://github.com/langchain-ai/langchain/issues/15222#issuecomment-1878631787

I found a similar case when I used this with the gemini-pro model and LangChain, together with gevent; the Flask API was throwing this error:

Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: Exception ignored in: <function _ChannelCallState.__del__ at 0x7eff08bd9160>
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: Traceback (most recent call last):
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "/opt/engati/faq-semantic-searcher/venv/lib/python3.9/site-packages/grpc/_channel.py", line 1247, in __del__
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: self.channel.close(cygrpc.StatusCode.cancelled,
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 515, in grpc._cython.cygrpc.Channel.close
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 399, in grpc._cython.cygrpc._close
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 420, in grpc._cython.cygrpc._close
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "/usr/local/lib/python3.9/threading.py", line 312, in wait
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: waiter.acquire()
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "/opt/engati/faq-semantic-searcher/venv/lib/python3.9/site-packages/gevent/thread.py", line 121, in acquire
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: acquired = BoundedSemaphore.acquire(self, blocking, timeout)
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_semaphore.py", line 180, in gevent._gevent_c_semaphore.Semaphore.acquire
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_semaphore.py", line 259, in gevent._gevent_c_semaphore.Semaphore.acquire
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_semaphore.py", line 249, in gevent._gevent_c_semaphore.Semaphore.acquire
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: gevent.exceptions.LoopExit: This operation would block forever
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: Hub: <Hub '' at 0x7eff17cece80 epoll default pending=0 ref=0 fileno=6 resolver=<gevent.resolver.thread.Resolver at 0x7efe41976ee0 pool=<ThreadPool at 0x7efe41988640 tasks=0 size=4 maxsize=10 hub=<Hub at 0x7eff17cece80 thread_ident=0x7eff2656e740>>> threadpool=<ThreadPool at 0x7efe41988640 tasks=0 size=4 maxsize=10 hub=<Hub at 0x7eff17cece80 thread_ident=0x7eff2656e740>> thread_ident=0x7eff2656e740>
Jan 05 11:23:47 di-appserver11.dev.engati.local bash[17300]: Handles:

snayan06 commented 6 months ago

Thanks, I was able to resolve this by using the monkey patching described in this PR: https://github.com/grpc/grpc/pull/14561#issue-301487490
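
For anyone landing here, a minimal sketch of that workaround follows. It assumes the linked PR refers to gRPC's experimental gevent support, which grpcio exposes as `grpc.experimental.gevent.init_gevent()`. The wrapper name `enable_gevent_for_grpc` is hypothetical and exists only to keep the example self-contained. I have not verified it against every gevent / google-cloud-aiplatform combination mentioned in this thread.

```python
def enable_gevent_for_grpc() -> bool:
    """Apply gevent monkey patching, then make gRPC gevent-aware.

    Hypothetical wrapper. Returns True when both steps ran, and False when
    gevent or grpcio's experimental gevent module is not installed.
    """
    try:
        from gevent import monkey
    except ImportError:
        return False
    # Patch the standard library first, before gRPC is imported.
    monkey.patch_all()

    try:
        import grpc.experimental.gevent as grpc_gevent
    except ImportError:
        return False
    # Tell gRPC to cooperate with the gevent loop instead of blocking it.
    grpc_gevent.init_gevent()
    return True
```

Call `enable_gevent_for_grpc()` at the very top of the entry point, before importing vertexai or anything else that creates a gRPC channel.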