googleapis / python-aiplatform

A Python SDK for Vertex AI, a fully managed, end-to-end platform for data science and machine learning.
Apache License 2.0

Running `.from_pretrained` in ThreadPoolExecutor causes deadlock #4342

Open roman-romanov-o opened 2 months ago

roman-romanov-o commented 2 months ago

Running multiple `.from_pretrained` calls in a ThreadPoolExecutor causes a deadlock during a gRPC call.

Environment details

Steps to reproduce

  1. Install same version of library
  2. Run script from code sample

Code example

Script that reproduces the deadlock:

import asyncio
import os
from concurrent.futures import ThreadPoolExecutor

import vertexai
from vertexai.preview.language_models import ChatModel

async def main():
    DEFAULT_REGION = os.environ["DEFAULT_REGION"]
    GCP_PROJECT_ID = os.environ["GCP_PROJECT_ID"]
    assert DEFAULT_REGION
    assert GCP_PROJECT_ID
    vertexai.init(project=GCP_PROJECT_ID, location=DEFAULT_REGION)
    print("Vertex client initialized")

    loop = asyncio.get_event_loop()
    sync_tasks = [
        lambda: ChatModel.from_pretrained("chat-bison@001"),
        lambda: ChatModel.from_pretrained("chat-bison-32k@002"),
    ]

    with ThreadPoolExecutor() as executor:
        tasks = [loop.run_in_executor(executor, task) for task in sync_tasks]
        await asyncio.gather(*tasks)
    print("Models are loaded")

if __name__ == "__main__":
    asyncio.run(main())

Stack trace

gdb stack trace:

(gdb) info threads
  Id   Target Id                                             Frame 
* 1    Thread 0x7989e4665b80 (LWP 2493484) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7ffca9b90980, op=137, 
    expected=0, futex_word=0x7989e4563eb0 <_PyRuntime+432>) at ./nptl/futex-internal.c:57
  2    Thread 0x7989d99ff640 (LWP 2493495) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7989dc3eade0 <thread_status+96>) at ./nptl/futex-internal.c:57
  3    Thread 0x7989d91fe640 (LWP 2493496) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7989dc3eae60 <thread_status+224>) at ./nptl/futex-internal.c:57
  4    Thread 0x7989d89fd640 (LWP 2493497) "python"          __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x7989dc3eaee0 <thread_status+352>) at ./nptl/futex-internal.c:57
  5    Thread 0x7989d16ff640 (LWP 2493498) "python"          __futex_abstimed_wait_common64 (private=-781204736, cancel=true, abstime=0x7989d16fc480, op=137, 
    expected=0, futex_word=0x7989e4563eb0 <_PyRuntime+432>) at ./nptl/futex-internal.c:57
  6    Thread 0x7989d0efe640 (LWP 2493499) "python"          syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7989cbfff640 (LWP 2493500) "default-executo" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  8    Thread 0x7989cb7fe640 (LWP 2493501) "resolver-execut" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  9    Thread 0x7989caffd640 (LWP 2493502) "grpc_global_tim" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

py-bt stack

(gdb) py-bt
Traceback (most recent call first):
  Waiting for the GIL
  (unable to read python frame information)
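
Since `py-bt` could not read the Python frames here, the standard-library `faulthandler` module may be another way to get Python-level stacks from the hung process. The following is a minimal sketch and was not part of the original report; the helper names `arm_hang_dump` and `disarm_hang_dump` are hypothetical, and it is an assumption that the dump still fires in this particular hang.

```python
import faulthandler
import sys


def arm_hang_dump(timeout_s, out=sys.stderr):
    # After timeout_s seconds, write the Python stack of every thread to `out`.
    # exit=False keeps the process alive so gdb can still be attached afterwards.
    # `out` must be a real file object with a file descriptor (fileno()).
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=out, exit=False)


def disarm_hang_dump():
    # Cancel the pending dump once the guarded section has finished normally.
    faulthandler.cancel_dump_traceback_later()
```

In the reproduction script, `arm_hang_dump(30)` could be called right after `vertexai.init(...)` and `disarm_hang_dump()` after the "Models are loaded" print, so a dump is produced only if loading stalls for longer than 30 seconds.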

roman-romanov-o commented 2 months ago

In the script above, if we leave only one task

    sync_tasks = [
        lambda: ChatModel.from_pretrained("chat-bison@001"),
        #lambda: ChatModel.from_pretrained("chat-bison-32k@002"),
    ]

everything proceeds normally. The deadlock occurs only when multiple sync_tasks are submitted to the thread pool.
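
Given that a single task completes, one possible stopgap is to keep the thread pool but let only one loader run at a time. The sketch below is an untested assumption, not a confirmed fix: it has not been verified against `ChatModel.from_pretrained`, and the names `serialized`, `load_all` and `_load_lock` are hypothetical helpers introduced for illustration.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# One process-wide lock shared by all wrapped loaders.
_load_lock = threading.Lock()


def serialized(loader):
    """Wrap a blocking callable so that only one call runs at a time across threads."""
    def wrapper():
        with _load_lock:
            return loader()
    return wrapper


async def load_all(loaders):
    """Run blocking loaders in a thread pool, one at a time, without blocking the event loop."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as executor:
        futures = [loop.run_in_executor(executor, serialized(fn)) for fn in loaders]
        # gather() returns results in the same order as `loaders`.
        return await asyncio.gather(*futures)
```

With the script above, `await load_all(sync_tasks)` would replace the `with ThreadPoolExecutor()` block. The event loop stays responsive while the models load, at the cost of loading them sequentially.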