Closed fryz closed 3 months ago
Thanks for reaching out!
The FastAPI app you have set up here is probably not needed for this particular use case of deploying and calling a module (but do let us know if there's something specific you had in mind). Behind the scenes, Runhouse spins up a FastAPI server on the cluster, which allows you to call the module directly via an HTTP request. What you're seeing are the logs of the Runhouse server, stored on the cluster at `~/.rh/server.log`.
The below snippet should likely be all you need to deploy and call this module:
```python
import runhouse as rh
import numpy as np
from scipy.special import softmax
from transformers import AutoModelForSequenceClassification, AutoConfig
from transformers import AutoTokenizer


class SentimentAnalysis:
    def __init__(self, model_name="cardiffnlp/twitter-roberta-base-sentiment-latest"):
        self.model_name = model_name
        self.model = None
        self.config = None
        self.tokenizer = None

    @staticmethod
    def preprocess(text):
        """Preprocess text (username and link placeholders)"""
        new_text = []
        for t in text.split(" "):
            t = '@user' if t.startswith('@') and len(t) > 1 else t
            t = 'http' if t.startswith('http') else t
            new_text.append(t)
        return " ".join(new_text)

    def predict(self, text):
        # Lazily load the model, config, and tokenizer on the first call
        if self.model is None:
            self.model = AutoModelForSequenceClassification.from_pretrained(self.model_name)
        if self.config is None:
            self.config = AutoConfig.from_pretrained(self.model_name)
        if self.tokenizer is None:
            self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        text = SentimentAnalysis.preprocess(text)
        encoded_input = self.tokenizer(text, return_tensors='pt')
        output = self.model(**encoded_input)
        scores = output[0][0].detach().numpy()
        scores = softmax(scores)
        ranking = np.argsort(scores)[::-1]
        l_scores = {}
        for i in range(scores.shape[0]):
            label = self.config.id2label[ranking[i]]
            l_scores[label] = np.round(float(scores[ranking[i]]), 4)
        return l_scores


cluster = rh.ondemand_cluster(
    name="fastapi-runhouse-example",
    instance_type="CPU:2+",
    provider="aws",
    region="us-east-1",
).up_if_not()

# Set up an env using the requirements specified in the working dir
my_env = rh.env(name="scorer_env", working_dir="./")

# Send the module and its associated env to the cluster (or reload it if it
# already exists on the cluster)
RemoteScorer = rh.module(SentimentAnalysis, name="remote-scorer").to(cluster, env=my_env)

# Generate a URL that can be used to call the module's "predict" method from anywhere
base_url = f"{RemoteScorer.endpoint()}/predict"
print(base_url)
```
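As a quick local sanity check (no cluster or transformers install needed), the preprocessing and score-ranking logic inside `predict` can be exercised on their own; the `softmax` below is a pure-Python stand-in for `scipy.special.softmax`, and the input strings are made up for illustration:

```python
import math

def preprocess(text):
    # Same placeholder logic as SentimentAnalysis.preprocess:
    # mask usernames and links before tokenization.
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

def softmax(xs):
    # Pure-Python stand-in for scipy.special.softmax
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(preprocess("@bob check https://example.com"))
# -> @user check http

# Ranking mirrors predict(): sort label indices by descending score
logits = [2.0, 0.5, -1.0]
scores = softmax(logits)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
# -> [0, 1, 2]
```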
For example, to then call the predict method via cURL:

```shell
curl "http://<CLUSTER-IP>:32300/remote-scorer/predict?text=Some text"
```
A couple other things to note based on the above snippet:

- You can use `get_or_to` instead of `to` when sending the module to the cluster. This allows you to load the existing module by its name if it was already previously saved on the cluster.
- You can SSH into the cluster with `ssh fastapi-runhouse-example` and restart the Runhouse server anytime by running `runhouse restart` on the cluster.
- By caching the model, config, and tokenizer as attributes of the `SentimentAnalysis` module, we can prevent unnecessary reloading when hitting the `predict` endpoint.

Thanks for following up @jlewitt1
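The caching point about the `SentimentAnalysis` module can be sketched in isolation; `LazyScorer` and its methods below are toy, hypothetical names standing in for the real model loading:

```python
class LazyScorer:
    """Toy stand-in for SentimentAnalysis: heavy objects are loaded
    once on the first call and cached as instance attributes."""
    def __init__(self):
        self.model = None
        self.load_count = 0  # track how many times we "load" the model

    def _load_model(self):
        self.load_count += 1
        return "heavy-model"

    def predict(self, text):
        if self.model is None:  # only pay the loading cost once
            self.model = self._load_model()
        return f"{self.model} scored: {text}"

scorer = LazyScorer()
scorer.predict("first call loads the model")
scorer.predict("second call reuses it")
print(scorer.load_count)
# -> 1
```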
I probably should have filled in some additional context - I caught up with Donny in our office earlier this week to discuss what we were planning on doing, but didn't fill in details in this issue.
High level, what I'm looking to accomplish is to manage the GPU cluster within the business logic of an application we're developing. Specifically, the goal is to run inference on a handful of ML models on GPUs while running the service itself on CPUs. Our service is implemented with FastAPI, and clients interact with it through the service's API. I want the GPU interaction to be completely opaque to the end user, since it's an implementation detail: they shouldn't know where the inference is running.
One thing I like about Runhouse is that it looks like I can tie the lifecycle of the cluster to the lifecycle of the application, e.g. start the cluster when the app starts up and terminate it when the app spins down. I also talked with Donny about how to support autoscaling and service discovery mechanics as well. It seems like I can manage the infrastructure through our application rather than having to build and support these services through our deployment artifacts.
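Tying the cluster lifecycle to the app lifecycle is the shape of FastAPI's lifespan hook. Here's a minimal sketch of that pattern using only the standard library, with stubs (`bring_up_cluster`, `teardown_cluster` are hypothetical names) in place of the real Runhouse launch and teardown calls; FastAPI would drive the context manager itself via `FastAPI(lifespan=lifespan)`, so we drive it by hand here:

```python
import asyncio
from contextlib import asynccontextmanager

events = []

def bring_up_cluster():
    # Stub for something like rh.ondemand_cluster(...).up_if_not()
    events.append("cluster up")

def teardown_cluster():
    # Stub for tearing the cluster back down
    events.append("cluster down")

@asynccontextmanager
async def lifespan(app):
    bring_up_cluster()   # runs once at startup
    yield                # app serves requests while suspended here
    teardown_cluster()   # runs once at shutdown

async def serve():
    async with lifespan(None):
        events.append("serving")

asyncio.run(serve())
print(events)
# -> ['cluster up', 'serving', 'cluster down']
```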
The intent in opening this issue was to highlight a bug in the `Cluster.up_if_not()` method: it seems like the process/thread that initializes the cluster never returns control to the main thread. So when I boot up my service and Runhouse brings up the GPU cluster for the first time, the service hangs and requires a restart.
Does this make sense?
You can see what I mean if you run the app in my example code. The first time you run it, it will bring up the cluster, but the process will hang and FastAPI won't serve requests (e.g. the docs page at localhost:8000/docs or the API at localhost:8000/ping). But if you terminate the process and relaunch, it will detect that the cluster is already up and FastAPI will serve its API.
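As a general aside (separate from whatever the underlying bug turns out to be), one way to keep a long blocking launch from freezing an async server's event loop is to push it onto a worker thread with `asyncio.to_thread`; a sketch with a short sleep standing in for the real `up_if_not()` call:

```python
import asyncio
import time

def up_if_not():
    # Stand-in for the blocking cluster launch
    time.sleep(0.1)
    return "cluster-up"

async def startup():
    # Run the blocking call off the event loop so the server
    # (e.g. its /docs and /ping routes) can keep responding.
    status = await asyncio.to_thread(up_if_not)
    return status

print(asyncio.run(startup()))
# -> cluster-up
```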
Hi Zach! Thanks for raising this and the detailed repro. I've been offline for a couple of days for the Jewish holidays and didn't get a chance to share the context with Josh, so thanks for the detail here. I've reproduced your error and figured out that it's a minor bug which we fixed on main but haven't released yet, and for some reason it wasn't being surfaced through FastAPI's lifespan feature (it also wasn't being surfaced when I ran with `uvicorn app:app`). I've confirmed that the launch works properly in your script on runhouse@main, and we're planning to release within the next couple of days (note that if you try upgrading to `main`, be sure to upgrade SkyPilot too, because we also bumped the SkyPilot version to 0.6.0 in the latest release; you may want to take down any running clusters before doing that). I'll update you here when we release the fix.
As an aside, I also noticed that the code doesn't complete because the working_dir isn't being recognized from that requirements.txt; the .git root one directory above is taking precedence. I think we want to change that behavior soon (and some other working_dir things as well), but in the meantime you can explicitly set the working_dir in an `rh.env`, or simply move the requirements.txt one directory higher so it's recognized and installed on the cluster (I also confirmed that if you do this, your repro will run through in full, see below):
```shell
curl -X POST "http://127.0.0.1:8000/score?text=Good"
{"positive":0.6844,"neutral":0.2628,"negative":0.0527}

curl -X POST "http://127.0.0.1:8000/score?text=This\restaurant\is\bad"
{"negative":0.951,"neutral":0.0434,"positive":0.0056}
```
Rad - thanks for the update. I'll watch this issue and let you know if it works after your next release.
Hey Zach - we released yesterday and I've confirmed this is fixed (though I'm still moving the requirements.txt to the git root directory). Let me know if you still face any breakage with it.
Describe the bug
Example code: https://github.com/fryz/funhouse/tree/zf/fastapi/fastapi
When using FastAPI's lifespan events (`asynccontextmanager`) to bring a cluster up, the cluster comes up but then hangs without returning to the server initialization logic.
Terminating the FastAPI process and bringing it back online recognizes the cluster and works.
Versions
Additional context
Logs from the startup: