livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License
537 stars 165 forks source link

Check if model folder exists on startup and request processing #3044

Open eliteprox opened 2 months ago

eliteprox commented 2 months ago

What does this pull request do? Explain your changes. (required)

This PR is dependent on https://github.com/livepeer/ai-worker/pull/79

The change checks if the requested model folder exists when loading during startup (warm only) and gracefully handles the condition of a model folder missing in requests from gateway.

Gateway error log:

I0506 09:29:28.120307 1985227 discovery.go:180] Done fetching orch info numOrch=1 responses=1/1 timedOut=false
I0506 09:29:30.600500 1985227 ai_process.go:344] clientIP=127.0.0.1 request_id=14b57a61 Error submitting request cap=27 modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1 try=1 orch=https://0.0.0.0:8936 err=Insufficient capacity for modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
E0506 09:29:30.600545 1985227 handlers.go:1479] clientIP=127.0.0.1 request_id=14b57a61 Error with API code=503 err=no orchestrators available within 2s timeout

AI Core error log on cold model request:

I0506 09:29:28.121922 1984042 ai_http.go:198] manifestID=27_stabilityai/stable-video-diffusion-img2vid-xt-1-1 orchSessionID=8983c425 clientIP=127.0.0.1 Received request id=6156387e cap=27 modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
2024/05/06 09:29:30 ERROR model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist at /livepeer/ai-core/arbitrum-one-mainnet/models/models--stabilityai--stable-video-diffusion-img2vid-xt-1-1
E0506 09:29:30.600020 1984042 handlers.go:1511] HTTP Response Error 503: Insufficient capacity for modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1

AI Core error log on startup:

2024/05/06 10:04:25 ERROR model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist at /livepeer/ai-core/arbitrum-one-mainnet/models/models--stabilityai--stable-video-diffusion-img2vid-xt-1-1
E0506 10:04:25.144208 2005927 starter.go:549] Error AI worker warming text-to-image container: model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist
I0506 10:04:25.144224 2005927 db.go:368] Closing DB

Specific updates (required)

How did you test each of these updates (required)

  1. Started go-livepeer with aiModels.json config containing a model that does not exist with warm set to true
  2. Started go-livepeer with aiModels.json config containing a model that does not exist with warm set to false
  3. Sent AI request with gateway to go-livepeer running a cold model name that doesn't exist, received immediate error response from orchestrator of 503.

Does this pull request close any open issues? Addresses LIV-117

Checklist: