This change checks that the requested model folder exists when loading models at startup (warm models only) and gracefully handles gateway requests for a model whose folder is missing.
It improves response times on the network by immediately returning a 503 API error code when the orchestrator is missing the model, which is primarily useful for cold models.
It also improves orchestrator onboarding by logging the exact path where the container looks for the model, both on startup and on individual requests when the model is not found.
Gateway error log:
I0506 09:29:28.120307 1985227 discovery.go:180] Done fetching orch info numOrch=1 responses=1/1 timedOut=false
I0506 09:29:30.600500 1985227 ai_process.go:344] clientIP=127.0.0.1 request_id=14b57a61 Error submitting request cap=27 modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1 try=1 orch=https://0.0.0.0:8936 err=Insufficient capacity for modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
E0506 09:29:30.600545 1985227 handlers.go:1479] clientIP=127.0.0.1 request_id=14b57a61 Error with API code=503 err=no orchestrators available within 2s timeout
AI Core error log on cold model request:
I0506 09:29:28.121922 1984042 ai_http.go:198] manifestID=27_stabilityai/stable-video-diffusion-img2vid-xt-1-1 orchSessionID=8983c425 clientIP=127.0.0.1 Received request id=6156387e cap=27 modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
2024/05/06 09:29:30 ERROR model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist at /livepeer/ai-core/arbitrum-one-mainnet/models/models--stabilityai--stable-video-diffusion-img2vid-xt-1-1
E0506 09:29:30.600020 1984042 handlers.go:1511] HTTP Response Error 503: Insufficient capacity for modelID=stabilityai/stable-video-diffusion-img2vid-xt-1-1
AI Core error log on startup:
2024/05/06 10:04:25 ERROR model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist at /livepeer/ai-core/arbitrum-one-mainnet/models/models--stabilityai--stable-video-diffusion-img2vid-xt-1-1
E0506 10:04:25.144208 2005927 starter.go:549] Error AI worker warming text-to-image container: model stabilityai/stable-video-diffusion-img2vid-xt-1-1 does not exist
I0506 10:04:25.144224 2005927 db.go:368] Closing DB
Specific updates (required)
This code checks whether the given model exists on startup and when processing requests.
Uses a new method ModelExists in ai-worker that returns a boolean indicating whether the specific model folder exists.
How did you test each of these updates (required)
Started go-livepeer with an aiModels.json config containing a model that does not exist, with warm set to true.
Started go-livepeer with an aiModels.json config containing a model that does not exist, with warm set to false.
Sent an AI request through the gateway to go-livepeer configured with a cold model name that doesn't exist; received an immediate 503 error response from the orchestrator.
Does this pull request close any open issues?
Addresses LIV-117
What does this pull request do? Explain your changes. (required)
This PR is dependent on https://github.com/livepeer/ai-worker/pull/79
Checklist:
make runs successfully
./test.sh passes