Closed: Nintorac closed this issue 1 year ago.

Hey,
Do you know how to set parallelism? I've wrapped a few APIs (e.g. an Azure OpenAI endpoint), but I can't seem to get the server to handle requests in parallel.
I've tried modifying the number of threads assigned to the server but don't get any speed-ups, i.e. like here. Any ideas what I'm missing?
I suspect the number of threads isn't giving you any speed-up because threading works well for IO-bound tasks, while here you're probably CPU-bound (or GPU-bound).
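To illustrate the CPU-bound case, here's a minimal, self-contained sketch (not code from SimpleAI): in CPython the GIL serialises pure-Python work, so extra threads add no throughput.

```python
# Minimal sketch (not SimpleAI code): the GIL means threads add no
# throughput for CPU-bound pure-Python work.
import threading
import time


def busy(n: int = 10_000_000) -> None:
    # Pure-Python loop: holds the GIL for its entire runtime
    total = 0
    for i in range(n):
        total += i


def run_threads(num_threads: int) -> float:
    """Run `busy` in `num_threads` threads and return the wall-clock time."""
    threads = [threading.Thread(target=busy) for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Expect roughly 4x the single-thread time: the threads run
    # serially under the GIL, so there is no per-task speed-up.
    print(f"1 thread : {run_threads(1):.2f}s")
    print(f"4 threads: {run_threads(4):.2f}s")
```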
How about using something like Kubernetes Deployments and increasing the number of replicas (e.g. with `kubectl scale`)? It's a more involved process than just increasing the number of threads, but it might also be more flexible.
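For reference, scaling a hypothetical Deployment named `simpleai` to four replicas would look like `kubectl scale deployment simpleai --replicas=4` (the Deployment name here is just a placeholder for illustration).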
Nah, this is calling out to OpenAI (in Azure) to do the inference, so the server itself should be IO-bound.
I'm slightly confused: are you forwarding requests to OpenAI through a SimpleAI instance?
Yeah, it was easier than rewriting my consumer code to accommodate Azure's config differences.
I was thinking about adding some proxy type of backend which just passes the query through to another URL, exactly for this kind of use case. That should be quick to implement and would get rid of the gRPC dependency/bottleneck for this use case. Happy to start working on this if you think it's worth it.
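Roughly what I have in mind, as a minimal sketch (the class and method names, the `httpx` dependency, and the Azure `api-key` header are my assumptions here, not SimpleAI's actual interface):

```python
# Illustrative pass-through ("proxy") backend sketch; names are assumptions.
import httpx


class ProxyBackend:
    """Forwards OpenAI-format payloads verbatim to another URL."""

    def __init__(self, upstream_url: str, api_key: str):
        self.upstream_url = upstream_url
        # Azure OpenAI authenticates with an `api-key` header
        self.headers = {"api-key": api_key}
        # One long-lived async client so concurrent requests share a pool
        self.client = httpx.AsyncClient(timeout=60.0)

    async def complete(self, payload: dict) -> dict:
        resp = await self.client.post(
            self.upstream_url, json=payload, headers=self.headers
        )
        resp.raise_for_status()
        return resp.json()
```

Since it's async end-to-end, many requests can be in flight at once without any extra threads.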
ooh yeh, that would be cool!
Oh, silly me, I just needed to scale the number of FastAPI workers using Gunicorn! This also seems to work with actual models; I didn't realise it would be that easy to share the weights between workers.
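For anyone landing here later: the usual pattern is something like `gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker` (the `app:app` module path is a placeholder). Note that Gunicorn workers are separate processes; with `--preload` the app is loaded once before forking, so the workers share the loaded weights copy-on-write instead of each loading their own copy.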