lhenault / simpleAI

An easy way to host your own AI API and expose alternative models, while being compatible with "open" AI clients.
https://pypi.org/project/simple-ai-server/
MIT License

Parallelism #33

Closed Nintorac closed 1 year ago

Nintorac commented 1 year ago

Hey,

Do you know how to set parallelism? I have wrapped a few APIs, e.g. an Azure OpenAI endpoint, and I can't seem to get it to serve requests in parallel.

I have tried modifying the number of threads assigned to the server but don't get any speedup, i.e. like here. Any ideas what I'm missing?

lhenault commented 1 year ago

I suspect increasing the number of threads isn't giving you any speedup because threading helps with IO-bound tasks, while here you're probably CPU-bound (or GPU-bound).

How about using something like Kubernetes Deployments and increasing the number of replicas (e.g. with kubectl scale)? It's a more involved process than just increasing the number of threads, but it might also be more flexible.
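
For instance, a minimal sketch, assuming SimpleAI is already running as a Deployment (the name `simple-ai` is hypothetical):

```sh
# Scale an existing Deployment to 4 replicas (the name "simple-ai" is hypothetical)
kubectl scale deployment simple-ai --replicas=4

# Check that the new replicas come up
kubectl get deployment simple-ai
```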

Nintorac commented 1 year ago

Nah, this is calling out to OpenAI (in Azure) to do the inference, so it's not CPU- or GPU-bound on my end.

lhenault commented 1 year ago

I'm slightly confused: are you forwarding requests to OpenAI through a SimpleAI instance?

Nintorac commented 1 year ago

Yeah, it was easier than rewriting my consumer code to accommodate the config differences for Azure.

lhenault commented 1 year ago

I was thinking about adding some proxy type of backend which just passes the query to another URL, exactly for that kind of use case. That should be quick to implement and would get rid of the gRPC dependency / bottleneck for this use case. Happy to start working on this if you think it's worth it.
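
As a rough sketch of the idea, assuming FastAPI and httpx (the route and upstream URL are placeholders, not SimpleAI's actual API):

```python
# Rough sketch of a pass-through "proxy" backend: forwards chat completion
# requests unchanged to an upstream OpenAI-compatible URL and returns the
# upstream answer as-is. The route and UPSTREAM value are placeholders.
import httpx
from fastapi import FastAPI, Request, Response

UPSTREAM = "https://my-resource.openai.azure.com/openai"  # hypothetical upstream

app = FastAPI()

@app.post("/chat/completions")
async def proxy_chat(request: Request) -> Response:
    # Forward the raw request body without parsing it
    body = await request.body()
    async with httpx.AsyncClient(timeout=60.0) as client:
        upstream = await client.post(
            f"{UPSTREAM}/chat/completions",
            content=body,
            headers={"Content-Type": "application/json"},
        )
    # Relay the upstream response and status code back to the client
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type="application/json",
    )
```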

Nintorac commented 1 year ago

ooh yeh, that would be cool!

Nintorac commented 1 year ago

Oh, silly me, I just needed to scale the number of FastAPI workers using Gunicorn! This also seems to work with actual models; I didn't realise it would be that easy to share the weights between workers.
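
For reference, a sketch of that setup (the `app:app` module path, worker count, and port are placeholders):

```sh
# Serve the FastAPI app with 4 independent worker processes.
# "app:app" is a placeholder for the actual module:variable path.
gunicorn app:app \
    --workers 4 \
    --worker-class uvicorn.workers.UvicornWorker \
    --bind 0.0.0.0:8080
```

Worth noting that Gunicorn workers are separate processes rather than threads, so each worker loads its own copy of the model unless you pass `--preload` to load the app before forking, which shares memory copy-on-write.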