modular-ml / wrapyfi-examples_llama

Inference code for Facebook LLaMA models with Wrapyfi support
GNU General Public License v3.0

Running a REST API instead of "example.py" #9

Closed: UmutAlihan closed this issue 1 year ago

UmutAlihan commented 1 year ago

I see that "example.py" iteratively generates prompt completions on both machines (i.e., on instances with the model layers partially loaded onto their GPUs). Is there any way to utilize multiple GPUs across multiple machines to deploy a REST API service, so that I can send prompts as requests from other services?

fabawi commented 1 year ago

> I see that "example.py" iteratively generates prompt completions on both machines (i.e., on instances with the model layers partially loaded onto their GPUs). Is there any way to utilize multiple GPUs across multiple machines to deploy a REST API service, so that I can send prompts as requests from other services?

This would be quite easy to accomplish by wrapping the generation invocation in a Wrapyfi-decorated method. There are several examples at https://github.com/fabawi/wrapyfi that can help you achieve this. Unfortunately, I don't have access to multiple GPUs at the moment, so a PR would be welcome.
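
For anyone landing here, a minimal sketch of what that could look like, assuming a Wrapyfi version that supports the request/reply (REQ/REP) pattern over ZeroMQ (where the requester's call arguments are forwarded to the replier): the GPU machine(s) run the decorated method in "reply" mode, while a lightweight front end runs it in "request" mode and exposes it over Flask. The class and topic names (`LlamaService`, `/llama/generate`), the Flask endpoint, and the `build_generator()` helper are illustrative assumptions, not part of this repo:

```python
# Hedged sketch, untested: names below are illustrative, not from this repo.
# Assumes a Wrapyfi version with the request/reply (REQ/REP) pattern, and a
# build_generator() helper that loads the model the same way example.py does.
from flask import Flask, request, jsonify
from wrapyfi.connect.wrapper import MiddlewareCommunicator


class LlamaService(MiddlewareCommunicator):
    @MiddlewareCommunicator.register(
        "NativeObject", "zeromq", "LlamaService", "/llama/generate",
        carrier="tcp", should_wait=True)
    def generate(self, prompt, max_gen_len=256, temperature=0.8, top_p=0.95):
        # This body only executes on the machine(s) acting as the replier,
        # where `self.generator` holds the (possibly sharded) LLaMA model.
        results = self.generator.generate(
            [prompt], max_gen_len=max_gen_len,
            temperature=temperature, top_p=top_p)
        return {"completion": results[0]},


service = LlamaService()

# On the GPU machine(s), load the model and serve replies:
#   service.generator = build_generator()  # load weights as in example.py
#   service.activate_communication(service.generate, mode="reply")
#
# On the lightweight front-end machine, act as the requester and expose REST:
service.activate_communication(service.generate, mode="request")

app = Flask(__name__)


@app.route("/generate", methods=["POST"])
def generate_endpoint():
    # Forward the HTTP prompt to the GPU machine(s) over Wrapyfi and return
    # the completion once the replier responds.
    payload = request.get_json()
    result, = service.generate(payload["prompt"])
    return jsonify(result)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

With the front end running, other services could then send prompts with an ordinary HTTP request, e.g. `curl -X POST http://<frontend>:8000/generate -H 'Content-Type: application/json' -d '{"prompt": "Hello"}'`. Keeping the Flask process on a machine without GPUs and letting Wrapyfi carry the prompt to the model hosts is the point of the decorated method.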