modular-ml / wrapyfi-examples_llama

Inference code for Facebook LLaMA models with Wrapyfi support
GNU General Public License v3.0

Running a REST API instead of "example.py" #9

Closed: UmutAlihan closed this issue 1 year ago

UmutAlihan commented 1 year ago

I see that "example.py" iteratively generates prompt completions on both machines (i.e., on instances with the model layers partially loaded onto their GPUs). Is there any way to utilize multiple GPUs across multiple machines to deploy a REST API service, so that I can send prompts as requests from other services?

fabawi commented 1 year ago

> I see that "example.py" iteratively generates prompt completions on both machines (i.e., on instances with the model layers partially loaded onto their GPUs). Is there any way to utilize multiple GPUs across multiple machines to deploy a REST API service, so that I can send prompts as requests from other services?

This would be quite easy to accomplish by wrapping the generation invocation in a Wrapyfi-decorated method. There are several examples at https://github.com/fabawi/wrapyfi that can help you achieve this. Unfortunately, I don't have access to multiple GPUs at the moment, so a PR would be welcome.
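
For anyone landing here, a minimal sketch of what that could look like, assuming a Wrapyfi version that supports the request/reply (REQ/REP) pattern over ZeroMQ (where the requester's call arguments are forwarded to the replier): the GPU machine(s) run the decorated method in "reply" mode, while a lightweight front end runs it in "request" mode and exposes it over Flask. The class and topic names (`LlamaService`, `/llama/generate`), the Flask endpoint, and the `build_generator()` helper are illustrative assumptions, not part of this repo:

```python
# Hedged sketch, untested: names below are illustrative, not from this repo.
# Assumes a Wrapyfi version with the request/reply (REQ/REP) pattern, and a
# build_generator() helper that loads the model the same way example.py does.
from flask import Flask, request, jsonify
from wrapyfi.connect.wrapper import MiddlewareCommunicator


class LlamaService(MiddlewareCommunicator):
    @MiddlewareCommunicator.register(
        "NativeObject", "zeromq", "LlamaService", "/llama/generate",
        carrier="tcp", should_wait=True)
    def generate(self, prompt, max_gen_len=256, temperature=0.8, top_p=0.95):
        # This body only executes on the machine(s) acting as the replier,
        # where `self.generator` holds the (possibly sharded) LLaMA model.
        results = self.generator.generate(
            [prompt], max_gen_len=max_gen_len,
            temperature=temperature, top_p=top_p)
        return {"completion": results[0]},


service = LlamaService()

# On the GPU machine(s), load the model and serve replies:
#   service.generator = build_generator()  # load weights as in example.py
#   service.activate_communication(service.generate, mode="reply")
#
# On the lightweight front-end machine, act as the requester and expose REST:
service.activate_communication(service.generate, mode="request")

app = Flask(__name__)


@app.route("/generate", methods=["POST"])
def generate_endpoint():
    # Forward the HTTP prompt to the GPU machine(s) over Wrapyfi and return
    # the completion once the replier responds.
    payload = request.get_json()
    result, = service.generate(payload["prompt"])
    return jsonify(result)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

With the front end running, other services could then send prompts with an ordinary HTTP request, e.g. `curl -X POST http://<frontend>:8000/generate -H 'Content-Type: application/json' -d '{"prompt": "Hello"}'`. Keeping the Flask process on a machine without GPUs and letting Wrapyfi carry the prompt to the model hosts is the point of the decorated method.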