turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Test_inference for a single prompt on an array of input texts #282

Closed: tednas closed this issue 3 months ago

tednas commented 8 months ago

Suppose we have an array of 1000 input texts and we want to apply a single prompt to all of them, say sentiment analysis on all 1000 samples. Of course, we can run inference one by one for each input text, which means 1000 separate inference runs. But to accelerate this with parallel processing on the GPU, can we apply test_inference to a full batch (or, for example, to micro-batches of 50 input texts at a time, given potential CUDA memory constraints)?

I see a test sample using model.forward(), but I'm not sure whether a similar approach works for the use case above.

Thanks in advance.

turboderp commented 8 months ago

I've added an example of how you can do batched inference here.

Note that the model itself just takes a tensor of input IDs, which may be batched, and the sampler can take a batched tensor of logits, so there are other ways to approach batching as well.
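
For reference, a minimal sketch of what micro-batched generation could look like with the base generator, assuming generate_simple() accepts a list of prompts and pads/batches them internally. The model path, prompt template, batch size, and sample texts below are placeholders, not taken from the linked example:

```python
# Sketch: micro-batched sentiment classification with exllamav2's base generator.
# Assumes generate_simple() accepts a list of prompts; paths and sizes are placeholders.

from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_dir = "/path/to/model"   # placeholder: directory with the converted model
batch_size = 50                # micro-batch size; reduce if you hit CUDA OOM
max_new_tokens = 8             # a sentiment label only needs a few tokens

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, batch_size = batch_size, lazy = True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.top_k = 1             # greedy decoding for deterministic labels

texts = ["I loved this!", "Terrible service."]  # stand-in for the 1000 texts
template = ("Classify the sentiment of the following text as positive or "
            "negative.\n\nText: {}\nSentiment:")

results = []
for i in range(0, len(texts), batch_size):
    prompts = [template.format(t) for t in texts[i : i + batch_size]]
    # generate_simple() returns prompt + completion for each batch entry,
    # so trim the prompt off to keep just the generated label
    outputs = generator.generate_simple(prompts, settings, max_new_tokens)
    results.extend(o[len(p):].strip() for p, o in zip(prompts, outputs))

print(results)
```

The lower-level route mentioned above would skip the generator entirely: pass a [batch_size, seq_len] tensor of token IDs to model.forward() and sample from the batched logits it returns.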

tednas commented 8 months ago

Thank you so much, the way you maintain this project is awesome!