guidance-ai / guidance

A guidance language for controlling large language models.

Batched Inference to Improve GPU Utilisation #493

Open lachlancahill opened 10 months ago

lachlancahill commented 10 months ago

Is your feature request related to a problem? Please describe.
When using this library in a loop, I am getting poor GPU utilisation running zephyr-7b.
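
A rough sketch of the loop pattern in question (model name, prompts, and the gen call are illustrative, using guidance's Transformers API):

```python
from guidance import models, gen

# Illustrative sketch: one guidance program per prompt, run sequentially,
# so the GPU idles whenever CPU-side work happens between generations.
lm = models.Transformers("HuggingFaceH4/zephyr-7b-beta")  # assumed model

prompts = ["First input ...", "Second input ..."]  # placeholder inputs
results = []
for prompt in prompts:
    out = lm + prompt + gen("answer", max_tokens=64)
    results.append(out["answer"])
```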

Describe the solution you'd like
It would be fantastic to be able to pass a list of prompts to a function of the Transformers class and define a batch size, as you can for a Hugging Face pipeline. This significantly improves speed and GPU utilisation.
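
For comparison, a minimal sketch of what batching looks like with a plain Hugging Face pipeline today (model name and batch size are illustrative, not a proposed guidance API):

```python
from transformers import pipeline

# Passing a list of prompts plus a batch_size lets the pipeline group
# generations into GPU batches instead of running one prompt at a time.
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",  # assumed model
    device_map="auto",
    batch_size=8,
)
pipe.tokenizer.pad_token_id = pipe.model.config.eos_token_id  # enable padding

prompts = [f"Question {i}: ..." for i in range(32)]  # placeholder inputs
outputs = pipe(prompts, max_new_tokens=64)
```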

Additional context
GPU utilisation for reference: [screenshot attached]

drachs commented 10 months ago

I feel like this is very important. If they don't implement batch inferencing, I can't really consider it over llama.cpp's GBNF grammars.

darrenangle commented 10 months ago

+1

@drachs Does GBNF in ggml support batched inference with different grammar constraints per generation in the batch? Is that even possible? Would love some guidance, if you please.

drachs commented 10 months ago

I'm not very strong on the theory, but llama.cpp does support continuous batched inference with a grammar file. It has had grammar support and continuous batching support for a while, but my understanding is the two didn't start working together until this PR; there may be some clues in there: https://github.com/ggerganov/llama.cpp/pull/3624

You can try it out yourself; here are some instructions from my notes on how to use this with Docker. Note that the version in the public Docker images doesn't work; I assume they must have been published prior to the fix in October.

```
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
docker run --gpus all -v .:/models -it --entrypoint /bin/bash local/llama.cpp:full-cuda
./parallel -m /models/ --grammar-file grammars/json.gbnf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 10 -ns 128 -n 100 -cb
```

Jbollenbacher commented 9 months ago

FWIW I get like 95+% utilization when running inference on Mac Metal (specifically using Mistral-7b, used via repeated ollama REST API queries).

Speculating, I feel like this has something to do with memory bandwidth on the 3090 setup. Not sure though.

lachlancahill commented 9 months ago

> FWIW I get like 95+% utilization when running inference on Mac Metal (specifically using Mistral-7b, used via repeated ollama REST API queries).
>
> Speculating, I feel like this has something to do with memory bandwidth on the 3090 setup. Not sure though.

Thanks, that's interesting to know.

I think it's unlikely to be a memory bandwidth issue. The 3090 is 90-100% utilised when using the same model via huggingface transformers (with much better throughput).

To speculate myself, I suspect the issue is that much of the processing done in this library is CPU-bound. When running in a loop, the GPU waits while the CPU-bound processing runs, and then the CPU waits while GPU inference runs. This is why it would be great to see batch inference implemented: while the CPU is processing the output of the first item, the GPU can begin running inference on the next, so the two aren't waiting on each other and can work at the same time.
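
A conceptual sketch of that overlap (toy stand-ins only, not guidance internals): the GPU generates batch N while the CPU post-processes batch N-1.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def gpu_generate(batch):
    # Stand-in for a batched model.generate(...) call on the GPU.
    time.sleep(0.1)
    return [f"output for {item}" for item in batch]

def cpu_postprocess(outputs):
    # Stand-in for CPU-side parsing / constraint bookkeeping.
    return [o.upper() for o in outputs]

batches = [["a", "b"], ["c", "d"], ["e", "f"]]
results = []

with ThreadPoolExecutor(max_workers=1) as cpu_pool:
    pending = None
    for batch in batches:
        outputs = gpu_generate(batch)            # GPU works on batch N
        if pending is not None:
            results.extend(pending.result())     # collect CPU results for batch N-1
        pending = cpu_pool.submit(cpu_postprocess, outputs)
    if pending is not None:
        results.extend(pending.result())

print(results)
```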

freckletonj commented 9 months ago

:+1: Batch inference would be a big unlock for synthetic data generation.

edit: in the meantime, outlines offers constrained generation and batch inference.
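
For reference, a rough sketch of that workaround with the 0.x-era outlines API (call names may differ between versions; model name is illustrative):

```python
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")  # assumed model

# A constrained generator; passing a list of prompts runs them as a batch.
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answers = generator([
    "Review: I loved this film. Sentiment:",
    "Review: Total waste of time. Sentiment:",
])
print(answers)
```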

CarloNicolini commented 6 months ago

Any idea on how to perform batch inference? This matters especially when applying guidance to many inputs in parallel.