oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Adjust prompt batch size for Exllama V2 #6016

Open Dampfinchen opened 4 months ago

Dampfinchen commented 4 months ago

By default, ExLlamaV2 uses a batch of 2048 tokens for prompt processing, which adds a ton of VRAM usage. On TabbyAPI and ExUI it is possible to set the prompt-processing batch to 1024 or 512. That decreases VRAM usage dramatically, making, for example, a 4-bit 8B model usable at 4K context in 6 GB of VRAM while still offering very fast speeds.
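For anyone hitting this in the meantime, the knob itself lives in exllamav2's own config. Below is a minimal sketch of loading with a smaller chunk, assuming exllamav2's Python API where `max_input_len` and `max_attention_size` control the prompt-processing batch (these are the fields TabbyAPI surfaces as `chunk_size`); the model path is a placeholder:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder model directory
config.prepare()

# Shrink the prompt-processing batch from the 2048 default. Smaller chunks
# process long prompts in more passes, trading some prompt speed for a much
# smaller temporary VRAM footprint.
chunk_size = 512
config.max_input_len = chunk_size
config.max_attention_size = chunk_size ** 2

model = ExLlamaV2(config)
model.load()
cache = ExLlamaV2Cache(model)
```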

ExLlamaV2 in Ooba is currently not usable on my configuration because of the default batch size.

@oobabooga Please allow users to set the batch size for ExLlamaV2 prompt processing, as is already possible with llama.cpp.
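The requested change would mostly mean forwarding a user-set chunk size into the config the webui's ExLlamaV2 loader already builds. A hypothetical sketch of what that could look like inside `modules/exllamav2.py` (the `exllamav2_chunk_size` argument is an assumption, not an existing webui option):

```python
# Hypothetical addition to text-generation-webui's ExLlamaV2 loader.
# shared.args is the webui's parsed command-line namespace; the
# exllamav2_chunk_size flag does not exist yet and is illustrative only.
from modules import shared


def apply_chunk_size(config):
    """Clamp the prompt-processing batch to a user-provided chunk size."""
    chunk_size = getattr(shared.args, "exllamav2_chunk_size", None)
    if chunk_size:
        config.max_input_len = chunk_size
        config.max_attention_size = chunk_size ** 2
```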

xStarryNight commented 4 months ago

Agreed, I can't run most models because of this, so I have to use exui instead.

Ph0rk0z commented 4 months ago

Oh, interesting... we can change this now? There is also a reserved size for auto split.
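For context, the "reserved size for auto split" refers to exllamav2's autosplit loader, which can hold back a per-GPU amount of VRAM when distributing the weights. A minimal sketch, assuming `load_autosplit` accepts a `reserve_vram` list of byte counts per device (the amounts are illustrative):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder model directory
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache allocates during autosplit

# Hold back ~256 MiB on each of two GPUs so prompt-processing buffers and
# other consumers don't push the devices into OOM during the split.
model.load_autosplit(cache, reserve_vram=[256 * 1024**2, 256 * 1024**2])
```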