By default, ExLlamaV2 processes prompts in batches of 2048 tokens, which significantly increases peak VRAM usage during prompt processing. On TabbyAPI and ExGUI it is possible to lower the prompt-processing batch size to 1024 or 512, which decreases VRAM usage dramatically and makes, for example, a 4-bit 8B model usable at 4K context in 6 GB of VRAM while still offering very fast speeds.
ExLlamaV2 in Ooba is currently unusable on my configuration because of the default batch size.
@oobabooga Please allow users to set the batch size for ExLlamaV2 prompt processing, as is already possible with llama.cpp.
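For reference, here is roughly what this looks like when loading a model with the exllamav2 library directly. This is a minimal sketch, assuming the `ExLlamaV2Config` attributes `max_input_len` and `max_attention_size` (which control the prefill chunk size in exllamav2) behave as in current releases; the model path and the value 512 are illustrative, and this is not an existing Ooba setting:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder path
config.prepare()

# Shrink the prompt-processing (prefill) chunk from the default 2048
# to reduce peak VRAM; max_attention_size caps the attention buffer
# to match. Both attribute names are assumptions based on exllamav2's
# config, not options exposed by Ooba today.
config.max_input_len = 512
config.max_attention_size = 512 ** 2

model = ExLlamaV2(config)
model.load()
cache = ExLlamaV2Cache(model)
```

Exposing something equivalent to these two values as a loader option in the UI would be enough to make low-VRAM setups work.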