microsoft / LLMLingua

To speed up LLM inference and enhance LLMs' perception of key information, LLMLingua compresses the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License
4.27k stars 228 forks

CUDA out of memory #66

Open deltawi opened 6 months ago

deltawi commented 6 months ago

I have 4 RTX A5000 GPUs with 24GB of memory each, but when I run the example code:

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})

I get the error:

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

It does not seem to be able to run on multiple GPUs.
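
In case it helps, here is a rough sketch (assuming PyTorch is available in the same environment) of the check I run to see free memory on each GPU before loading the model:

import torch

# Print free / total memory for every visible GPU (values in GiB)
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")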

iofu728 commented 6 months ago

Hi @deltawi, if you use the GPTQ 7B model, you will need less than 8GB of GPU memory.

Additionally, if you need to use multiple GPUs, you can use the following command:

llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", device_map="balanced", model_config={"revision": "main"})
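
Once the compressor is loaded, prompt compression works the same way. A small sketch for reference (the prompt variable and the target_token value here are just placeholders):

# Compress a long prompt down to roughly 200 tokens
compressed = llm_lingua.compress_prompt(
    prompt,          # your original long prompt (placeholder)
    instruction="",
    question="",
    target_token=200,
)
print(compressed["compressed_prompt"])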