lyogavin / airllm

AirLLM 70B inference with single 4GB GPU
Apache License 2.0

Taking about 40 minutes to generate one sentence. Is this speed normal? #186

Open kingdoom1 opened 1 month ago

kingdoom1 commented 1 month ago

I have set both the input max length and the output max length to 128. The output speed is very slow, taking about 40 minutes to generate one sentence. I am using the Qwen-2.5 7B model. Is this speed normal? My GPU is an NVIDIA 3090 with 12GB of VRAM, and AirLLM is using around 5GB of it.
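
For context, AirLLM keeps VRAM usage low by streaming layer shards from disk for every forward pass, so generation is usually bounded by disk bandwidth rather than GPU compute. A rough back-of-envelope estimate (the disk throughput figure below is an assumption, not a measurement) suggests a time in this ballpark is expected:

```python
# Back-of-envelope estimate of layer-streaming cost (assumed numbers).
# AirLLM re-reads the model's layer shards from disk on each forward pass,
# so generating N tokens streams roughly N * model_size bytes from disk.
model_gb = 7e9 * 2 / 1e9   # Qwen2.5-7B in bf16: ~14 GB of weights
tokens = 128               # output max length from the report above
disk_gb_per_s = 1.0        # ASSUMED NVMe read throughput; SATA SSDs are far slower

total_read_gb = model_gb * tokens
minutes = total_read_gb / disk_gb_per_s / 60
print(f"~{total_read_gb:.0f} GB read from disk -> ~{minutes:.0f} minutes")
# ~1792 GB -> ~30 minutes, so ~40 minutes per sentence is plausible
```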

parsa-pico commented 1 month ago

Same here with an RTX A4000 using llama3:8b.

ggaaooppeenngg commented 1 month ago

I guess each split group of layers is around 4GB. Is there a way to load more groups at a time on GPUs that have more VRAM?
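
As far as I can tell AirLLM doesn't expose such a knob, but the idea would look something like the sketch below: keep `group_size` layer shards resident on the GPU at once, so each forward pass pays for `len(layer_paths) / group_size` disk round-trips instead of one per layer. Everything here (the shard format, `layer_paths`, `forward_in_groups`) is hypothetical and not the AirLLM API:

```python
import torch

def forward_in_groups(hidden, layer_paths, group_size=4, device="cuda"):
    """Hypothetical sketch: stream layer shards through the GPU in groups
    of `group_size` instead of one at a time, trading extra VRAM for
    fewer (and larger, more sequential) disk reads."""
    for i in range(0, len(layer_paths), group_size):
        # Load the next group of layers; assumes each shard was saved as a
        # pickled module via torch.save(layer_module, path).
        group = [torch.load(p, map_location=device, weights_only=False)
                 for p in layer_paths[i:i + group_size]]
        for layer in group:
            hidden = layer(hidden)   # run the resident layers back to back
        del group                    # release the group before loading the next one
        torch.cuda.empty_cache()
    return hidden
```

If I remember the project README correctly, AirLLM's prefetching (overlapping the next layer's disk read with the current layer's compute) attacks the same bottleneck from another angle.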