liltom-eth / llama2-webui

Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.
MIT License

chat too slow! #69

Closed · Hyingerrr closed 1 year ago

Hyingerrr commented 1 year ago

I am using Colab Pro and running on the GPU. When I execute the following code to ask a question, it takes about 50 seconds to respond, which is too slow. Is there any way to accelerate it?

```python
prompt = get_prompt("Please help me explain the TCP handshake")
res = llama2_wrapper(prompt)
print(res)
```

liltom-eth commented 1 year ago

Colab runs faster on the GPTQ backend, because Colab's CPU is super slow.

| Model | Precision | Device | GPU VRAM | Speed |
|---|---|---|---|---|
| Llama-2-7b-Chat-GPTQ | 4 bit | Google Colab T4 | 5.8 GB | 18.19 tokens/sec |
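
A minimal sketch of switching to the GPTQ backend, based on the `LLAMA2_WRAPPER` usage shown in this repo's README; the model directory path is an assumption, so point it at wherever you downloaded the 4-bit GPTQ weights:

```python
from llama2_wrapper import LLAMA2_WRAPPER, get_prompt

# Assumed local path to the 4-bit GPTQ weights; adjust to your download location.
llama2_wrapper = LLAMA2_WRAPPER(
    model_path="./models/Llama-2-7b-Chat-GPTQ",
    backend_type="gptq",  # run 4-bit GPTQ inference on the Colab T4 GPU
)

prompt = get_prompt("Please help me explain the TCP handshake")
print(llama2_wrapper(prompt))
```

With the GPTQ backend the forward pass stays on the GPU, so generation should be much faster than a setup that falls back to Colab's slow CPU.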