liltom-eth / llama2-webui

Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.
MIT License
1.97k stars 202 forks

Cant seem to run it on GPU #50

Closed rishabh-gurbani closed 1 year ago

rishabh-gurbani commented 1 year ago

I'm running this on a machine with an Nvidia A100, but it doesn't seem to make use of the GPU.

System specs: 4x Nvidia A100 80 GB, 540 GB of RAM

Benchmarks:

- Initialization time: 0.2208 seconds
- Average generation time over 5 iterations: 31.0348 seconds
- Average speed over 5 iterations: 5.0459 tokens/sec
- Average memory usage during generation: 4435.30 MiB
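A generic way to confirm whether the GPU is actually in use (not specific to this repo) is to watch `nvidia-smi` while a generation is running; this assumes the NVIDIA driver tools are installed:

```shell
# Watch GPU utilization and memory once per second while generating.
# If utilization stays at 0% during generation, inference is running on CPU.
watch -n 1 nvidia-smi

# Or query just the relevant fields:
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```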

rishabh-gurbani commented 1 year ago

Can you help me with this issue?

rishabh-gurbani commented 1 year ago

Running this basic model: `llama-2-7b-chat.ggmlv3.q4_0.bin`

liltom-eth commented 1 year ago

@rishabh-gurbani Hi, you can try running the `env_examples/.env.13b_example` or `env_examples/.env.7b_gptq_example` configurations on an A100 GPU. Here is a Colab example that runs a GPTQ model on a T4 GPU at 15.9851 tokens/sec.
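If it helps, my understanding of the usual workflow for switching configs in this repo is to copy one of the env examples to `.env` in the repo root and relaunch (a sketch, assuming the default `app.py` entry point):

```shell
# Assumed workflow: use the 7b GPTQ config instead of the ggml one,
# so the model runs on GPU rather than CPU.
cp env_examples/.env.7b_gptq_example .env
python app.py
```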

liltom-eth commented 1 year ago

Do not run ggml models on the server: ggml models run only on the CPU (without GPU acceleration), and the server's CPU is very slow.
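For completeness, a hedged aside: ggml files *can* be partially GPU-accelerated through `llama-cpp-python` if it was compiled with cuBLAS support, by offloading layers with `n_gpu_layers`. This is a sketch of that generic API (the model path is the one from this thread; layer count is an assumption), not something this repo necessarily wires up:

```python
# Assumes llama-cpp-python was installed with cuBLAS enabled, e.g.:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.ggmlv3.q4_0.bin",
    n_gpu_layers=40,  # hypothetical: number of layers to offload; 0 = pure CPU
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```

Without GPU offload (the default, `n_gpu_layers=0`), everything stays on CPU, which matches the slow speeds reported above.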

rishabh-gurbani commented 1 year ago

Alright, will try and get back, thanks!