mlc-ai / web-llm

High-performance In-browser LLM Inference Engine
https://webllm.mlc.ai
Apache License 2.0

weird observation #177

Closed · earonesty closed this 1 year ago

earonesty commented 1 year ago

Amazing: I'm running Vicuna 7B in the browser and getting pretty decent performance. For comparison, I decided to spin up a P2 instance and see how a K80 runs Vicuna 7B... and it's slower. What? An expensive Tesla is slower than my AMD Radeon? Yep, Vicuna 7B inference is about 3x faster on my laptop running WebGPU. I double-checked that I'm running the same quantized model, and double-checked that PyTorch is really using the GPU (nvidia-smi shows 99% utilization).
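
For reference, here's roughly how the PyTorch side can be sanity-checked and timed. This is a minimal sketch, not my exact script: the model id, dtype, prompt, and loading path are placeholders assuming a standard transformers checkpoint.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# First, make sure PyTorch actually sees the CUDA device before blaming the card.
assert torch.cuda.is_available(), "CUDA not visible to PyTorch"
print(torch.cuda.get_device_name(0))  # e.g. "Tesla K80"

# Placeholder checkpoint; substitute the quantized Vicuna 7B build you actually tested.
model_id = "lmsys/vicuna-7b-v1.5"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "Explain WebGPU in one paragraph."
inputs = tok(prompt, return_tensors="pt").to("cuda")

# Time only the generation, with explicit syncs so the GPU work is fully counted.
torch.cuda.synchronize()
t0 = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - t0

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```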

OK, let's compare against the GeForce RTX accelerator in my gaming laptop. I convinced Chrome to see it instead of my AMD... and inference is also slower there.

Why is my built-in, cheapo AMD Radeon doing inference so much faster via WebGPU? Should I run out and buy a beefier Radeon?

earonesty commented 1 year ago

Well, I confirmed this with llama-cpp-python: my built-in Radeon is just better at inference. It also has direct access to all of the system RAM (32 GB), so it can load the whole model into memory, unlike the NVIDIA card, which runs out of VRAM and has to offload work to the CPU. The ability to "borrow" slower system RAM is amazing for inference.
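
To put rough numbers on the offload effect, here's a minimal llama-cpp-python sketch comparing full vs. partial GPU offload. The model path, prompt, and layer counts are placeholders; point it at whatever quantized Vicuna 7B file you have and adjust n_gpu_layers for your card.

```python
import time

from llama_cpp import Llama

# Placeholder path; point this at the quantized Vicuna 7B file you actually have on disk.
MODEL_PATH = "./vicuna-7b.q4_0.gguf"
PROMPT = "Explain WebGPU in one paragraph."

def tokens_per_sec(n_gpu_layers: int) -> float:
    """Load the model with the given number of layers offloaded to the GPU and time a completion."""
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=n_gpu_layers, verbose=False)
    t0 = time.time()
    out = llm(PROMPT, max_tokens=128)
    elapsed = time.time() - t0
    return out["usage"]["completion_tokens"] / elapsed

# All layers on the GPU vs. a partial offload (layer counts are illustrative for a 7B model).
print("full offload   :", tokens_per_sec(35))
print("partial offload:", tokens_per_sec(20))
```

The full-offload case is what the Radeon gets to run, since it can borrow system RAM; the partial-offload case is roughly what the NVIDIA card falls back to once its VRAM runs out.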