toverainc / willow-inference-server

Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS
Apache License 2.0

Question about Hardware #145

Open guarddog13 opened 6 months ago

guarddog13 commented 6 months ago

I'm looking to upgrade my server as I want to run WIS locally, and maybe alongside an LLM that WAC can fall back to. Will the GPU in the desktop I've linked below work for WIS? I run Frigate on my GPU too, but that's currently a 2GB K620 lol, so I'm not worried about Frigate. I offload my Plex transcoding to the Shield, so the GPU will be mostly free for WIS.

https://www.amazon.com/gp/aw/d/B0BMW1F5DC/ref=ox_sc_saved_image_1?smid=A1GPHE9L2B1AYL&psc=1

nikito commented 6 months ago

In terms of compute it would probably be fine, just note the 6GB VRAM could get a little tight if you want to use the large model, for instance, as well as more advanced TTS models that we will eventually integrate (Coqui, XTTS2). For reference, I am sitting at 4.788GB running large-v2 and XTTS2. Note this is also on an Ada Lovelace GPU, which supports different quantization than the Turing architecture of the 1660, so it may take a little more VRAM in that case (on a GTX 1070 I saw my VRAM usage spike as high as 6.5GB with these same models, but that is Pascal architecture).
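
If you want to sanity-check this on your own card before buying, something like the sketch below gives a rough before/after VRAM reading. It's not part of WIS; it just assumes faster-whisper (which uses the same ctranslate2 engine) and nvidia-ml-py are installed, and the model name, compute type, and audio file are placeholders:

```python
# Rough VRAM check for a Whisper model/quantization combo (illustrative only).
import pynvml
from faster_whisper import WhisperModel

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mib() -> float:
    # Total VRAM in use on GPU 0, in MiB
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2

before = used_mib()

# int8_float16 works on newer architectures; older cards (Pascal, Turing)
# may fall back to a wider type and use more memory, as noted above.
model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
segments, info = model.transcribe("sample.wav")
list(segments)  # transcription is lazy; force it to actually run

print(f"Approx. VRAM used by this model/run: {used_mib() - before:.0f} MiB")
```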

guarddog13 commented 6 months ago

> In terms of compute it would probably be fine, just note the 6GB VRAM could get a little tight if you want to use the large model, for instance, as well as more advanced TTS models that we will eventually integrate (Coqui, XTTS2). For reference, I am sitting at 4.788GB running large-v2 and XTTS2. Note this is also on an Ada Lovelace GPU, which supports different quantization than the Turing architecture of the 1660, so it may take a little more VRAM in that case (on a GTX 1070 I saw my VRAM usage spike as high as 6.5GB with these same models, but that is Pascal architecture).

Hmmmm. I'm trying to upgrade my circa-2018 SFF and stay within a ~$500 budget.

Using remote WIS I get near-instant results with HA... faster than even what my local HA Assist was doing. I don't want to lose this speed by going local. Maybe get the linked one and find a used 1070 to add to it? The SFF could never handle a GPU that needs its own power connector. I'd eventually like to run an LLM next to it so I can cut my reliance on the cloud and, more importantly, Google. I'm not worried about the speed of the LLM, as I've found smaller models that respond in 3-5 seconds on the SFF... I'm not terribly interested in speed with an AI. I do need the speed with WIS, however.

Any ideas, or do you know of anywhere selling a refurbished desktop with a 1070 or better?

nikito commented 6 months ago

The system linked may do fine; like I said, in terms of compute it would outperform a 1070, it just comes down to VRAM. Given your goals I am not sure getting a 1070 on top of this system would make much sense; you'd be better off getting another 1660 or something like that down the line, I think. If you plan to run local LLMs on top of WIS you'd definitely want more VRAM, as most 7B models will use something like 5-7GB even with quantization, never mind the VRAM used by context and such. GPU speed and memory speed just improve your tokens/second, so if you don't care about speed then the real focus is just getting more VRAM.
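
As a back-of-the-envelope check (illustrative numbers, not a WIS measurement), weight memory alone scales with parameter count times bytes per weight, and the KV cache and runtime overhead come on top of that:

```python
# Rough VRAM needed just for LLM weights; context (KV cache), activations,
# and framework overhead add to this.
def weight_vram_gib(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * (bits_per_weight / 8) / 1024**3

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_vram_gib(7, bits):.1f} GiB for weights alone")
# 16-bit: ~13.0, 8-bit: ~6.5, 4-bit: ~3.3 GiB -- which is why 5-7GB total
# with context is realistic for 7B models, and a 6GB card gets tight fast.
```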

I'm not too familiar with sites that sell refurbished desktops; I tend to build my own systems 😁 But I have seen users get a used OptiPlex system, put a 1070 in it, and come in well below $500 in terms of cost.

guarddog13 commented 6 months ago

I actually had an LLM running on the SFF with the K620. Ssssslowly, but it worked. It was a model I found an HA community user running with text-based-ui. They were able to get it to work on a Pi 4, but very slowly. It's not the best model, but it works for my purposes.

Have you tested WIS on the Pi 5? I wonder how it would do.

While I have you here, is there any way to use multinet with WAC? Or do you plan to integrate WAC into WAS at some point?

kristiankielhofner commented 6 months ago

> Have you tested WIS on the Pi 5? I wonder how it would do.

WIS and faster-whisper use the same underlying engine (ctranslate2). WIS is slightly faster because of some other optimization work we've done, plus it's optimized for latency and short speech segments. I'm not aware of any projects that do Whisper with any kind of special acceleration on ARM platforms other than NEON, which is just the vector instruction set for ARM (ctranslate2 already uses it). Ctranslate2 has other acceleration frameworks for x86_64 CPUs, but it doesn't even support AMD or Intel GPUs, and they've gone on record saying they haven't even considered it...

If you look at the comparison benchmarks and look around online, it seems the Pi 5 is roughly twice as fast as the Pi 4 with Whisper.

So for our standard 3.8 second test speech segment and the minimum recommended model (small), transcription still takes roughly 25 seconds on the Pi 5. A GTX 1070 does medium in 424ms, and medium is significantly "slower" than small. It's just not even close and never will be, so we have no plans to support these ARM-based platforms with WIS.
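
For anyone who wants to reproduce numbers like these on their own hardware, a minimal timing sketch with faster-whisper (same ctranslate2 engine; the model size, device, and audio file below are placeholders, not the exact WIS benchmark setup) looks roughly like this:

```python
# Time a single short transcription and report the realtime factor (illustrative).
import time
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")  # or device="cuda"

start = time.perf_counter()
segments, info = model.transcribe("speech_3.8s.wav", beam_size=1)
text = " ".join(seg.text for seg in segments)  # generator; this forces the decode
elapsed = time.perf_counter() - start

print(f"audio={info.duration:.1f}s wall={elapsed:.2f}s "
      f"realtime_factor={info.duration / elapsed:.2f}x")
print(text)
```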

We're removing all support for multinet in upcoming versions. We've found it to be impractical for typical speech commands (proper nouns of entity names, etc).

WAC is already integrated in WAS in a development branch and it will be in an upcoming release candidate.