toverainc / willow-inference-server

Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS
Apache License 2.0

Error when running download_models.sh #89

Closed - nirnachmani closed this issue 1 year ago

nirnachmani commented 1 year ago

I receive the following error when running download_models.sh:

```
Using configuration overrides from .env file
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
```

I don't have a dedicated GPU and was hoping to run it on the CPU. Do I receive this error because I don't have a GPU? Is it possible to run WIS without a GPU currently? Or there is another issue?
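
If it helps narrow things down, I assume the same failure would reproduce with any container run that requests GPU access, which would point at the missing NVIDIA driver rather than WIS itself (a sketch - the CUDA image tag is just an example):

```
# Hedged check: request GPU access from a generic CUDA image. On a machine
# with the NVIDIA container toolkit installed but no driver loaded, this
# should fail with the same "nvml error: driver not loaded" message.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```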

nirnachmani commented 1 year ago

Reading through the README, I get conflicting messages about running WIS without a GPU.

So, is it possible to run without a GPU? Is the error I am getting (in first post) related to not having a GPU? Or is it something else? How do I go about setting WIS up without GPU?

kristiankielhofner commented 1 year ago

Sorry for the slow response - I somehow missed your first message!

We will soon be releasing WIS 1.0 with drastically improved support for CPU-only configurations. I suggest trying the pre-release:


```
git checkout wisng
```

Follow the guide in the README and WIS should start without issue on your configuration.
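
For completeness, a rough sketch of the full sequence on a fresh clone (the utils.sh subcommands are the ones used elsewhere in this thread; check the README for the exact order):

```
# Sketch of the wisng pre-release setup, assuming a fresh clone.
git clone https://github.com/toverainc/willow-inference-server.git
cd willow-inference-server
git checkout wisng
./utils.sh install          # build/prepare the containers
./utils.sh download-models  # fetch the models WIS needs
./utils.sh run              # start WIS
```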

nirnachmani commented 1 year ago

Thanks.

I ran git checkout wisng and then followed the rest of the instructions. However, when I run ./utils.sh install I get a similar error:

```
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
```

Then, if I try to run ./utils.sh run I get:

```
Using configuration overrides from .env file
Models not found. You need to run ./utils.sh download-models - exiting
```

And if I try to run ./utils.sh download-models I get the same message:

```
Using configuration overrides from .env file
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
```

Is this related to my previous attempt to run the older version? I first deleted the old directory and re-cloned before checking out wisng.

nirnachmani commented 1 year ago

It's working now - under "Detect GPU support" in utils.sh I forced:

```
DOCKER_GPUS=""
DOCKER_COMPOSE_FILE="docker-compose-cpu.yml"
```
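
Presumably the auto-detection looks something like the sketch below (a guess at its shape - the real utils.sh may differ), and hard-coding the two variables skips the NVIDIA container hook entirely:

```
# Hypothetical shape of the GPU detection being bypassed - not the
# actual utils.sh code. The GPU-path values are assumptions.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    DOCKER_GPUS="--gpus all"
    DOCKER_COMPOSE_FILE="docker-compose.yml"
else
    DOCKER_GPUS=""
    DOCKER_COMPOSE_FILE="docker-compose-cpu.yml"
fi
```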

nirnachmani commented 1 year ago

Well, it is working. I'm probably not telling you anything you don't know, but it is significantly slower than the Tovera-hosted best-effort WIS that you provide. My WIS is running on an Intel NUC with an i7-6770HQ and it takes a good 14 seconds to respond to a simple command (turn off a light), compared to 2 seconds with the hosted WIS...

The issue is, I can't put a GPU in the NUC, and an eGPU enclosure makes the whole thing much more expensive.

Anyway, great work and thank you very much for working on the project.

kristiankielhofner commented 1 year ago

A couple of things:

1) If you could, please test the wisng branch - it has a dramatically improved version of WIS with much better CPU support (auto-detect, no hacks necessary). This is the version for the upcoming Willow/WIS/WAS 1.0 release.

2) We talk about this a bit in the README: GPUs are fundamentally better (from a physical architecture standpoint) at tasks like speech recognition. A $100, six-year-old Nvidia GPU will handily beat the fastest CPUs on the market at a fraction of the cost and power. We use the same fundamental implementation as faster-whisper (ours is slightly faster in my testing), and there is another CPU-optimized implementation called whisper.cpp. WIS, faster-whisper, and whisper.cpp (the fastest Whisper implementations in the world) on CPU cannot come remotely close to a GPU. Alexa uses GPUs. Google Home uses GPUs (possibly TPUs). Siri uses the Apple Neural Engine (with dedicated silicon support) on device and almost certainly GPUs in the cloud.

You simply cannot match the speech recognition quality and speed of these commercial devices while running on CPU. I understand a GPU is infeasible with a NUC, but "it is what it is".

That said, you at least have a decent CPU (there are approaches trying to do voice assistants on a Raspberry Pi, which is ridiculous and a non-starter in my opinion). With wisng you can try different models that trade accuracy for speed.

In your Willow Inference Server URL configuration you can append the model parameter:

```
http://your-willow-host:your-port?model=your-model-to-try
```

Where model can be (in order from highest quality/slowest to lowest quality/fastest): large, medium (our default), small, base, or tiny (the default for the Rhasspy/Home Assistant implementation). As you go down the list, speed improves dramatically but quality drops dramatically too.
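
For example, keeping the placeholder host/port from above:

```
# Hypothetical URLs - substitute your real host and port.
http://your-willow-host:your-port?model=medium   # default, best balance
http://your-willow-host:your-port?model=small    # faster, lower quality
http://your-willow-host:your-port?model=tiny     # fastest, lowest quality
```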

We'd definitely be interested to hear your feedback. Benchmarking on CPU is extremely hard: compared to GPUs (a Tesla P4 is a Tesla P4, a GTX 1070 is more or less a GTX 1070), there are so many CPU variants, memory configurations, etc. that it's very difficult to predict performance.

kristiankielhofner commented 1 year ago

Update: I added some CPU benchmarks to the benchmarks table in wisng. As you can see, the fastest CPU I have available (an AMD Threadripper PRO 5955WX) is 5x slower than a GTX 1070 at our default settings (model medium, beam 1). It's not until you get to the base model on this CPU that you can meet our latency goal of sub-1-second (local) processing times. You're likely seeing roughly two seconds on our hosted implementation because it carries other load and there is internet latency involved.

To give you an idea of what an admittedly absurd local GPU can do: in my home environment (with an RTX 3090) I see less than 300 ms (current record: 212 ms) between end of speech and Home Assistant completing the action. Even a GTX 1070 can do less than 500 ms.

nirnachmani commented 1 year ago

Thank you for the information. I did read about the poor performance with CPU so I had low expectations; however, I didn't expect 14 seconds. By the way, this was with the wisng branch, I believe - I ran git checkout wisng before going through the setup. I'll try different models as you suggested to see what difference they make. Maybe eventually I'll invest in an eGPU enclosure and a GTX 1070.

kristiankielhofner commented 1 year ago

From what I can tell you must have an older version - did you git pull as well?

Yes, for highly parallel tasks like speech recognition, CPU performance is fundamentally terrible by comparison. The GTX 1070 has 1920 cores and 256 GB/s of memory bandwidth; the RTX 3090 has 10496 cores and 936 GB/s.

It's not even close - this is why we emphasize "just use a GPU" so heavily.