
#+STARTUP: showeverything

** whisper.el

Speech-to-Text interface for Emacs using OpenAI's [[https://github.com/openai/whisper][whisper speech recognition model]]. For the inference engine it uses the awesome C/C++ port [[https://github.com/ggerganov/whisper.cpp][whisper.cpp]], which can run on consumer-grade CPUs (without requiring a high-end GPU).

You can capture audio with your local input device (microphone), or choose a media file on disk in your local language, and have the transcribed text inserted into your Emacs buffer (optionally after translating it to English). This runs offline, without resorting to a non-free cloud service, and gives decent results (though Whisper's output quality varies widely by language; see below).

*** Install and Usage

Aside from a C++ compiler (needed to compile whisper.cpp), your system needs =FFmpeg= for recording audio.

You can install =whisper.el= by cloning this repo somewhere, and then use it like:

#+begin_src elisp
(use-package whisper
  :load-path "path/to/whisper.el"
  :bind ("C-H-r" . whisper-run)
  :config
  (setq whisper-install-directory "/tmp/"
        whisper-model "base"
        whisper-language "en"
        whisper-translate nil
        whisper-use-threads (/ (num-processors) 2)))
#+end_src

You will use these functions:

- =whisper-run=: toggles recording audio from your input device and transcribing it
- =whisper-file=: transcribes a media file on disk that you choose

Invoking =whisper-run= with a prefix argument (=C-u=) has the same effect as =whisper-file=.

Both of these functions automatically compile the whisper.cpp dependency and download the language model the first time they are run. When recording is in progress, invoking them stops it and starts transcription. Otherwise, if a compilation, model download, or transcription job is in progress, calling them again cancels it.

Note for macOS users: if whisper.el fails silently, it might be because Emacs doesn't have permission to use the microphone. Follow the [[https://github.com/natrys/whisper.el/wiki/MacOS-Configuration#grant-emacs-permission-to-use-mic][recipe]] in the wiki to grant it explicitly.

*** Variables

The variables shown in the configuration example above control the install location, model, language, translation behaviour, and thread count. Additionally, depending on your input device and system, you may need to modify recording-related variables (such as =whisper--ffmpeg-input-device=) to get recording to work.

PulseAudio and PipeWire users who haven't further configured their =default= source may find that results improve when at least the =echo cancel= filter is enabled, by loading the relevant module. You can then either set that filtered source as the default (using e.g. =pactl=) or use that source's name as =whisper--ffmpeg-input-device=, as in the sketch below. Standalone noise-suppression programs can further improve recording quality in general.
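For illustration, a minimal sketch of pointing whisper.el at an echo-cancelled PulseAudio source. The source name =echo-cancel-source= is a hypothetical placeholder; list your actual sources with =pactl list sources short=.

#+begin_src elisp
;; Sketch, assuming PulseAudio's echo-cancel module is loaded
;; (e.g. with: pactl load-module module-echo-cancel).
;; "echo-cancel-source" is a hypothetical source name; substitute
;; the real one reported by `pactl list sources short'.
(setq whisper--ffmpeg-input-device "echo-cancel-source")
#+end_src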

*** Hooks

There are a few hooks provided for registering user-defined actions. For example, you can post-process the transcription output:

#+begin_src elisp
;; add a paragraph break every 5 sentences
(add-hook 'whisper-post-process-hook
          (lambda () (whisper--break-sentences 5)))
#+end_src

Or run an action after the transcribed text has been inserted into the buffer:

#+begin_src elisp
(add-hook 'whisper-after-insert-hook
          'pipe-transcribed-audio-to-foo)
#+end_src

*** Performance Guide for Advanced Users

By default, whisper.cpp performance on CPU is likely good enough for most people and most use cases. However, if it isn't good enough for you, here are some things you could try:

**** Update whisper.cpp

The upstream whisper.cpp project is continuously improving. If you are using an old version, updating whisper.cpp is the first thing to try. The simplest way is to delete your whisper.cpp installation folder and invoke =whisper-run= again, which will reinstall it from the latest commit.
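If you prefer to do this from within Emacs, here is a sketch; it assumes whisper.cpp is checked out in a =whisper.cpp= subdirectory of =whisper-install-directory=, which is an assumption about the install layout.

#+begin_src elisp
;; Assumed layout: whisper.cpp checked out under
;; `whisper-install-directory'.  Deleting it forces the next
;; `whisper-run' to reinstall from the latest upstream commit.
(delete-directory
 (expand-file-name "whisper.cpp" whisper-install-directory)
 t) ;; t = delete recursively
#+end_src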

**** Quantize the model

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types. It sacrifices precision for resource efficiency. The idea is that a quantized version of a bigger model may allow you to use it at all (if you are RAM-constrained, for example) at some cost to accuracy, while hopefully still being more accurate than the smaller model you would otherwise use.
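If your version of whisper.el exposes a =whisper-quantize= option (an assumption; check your version's documentation), enabling quantization might be as simple as:

#+begin_src elisp
;; Assumption: `whisper-quantize' exists in your whisper.el version
;; and accepts a quantization format symbol such as q5_0.
(setq whisper-quantize 'q5_0)
#+end_src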

**** Re-compile whisper.cpp for hardware acceleration

Offloading the encoder inference to hardware or optimised external libraries may result in a speed-up. The options include: Core ML (for Apple hardware), cuBLAS (for NVIDIA GPUs), OpenVINO (for Intel CPUs/GPUs), CLBlast (for GPUs that support OpenCL), and OpenBLAS (an optimised matrix-processing library for CPUs). Consult the [[https://github.com/ggerganov/whisper.cpp][whisper.cpp README]] for how to re-compile whisper.cpp with these enabled.

**** Use something other than whisper.cpp

If there is something else you want to use, you have the option to override the =whisper-command= function definition, or to define an overriding advice. This function takes the path to the input audio file as its argument, and returns a list denoting the command to run instead of whisper.cpp (compatible with the =:command= argument to [[https://www.gnu.org/software/emacs/manual/html_node/elisp/Asynchronous-Processes.html][make-process]]). You can use the variables described earlier in this readme to devise the command. The wiki [[https://github.com/natrys/whisper.el/wiki/Setup-to-use-whisper%E2%80%90ctranslate2-instead-of-whisper.cpp][contains a recipe]] showing how to use [[https://github.com/Softcatala/whisper-ctranslate2][whisper-ctranslate2]] with whisper.el. That client is compatible with OpenAI's original one, so porting the recipe to the original client should be possible.
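For illustration, a minimal sketch of such an override. The program name =my-whisper-cli= and its flags are hypothetical placeholders, not a real CLI; substitute the actual invocation of your chosen engine.

#+begin_src elisp
;; Sketch of overriding `whisper-command' to run a different engine.
;; "my-whisper-cli" and its flags are hypothetical; replace them with
;; the real program and arguments.  The returned list is passed as the
;; :command argument to `make-process'.
(defun whisper-command (input-file)
  "Return the transcription command for INPUT-FILE as a list of strings."
  (list "my-whisper-cli"
        "--model" whisper-model
        "--language" whisper-language
        input-file))
#+end_src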

Note that when you use something other than whisper.cpp, the onus is on you to make sure the target program is properly installed and that the relevant model files have been downloaded beforehand. We don't support anything other than whisper.cpp, so problems integrating other software with whisper.el that are specific to that software may be beyond our ability to address.

*** Caveats