:warning: This project is in its early stages and is currently under active development. Features are unstable, code is messy, and breaking changes will occur. The main goal of this stage is to build a minimum viable prototype using technologies that are easy to integrate.
:warning: This project currently has a lot of issues on Windows. In theory, it should all work, but many Windows users are running into dependency problems. Those issues will probably be fixed in the future, but for now Windows support requires testing and debugging. If you have a Mac or a Linux machine, use it instead for the time being. Join the Discord server if you are having trouble or just want to talk.
:warning: If you want to run this program on a server and access it remotely from your laptop, the microphone on the front end will only launch in a secure context (a.k.a. https or localhost). See the MDN Web Docs. Therefore, you might want to either configure https with a reverse proxy or launch the front end locally and connect to the server via WebSocket (untested): open `static/index.html` with your browser and set the ws url on the page.
Open-LLM-VTuber allows you to talk to any LLM by voice (hands-free) locally with a Live2D talking face. The LLM inference backend, speech recognition, and speech synthesizer are all designed to be swappable. This project can be configured to run offline on macOS, Linux, and Windows.
Long-term memory with MemGPT can be configured to achieve perpetual chat, infinite* context length, and external data sources.
This project started as an attempt to recreate the closed-source AI VTuber neuro-sama
with open-source alternatives that can run offline on platforms other than Windows.
https://github.com/t41372/Open-LLM-VTuber/assets/36402030/e8931736-fb0b-4cab-a63a-eea5694cbb83
Currently supported LLM backends
Currently supported Speech recognition backends

Set `MIC_IN_BROWSER` to true in the `conf.yaml` to move the microphone (and voice activity detection) to the browser, at the cost of some latency for now. You might want to use the microphone on your client (the browser) rather than the one on your server if you run the backend on a different machine or inside a VM or Docker.
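A hedged `conf.yaml` excerpt for this option (the key name appears elsewhere in this README; the file's surrounding layout is assumed):

```yaml
# conf.yaml excerpt: use the browser's microphone and voice activity detection
MIC_IN_BROWSER: true
```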
Currently supported Text to Speech backends

Fast Text Synthesis
Live2D Talking face
Change the Live2D model in `config.yaml` (the model needs to be listed in `model_dict.json`); see `doc/live2d.md` for documentation.

Live2D technical details: Live2D models live in the `live2d-models` directory. The default `shizuku-local` is stored locally and works offline. If the URL property of a model in `model_dict.json` is a URL rather than a path starting with `/live2d-models`, the model will have to be fetched from the specified URL whenever the front end is opened. Read `doc/live2d.md` for documentation on loading your Live2D model from a local path.

Run `server.py` to start the WebSocket communication server, open `index.html` in the `./static` folder to open the front end, and run `main.py` (formerly `launch.py`) to start the backend for LLM/ASR/TTS processing.

New installation instructions are being created here.
1. Install FFmpeg on your computer.
2. Clone this repository.
3. You need to have Ollama or any other OpenAI-API-Compatible backend ready and running. If you want to use MemGPT as your backend, scroll down to the MemGPT section.
4. Prepare the LLM of your choice. Edit the BASE_URL and MODEL in the project directory's `conf.yaml` (see the hedged excerpt after this list).
5. This project was developed using Python 3.10.13. I strongly recommend creating a virtual Python environment like conda for this project.
6. Run the following in the terminal to install the dependencies:

```
pip install -r requirements.txt # Run this in the project directory
# Install speech recognition and text-to-speech dependencies according to the instructions below
```
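For step 4, here is a hedged `conf.yaml` excerpt assuming a local Ollama endpoint (the BASE_URL and MODEL key names come from this README; the values and the exact layout are illustrative, so adjust them to your backend):

```yaml
# conf.yaml excerpt: point the project at an OpenAI-API-compatible backend
BASE_URL: "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint
MODEL: "llama3:latest"                 # a model your backend actually serves
```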
This project, by default, launches the audio interaction mode, meaning you can talk to the LLM by voice, and the LLM will talk back to you by voice.
Edit the `conf.yaml` for configurations. You can follow the configuration used in the demo video.
If you want to use Live2D, run `server.py` to launch the WebSocket communication server and open the URL you set in `conf.yaml` (`http://HOST:PORT`). By default, go to `http://localhost:8000`.
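A hedged excerpt of the address settings (the HOST and PORT key names are inferred from the `http://HOST:PORT` placeholder above; the default values come from the text):

```yaml
# conf.yaml excerpt: assumed key names for the server address
HOST: "localhost"
PORT: 8000
```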
Run `main.py` (formerly `launch.py`) with Python. Some models will be downloaded during your first launch, which may take a while. Also, the Live2D models have to be fetched through the internet, so you'll have to keep your internet connection on until the `index.html` is fully loaded with your desired Live2D model.
Back up the configuration file `conf.yaml` if you've edited it, and then update the repo. Or just clone the repo again and make sure to transfer your configurations. The configuration file will sometimes change because this project is still in its early stages. Be cautious when updating the program.
Edit the `ASR_MODEL` settings in the `conf.yaml` to change the provider.
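A hedged excerpt (the provider names are those from the list below; the exact spelling in your copy of the file may differ):

```yaml
# conf.yaml excerpt: pick one speech recognition provider
ASR_MODEL: "Faster-Whisper"  # or: FunASR, WhisperCPP, Whisper, AzureASR
```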
Here are the options you have for speech recognition:
- FunASR (local): runs very fast even on CPU (not sure how they did it). Install with `pip install -U funasr modelscope huggingface_hub`. Also, make sure you have torch (torch>=1.13) and torchaudio; install them with `pip install torch torchaudio`.
- Faster-Whisper (local)
- WhisperCPP (local): runs super fast on a Mac if configured correctly. Install with `pip install pywhispercpp`. For the CoreML configuration: uninstall `pywhispercpp` if you have already installed it, because we will be building the package. Run `install_coreml_whisper.py` with Python to automatically clone and build the CoreML-supported `pywhispercpp` for you. Then, in `conf.yaml`, if the CoreML model's name looks like `ggml-base-encoder.mlmodelc`, just put `base` into the `model_name` under the `WhisperCPP` settings (see the sketch after this list).
- Whisper (local): install with `pip install -U openai-whisper`.
- AzureASR (online, API key required): install with `pip install azure-cognitiveservices-speech`.
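As a hedged illustration of the WhisperCPP CoreML naming rule above (the `WhisperCPP` section and `model_name` key appear in this README; the indentation and surrounding layout are assumed):

```yaml
# conf.yaml excerpt
WhisperCPP:
  model_name: "base"  # for a CoreML model named ggml-base-encoder.mlmodelc
```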
Install the respective package and turn it on using the `TTS_MODEL` option in `conf.yaml`.
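A hedged excerpt (the engine names are those from the list below; exact spelling assumed):

```yaml
# conf.yaml excerpt: pick one text-to-speech engine
TTS_MODEL: "edgeTTS"  # or: pyttsx3TTS, meloTTS, barkTTS, cosyvoiceTTS, AzureTTS
```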
- pyttsx3TTS (local, fast): install with `pip install py3-tts`. This package uses the default TTS engine on your machine: `sapi5` on Windows, `nsss` on Mac, and `espeak` on other platforms. `py3-tts` is used instead of the more famous `pyttsx3` because `pyttsx3` seems unmaintained, and I couldn't get the latest version of `pyttsx3` working.
- meloTTS (local, fast): if you have trouble installing `mecab-python`, try this fork (hasn't been merged into main as of July 16, 2024).
- barkTTS (local, slow): install with `pip install git+https://github.com/suno-ai/bark.git` and turn it on in `conf.yaml`.
- cosyvoiceTTS (local, slow): edit `conf.yaml` to match your desired configurations. Check their WebUI and the API documentation on the WebUI to see the meaning of the configurations under the `cosyvoiceTTS` setting in `conf.yaml`.
- edgeTTS (online, no API key required): install with `pip install edge-tts` and turn it on in `conf.yaml`.
- AzureTTS (online, API key required): create a file named `api_keys.py` in the project directory, paste the following text into the file, and fill in the API keys and region you gathered from your Azure account.

```python
# Azure API key
AZURE_API_Key="YOUR-API-KEY-GOES-HERE"

# Azure region
AZURE_REGION="YOUR-REGION"

# Choose the Text to speech model you want to use
AZURE_VOICE="en-US-AshleyNeural"
```
If you're using macOS, you need to enable the microphone permission of your terminal emulator (you run this program inside your terminal, right? Enable the microphone permission for your terminal). If you fail to do so, speech recognition will not be able to hear you because your terminal does not have permission to use the microphone.
MemGPT integration is very experimental and requires quite a lot of setup. In addition, MemGPT requires a powerful LLM (larger than 7B and with quantization above Q5) and a large token footprint, which means it's a lot slower. MemGPT does have its own free LLM endpoint, though, which you can use to test things. Check their docs.
This project can use MemGPT as its LLM backend. MemGPT enables LLM with long-term memory.
To use MemGPT, you need to have the MemGPT server configured and running. You can install it using `pip` or `docker`, or run it on a different machine. Check their GitHub repo and official documentation.
:warning: I recommend you install MemGPT either in a separate Python virtual environment or in docker because there is currently a dependency conflict between this project and MemGPT (on FastAPI, it seems). You can check this issue: Can you please upgrade typer version in your dependancies #1382.
Here is a checklist:
- Install memgpt and run the MemGPT server using the `memgpt server` command. Remember to have the server running before launching Open-LLM-VTuber.
- Set up your agent and add the Live2D expression prompt and keywords (found in `model_dict.json`) into MemGPT.
- Copy the `server admin password` and the `Agent id` into `./llm/memgpt_config.yaml`.
- Set the `LLM_PROVIDER` to `memgpt` in `conf.yaml` (see the sketch after this list).
- Note that once you use `memgpt`, all LLM-related configurations in `conf.yaml` will be ignored because `memgpt` doesn't work that way.
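A hedged excerpt of the provider switch (the `LLM_PROVIDER` key comes from the checklist above; the file's layout is assumed):

```yaml
# conf.yaml excerpt: route all chat through the MemGPT server
LLM_PROVIDER: "memgpt"  # other LLM-related settings in this file are then ignored
```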
PortAudio Missing

Install `libportaudio2` on your computer via your package manager, like apt.

:warning: This is highly experimental, totally untested (because I use a Mac), and totally unfinished. If you are having trouble with all the dependencies, however, you can try to have trouble with the container instead, which is still a lot of trouble but a different set of trouble, I guess.
Current issues:
Setup guide:
Review `conf.yaml` before building (currently burned into the image, I'm sorry):

- Set `MIC_IN_BROWSER` to true (required because your mic doesn't live inside the container).

Build the image:
```
docker build -t open-llm-vtuber .
```

(Grab a drink, this may take a while.)
Run the container:
```
docker run -it --net=host -p 8000:8000 open-llm-vtuber "sh"
```
Inside the container, run `server.py` and `main.py` (formerly `launch.py`); use screen, tmux, or similar to run them simultaneously.

Open localhost:8000 to test.
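If you prefer Compose, here is a hedged equivalent of the run command above. This file does not ship with the project; it just mirrors the documented flags:

```yaml
# docker-compose.yml: hypothetical convenience wrapper, not part of the repo
services:
  open-llm-vtuber:
    build: .
    network_mode: host  # mirrors --net=host; explicit port mappings are ignored in this mode
    stdin_open: true    # -i
    tty: true           # -t
    command: sh
```

Run `docker compose run open-llm-vtuber` to get the same interactive shell.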
(this project is in the active prototyping stage, so many things will change)
Some abbreviations used in this project:

- LLM: Large Language Model
- TTS: Text-to-Speech
- ASR: Automatic Speech Recognition
- VAD: Voice Activity Detection

To add a new TTS provider:
- Implement `TTSInterface` defined in `tts/tts_interface.py`.
- Add it to `tts_factory`: the factory to instantiate and return the TTS instance.
- Add your configuration to `conf.yaml`. The dict with the same name will be passed into the constructor of your TTSEngine as kwargs.

To add a new ASR provider:
- Implement `ASRInterface` defined in `asr/asr_interface.py`.
- Add it to `asr_factory`: the factory to instantiate and return the ASR instance.
- Add your configuration to `conf.yaml`. The dict with the same name will be passed into the constructor of your class as kwargs.

To add a new LLM provider:
- Implement `LLMInterface` defined in `llm/llm_interface.py`.
- Add it to `llm_factory`: the factory to instantiate and return the LLM instance.
- Add your configuration to `conf.yaml`. The dict with the same name will be passed into the constructor of your class as kwargs (see the hedged sketch after these lists).
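As an illustration of the kwargs mechanism, here is a hedged excerpt; the provider name and its fields are hypothetical, purely for demonstration:

```yaml
# conf.yaml excerpt: "myTTS" and its fields are hypothetical
myTTS:
  voice: "en-US-Example"  # hypothetical option
  speed: 1.0              # hypothetical option
# The factory would then call, in effect:
#   MyTTSEngine(voice="en-US-Example", speed=1.0)
```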
Awesome projects I learned from