
Airis: Local Vtuber AI

Airis-VtuberAI is an open-source attempt to recreate the popular VTuber "Neuro-sama". The project uses no APIs and can run entirely locally, without an internet connection or a large amount of VRAM.

The project can transcribe the user's voice, generate a response, and synthesize text-to-speech output, with as little latency as reasonably possible while sacrificing as little quality as possible.
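At a high level, the loop is: record audio, transcribe it, feed the text to a local language model, and speak the reply. The sketch below illustrates that flow only; every function name in it is a hypothetical placeholder, not the project's actual API (see startup_scripts.py for the real code).

# Hypothetical sketch of the pipeline; all functions here are stand-ins,
# not the names used in startup_scripts.py.
def record_microphone() -> bytes: ...           # capture the user's voice
def transcribe(audio: bytes) -> str: ...        # speech-to-text (Whisper)
def generate_response(prompt: str) -> str: ...  # local language model
def synthesize(text: str) -> bytes: ...         # text-to-speech (OpenVoice)
def play(audio: bytes) -> None: ...             # output to the stream

def run_pipeline() -> None:
    while True:
        audio = record_microphone()
        text = transcribe(audio)
        reply = generate_response(text)
        play(synthesize(reply))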


Installation

Tutorial (this is now outdated, but may still help some)

First, clone this repository, then clone the OpenVoice TTS repository inside it:

git clone https://github.com/neurokitti/AIRIS-VtuberAI.git
cd AIRIS-VtuberAI
git clone https://github.com/myshell-ai/OpenVoice.git

Next, create a virtual environment (.venv), activate it (on Windows: .venv\Scripts\activate; on Linux/macOS: source .venv/bin/activate), and install requirements.txt (the one from this repo, not the OpenVoice repo):

python -m venv .venv
pip install -r requirements.txt

Next, install PyTorch (get the correct command for your system from pytorch.org). Then delete all the files (but not the folders) in the OpenVoice folder, and drag the files from the Vtuber Project into the OpenVoice repository. Do not drag the system prompt files into the repo, though.
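If you would rather do that cleanup step from Python than by hand, something like the following removes the top-level files in the OpenVoice folder while keeping its subfolders (a convenience sketch, not part of the repo):

# Delete the files (but not the folders) directly inside OpenVoice/.
from pathlib import Path

for entry in Path("OpenVoice").iterdir():
    if entry.is_file():
        entry.unlink()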


Finally, install OBS WebSocket and set the WebSocket password to be the same as the one in the startup_scripts.py file.
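For reference, connecting to that WebSocket server from Python with the obs-websocket-py library might look like the sketch below. This is an illustration under assumptions (library choice, default port 4455), not necessarily how startup_scripts.py does it; the point is that the password must match in both places.

# Illustrative sketch using obs-websocket-py (pip install obs-websocket-py).
# Assumes the OBS WebSocket server is listening on the default port 4455.
from obswebsocket import obsws, requests

OBS_PASSWORD = "change_me"  # must match the password in startup_scripts.py

ws = obsws("localhost", 4455, OBS_PASSWORD)
ws.connect()
print(ws.call(requests.GetVersion()))  # sanity check that the connection works
ws.disconnect()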

Usage

To run this project, simply run the main file. To run interview mode instead, just uncomment it:

from startup_scripts import main_chat, main_interview

if __name__ == "__main__":
    main_chat()  # chat mode: interacts with the stream chat, but will not respond to you over the mic
    #main_interview()  # interview mode: does not read chat, and instead responds to anyone on the stream over the mic

You may also want to edit the project to better suit your needs; in that case, look in the startup_scripts.py file.

Finally, to run the project, run the main.py file with the mode you want uncommented.

Benchmark

UPDATE: I tested this on a GTX 745 (4 GB of VRAM) and had about 7 seconds of delay. The metrics in this section cover the full project, including the overhead from running OBS and VTube Studio. All of these tests were run on the GPU and used the Phi-3-mini-4k-instruct model from Microsoft.

NOTE: I have not fully tested response time; for reference, it is between 1 and 2 seconds.

Time to First Token: Interview Mode

Whisper Model   | Precision    | Language Model         | Quantization | Max. GPU Memory | Response Time
tiny            | int8_float16 | Phi-3-mini-4k-instruct | 4-bit        | tbd             | tbd
tiny            | int8_float16 | Phi-3-mini-4k-instruct | 8-bit        | tbd             | tbd
tiny            | int8_float16 | Phi-3-mini-4k-instruct | full         | tbd             | tbd
distil-large-v3 | int8_float16 | Phi-3-mini-4k-instruct | 4-bit        | tbd             | tbd
distil-large-v3 | int8_float16 | Phi-3-mini-4k-instruct | 8-bit        | tbd             | tbd
distil-large-v3 | int8_float16 | Phi-3-mini-4k-instruct | full         | tbd             | tbd

Executed with CUDA 12.1 on an NVIDIA laptop RTX 4080 with 12 GB of VRAM.

Time to First Token: Chat Mode

Whisper Model   | Precision    | Language Model         | Quantization | Max. GPU Memory | Response Time
tiny            | int8_float16 | Phi-3-mini-4k-instruct | 4-bit        | tbd             | tbd
tiny            | int8_float16 | Phi-3-mini-4k-instruct | 8-bit        | tbd             | tbd
tiny            | int8_float16 | Phi-3-mini-4k-instruct | full         | tbd             | tbd
distil-large-v3 | int8_float16 | Phi-3-mini-4k-instruct | 4-bit        | tbd             | tbd
distil-large-v3 | int8_float16 | Phi-3-mini-4k-instruct | 8-bit        | tbd             | tbd
distil-large-v3 | int8_float16 | Phi-3-mini-4k-instruct | full         | tbd             | tbd

Executed with CUDA 12.1 on an NVIDIA laptop RTX 4080 with 12 GB of VRAM.
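For context, loading the two models at the precisions listed in these tables could look like the sketch below, using faster-whisper (whose compute types include int8_float16) and Hugging Face transformers with bitsandbytes for the 4-bit case. This is an assumption about tooling, not necessarily the loading code this repo uses.

import torch
from faster_whisper import WhisperModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Whisper at the int8_float16 compute type from the tables above.
stt = WhisperModel("distil-large-v3", device="cuda", compute_type="int8_float16")

# Phi-3-mini-4k-instruct quantized to 4-bit; the "8-bit" rows would use
# load_in_8bit instead, and the "full" rows would omit quantization entirely.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
llm = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb,
    device_map="auto",
)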

Coming Soon

Join Our Community That Doesn't Exist

Discord | YouTube

Contact

neurokitti42@gmail.com