Integrating TTS open-source solutions (Custom coqui-TTS, AllTalkTTS,..)

240db commented 1 month ago

Hey I just ran across this on Reddit, awesome initiative and thanks for sharing it.

I was just checking the notebook for the implementation, it appears to be using OpenAI or Eleven Labs for the TTS ? I was thinking of implementing a project like coqui-TTS or AllTalkTTS. We are using Veed.io for production, but im about to upgrade coqui-tts fine tuning it for portuguese as well as fine tuning it to specific speakers to get some specialized voices; that way I can go back to using a solution with API.

Just by making the coqui-TTS run with a gradio interface, its already callable with python or javascript, so it should be easy to implement baseline coqui-TTS. I will try to provide the code later this evening I hope.

Upvote & Fund

We're using Polar.sh so you can upvote and help fund this issue.
We receive the funding once the issue is completed & confirmed by you.
Thank you in advance for helping prioritize & fund our backlog.

240db commented 1 month ago

this is a rough sketch,to get TTS setup, you have to get python 3.9 or later. I used pyenv for this:

install python 3.9.17 and build TTS from source

pyenv install 3.9.17
pyenv local 3.9.17
mkdir TextToSpeech
cd TextToSpeech && git clone https://github.com/coqui-ai/TTS
cd TTS
pip install -e .[all,dev,notebooks]  # Select the relevant extras

then modify file text_to_speech.py to something like this:

"""
Text-to-Speech Module with Coqui TTS integration

This module provides functionality to convert text into speech using various TTS models,
including Coqui TTS, ElevenLabs, and OpenAI TTS services.
"""

import logging
from podcastfy.utils.config import load_config
from pydub import AudioSegment
import os
import re
from typing import List, Tuple, Optional, Union
from TTS.api import TTS  # Import Coqui TTS

logger = logging.getLogger(__name__)

class TextToSpeech:
    def __init__(self, model: str = 'coqui', api_key: Optional[str] = None):
        """
        Initialize the TextToSpeech class.

        Args:
            model (str): The model to use for text-to-speech conversion. 
                         Options are 'coqui', 'elevenlabs' or 'openai'. Defaults to 'coqui'.
            api_key (Optional[str]): API key for ElevenLabs or OpenAI services (not needed for Coqui).
        """
        self.model = model.lower()
        self.config = load_config()
        self.tts_config = self.config.get('text_to_speech')

        self.audio_format = self.tts_config['audio_format']
        self.temp_audio_dir = self.tts_config['temp_audio_dir']
        self.ending_message = self.tts_config['ending_message']

        if self.model == 'coqui':
            # Initialize Coqui TTS with a pre-trained model
            self.coqui_tts = TTS(self.tts_config['coqui']['model_name'])
        elif self.model == 'elevenlabs':
            self.api_key = api_key or self.config.ELEVENLABS_API_KEY
            self.client = elevenlabs_client.ElevenLabs(api_key=self.api_key)
        elif self.model == 'openai':
            self.api_key = api_key or self.config.OPENAI_API_KEY
            openai.api_key = self.api_key
        else:
            raise ValueError("Invalid model. Choose 'coqui', 'elevenlabs', or 'openai'.")

        # Create temp_audio_dir if it doesn't exist
        if not os.path.exists(self.temp_audio_dir):
            os.makedirs(self.temp_audio_dir)

    def __merge_audio_files(self, input_dir: str, output_file: str) -> None:
        """Merge all audio files in the input directory sequentially and save the result."""
        try:
            def natural_sort_key(filename: str) -> List[Union[int, str]]:
                return [int(text) if text.isdigit() else text for text in re.split(r'(\d+)', filename)]

            combined = AudioSegment.empty()
            audio_files = sorted(
                [f for f in os.listdir(input_dir) if f.endswith(f".{self.audio_format}")],
                key=natural_sort_key
            )
            for file in audio_files:
                if file.endswith(f".{self.audio_format}"):
                    file_path = os.path.join(input_dir, file)
                    combined += AudioSegment.from_file(file_path, format=self.audio_format)

            combined.export(output_file, format=self.audio_format)
            logger.info(f"Merged audio saved to {output_file}")
        except Exception as e:
            logger.error(f"Error merging audio files: {str(e)}")
            raise

    def convert_to_speech(self, text: str, output_file: str) -> None:
        """
        Convert input text to speech and save as an audio file.

        Args:
            text (str): Input text to convert to speech.
            output_file (str): Path to save the output audio file.
        """
        cleaned_text = self.clean_tss_markup(text)

        if self.model == 'coqui':
            self.__convert_to_speech_coqui(cleaned_text, output_file)
        elif self.model == 'elevenlabs':
            self.__convert_to_speech_elevenlabs(cleaned_text, output_file)
        elif self.model == 'openai':
            self.__convert_to_speech_openai(cleaned_text, output_file)

    def __convert_to_speech_coqui(self, text: str, output_file: str) -> None:
        """Convert text to speech using Coqui TTS."""
        try:
            qa_pairs = self.split_qa(text)
            audio_files = []
            counter = 0

            for question, answer in qa_pairs:
                for speaker, content in [("question", question), ("answer", answer)]:
                    counter += 1
                    file_name = f"{self.temp_audio_dir}{counter}.{self.audio_format}"

                    # Coqui TTS synthesizes the speech and saves directly to the file
                    tts_voice = self.tts_config['coqui']['voices'][speaker]
                    self.coqui_tts.tts_to_file(text=content, speaker=tts_voice, file_path=file_name)

                    audio_files.append(file_name)

            # Merge the individual audio files
            self.__merge_audio_files(self.temp_audio_dir, output_file)

            # Clean up temporary audio files
            for file in audio_files:
                os.remove(file)

            logger.info(f"Audio saved to {output_file}")

        except Exception as e:
            logger.error(f"Error converting text to speech with Coqui TTS: {str(e)}")
            raise

    # Other methods remain unchanged (e.g., __convert_to_speech_elevenlabs, __convert_to_speech_openai, split_qa, clean_tss_markup)

def main(seed: int = 42) -> None:
    """Main function to test the TextToSpeech class with Coqui TTS."""
    try:
        config = load_config()

        # Read input text
        with open('tests/data/response.txt', 'r') as file:
            input_text = file.read()

        # Test Coqui TTS
        tts_coqui = TextToSpeech(model='coqui')
        coqui_output_file = 'tests/data/response_coqui.mp3'
        tts_coqui.convert_to_speech(input_text, coqui_output_file)
        logger.info(f"Coqui TTS completed. Output saved to {coqui_output_file}")

    except Exception as e:
        logger.error(f"An error occurred during text-to-speech conversion: {str(e)}")
        raise

if __name__ == "__main__":
    main(seed=42)

souzatharsis commented 1 month ago

this sounds good; would you like to push a PR? @240db

souzatharsis commented 1 month ago

We are just about to add support for Microsoft Edge's TTS https://github.com/souzatharsis/podcastfy/pull/46

240db commented 1 month ago

this sounds good; would you like to push a PR? @240db

Yeah, sure, I would try to provide a fork for the dependencies and some documentation first. I am still trying an open source pipeline but what is good here is the ability to use xtts from coqui-tts which is multilingual, then we can finetune models for it using a gradio webui, and import back to what i think will be a gui to choose from OpenAI, ElevenLabs or several xtts models fine-tuned to different speakers. I also need to look into VITS because it is used in RVC as a dependency, and RVC is not very stable over here yet. If we get the VITS to work is a win.

On a different note, to generate music or effects, you can also add audiocraft but there are lots of considerations as this is more like a dedicated solution / route as opposed to implementing a cloud solution that can possibly run from simpler computer clients

souzatharsis commented 4 weeks ago

Another advantage besides open and free of xtts_v2 is that you can enable voice cloning.

https://huggingface.co/coqui/XTTS-v2

from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# generate speech by cloning a voice using default settings
tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
                file_path="output.wav",
                speaker_wav="/path/to/target/speaker.wav",
                language="en")

240db commented 3 weeks ago

Hey @souzatharsis, sorry for the wait. Yeah, so the coqui TTS is the framework and they have the xtts model that can do quick one-shot cloning. The quality is okay and ideal for low vram. Let me show you a sample https://open.spotify.com/episode/2vxZFcVyG9Q280cDUVsnGs

There is a higher quality model alternative by using a fine tuned model. Here's a test I have done using the fine tuning with coqui-tts. . I used audio from this season of the show where the speaker recorded the show. The model is very sensitive to any noise, reverb rt60, so you can prepare the dataset with the best samples to make the model much cleaner. It does not require a lot of data, 15 minutes can get you started. I have larger datasets but I havent trained them yet, this model used around 5 hours but im gonna try other things too - I will put this at the end...

Getting fine tuned model

Here is how you would do it, Coqui's framework allows to train from scratch and fine tune the baseline models, so we can train and load the fine tuned model with coqui's TTS python library. XTTS is already multilingual and very robust, I am demonstrating only XTTSv2.0.2 here. You basically load your fine tuned model as the main model, it will capture a lot more characteristics.

Tuning a model

so you can train with a gradio app like this, and since its gradio, it could run in a separate environment isolated and serve the 'training tab' integrated into podcastify through the api endpoints with gradio_client.

Inference

Here is how one would load your model (from the XTTS doc) :

import os
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading model...")
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

print("Inference...")
out = model.inference(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7, # Add custom parameters here
)
torchaudio.save("xtts.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)

So like the first example: instead of using that quick one-shot cloning, you may use the tuned xtts model for a much better quality. You can fine tune in a matter of minutes, you just need at least 16gbVRAM, once you have that, inference will require less than 12gb VRAM.

Other techniques

https://github.com/240db/Retrieval-based-Voice-Conversion-WebUI

So another way to do the cloning much more efficiently is to actually use a model, something like eleven labs, chatgpt, or even your fine tuned XTTS then RVC it (using another model) ... to make it follow the pitch but with the modeled voice. It is a lot more buggy but the inference is at least 50x faster, it's incredible and they have a real-time app that should work too.

I forked the project because the gradio version it had as buggy, so I changed the requirements file, it's not very active so as of now my fork runs properly

souzatharsis / podcastfy