rendchevi / nix-tts

🐤 Nix-TTS: Lightweight and End-to-end Text-to-Speech via Module-wise Distillation
MIT License
233 stars 31 forks

Audio longer than 15 seconds? #16

Open JHenzi opened 1 year ago

JHenzi commented 1 year ago

I like the voice this model creates, but I can't get it to output fluid speech that is longer than 15 seconds. Past that point the audio gets garbled and loses fidelity. It could be an out-of-memory problem, but the process isn't being killed and the system doesn't hang.

This is the code I'm using, by the way:

from nix.models.TTS import NixTTSInference
# from IPython.display import Audio
import wave
import numpy as np

# Initiate Nix-TTS
nix = NixTTSInference(model_dir="/docker/nix-tts/")
# Load the prompt.txt file from the local directory; this file contains the text to be spoken by the model
with open('/docker/nix-tts/prompt.txt', 'r') as file:
    prompt_text = file.read()

# Tokenize input text
c, c_length, phoneme = nix.tokenize(prompt_text)
# Convert text to raw speech
xw = nix.vocalize(c, c_length)

# Listen to the generated speech
# Audio(xw[0, 0], rate=22050)

with wave.open('output.wav', 'wb') as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(22050)
    # Clip to [-1, 1] and scale to 16-bit PCM so loud samples cannot overflow int16
    wav_file.writeframes((32767 * np.clip(xw, -1.0, 1.0)).astype(np.int16).tobytes())

I'm tempted to generate 15-second audio files, in the order I want them played back, and build up the entire script for the video I'm working on that way. I like how the model pronounces a lot of words when they're typed correctly.
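If the script does end up as a series of short WAV files, they could be joined afterwards with Python's standard-library wave module. This is only a minimal sketch, assuming every file shares the same channel count, sample width, and sample rate (as files written by the code above would); the function name concat_wavs and the file names in the usage line are hypothetical.

```python
import wave


def concat_wavs(paths, out_path):
    """Concatenate WAV files that share one format into a single file."""
    params = None
    frames = []
    for path in paths:
        with wave.open(path, "rb") as src:
            fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                # Mixing formats would produce audio at the wrong speed or pitch
                raise ValueError(f"{path} has format {fmt}, expected {params}")
            frames.append(src.readframes(src.getnframes()))
    if params is None:
        raise ValueError("no input files given")

    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(params[0])
        dst.setsampwidth(params[1])
        dst.setframerate(params[2])
        for chunk in frames:
            dst.writeframes(chunk)
```

Usage would look something like concat_wavs(["part_01.wav", "part_02.wav"], "full_script.wav"), with the parts listed in playback order.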

FosanzDev commented 11 months ago

Try using my approach. It solves the issue by splitting the text into smaller parts.
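The general idea of splitting the text can be sketched as follows. This is not necessarily the approach referred to above, just one way it might be done: cut the prompt at sentence boundaries, synthesize each chunk on its own, and join the waveforms. The helper names split_into_chunks and vocalize_long are hypothetical, the 200-character limit is a guess that would need tuning to stay under the point where output degrades, and the code assumes nix.vocalize returns a NumPy array indexable as xw[0, 0], as the commented-out Audio(...) line in the original post suggests.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Split text at sentence boundaries into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than cut mid-word.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def vocalize_long(nix, text, max_chars=200):
    """Synthesize each chunk separately and concatenate the waveforms.

    `nix` is expected to be a NixTTSInference instance (assumption: tokenize
    and vocalize behave as in the code posted earlier in this thread).
    """
    import numpy as np  # only needed when actually synthesizing

    pieces = []
    for chunk in split_into_chunks(text, max_chars):
        c, c_length, _phoneme = nix.tokenize(chunk)
        xw = nix.vocalize(c, c_length)
        pieces.append(xw[0, 0])  # 1-D waveform for this chunk
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(pieces)
```

The joined array can then be written out the same way as a single xw in the original code. Splitting on sentence boundaries rather than at a fixed character count keeps each chunk's intonation natural, though a short pause between chunks may still be audible.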