serp-ai / bark-with-voice-clone

🔊 Text-prompted Generative Audio Model - With the ability to clone voices
https://serp.ai/tools/bark-text-to-speech-ai-voice-clone-app

Bad Performance of Voice Cloning #63

Open souvikqb opened 10 months ago

souvikqb commented 10 months ago

I am using the https://github.com/serp-ai/bark-with-voice-clone/blob/main/clone_voice.ipynb notebook to generate audio clips that sound like a reference clip I provide.

While the code ran fine, the resulting audio file was not very good. I am using speakers with common American and British accents.

Any tips for tuning the model to get correct results, or any parameters to play with?


import sys
sys.path.append('./bark-voice-cloning-HuBERT-quantizer')
import os
from pydub import AudioSegment
from scipy.io.wavfile import write as write_wav
import numpy as np
import torch
import torchaudio
from bark.api import generate_audio
from bark.generation import SAMPLE_RATE, preload_models, load_codec_model
from encodec.utils import convert_audio
from bark_hubert_quantizer.customtokenizer import CustomTokenizer
from bark_hubert_quantizer.hubert_manager import HuBERTManager
from bark_hubert_quantizer.pre_kmeans_hubert import CustomHubert

preload_models(
    text_use_gpu=True,
    text_use_small=False,
    coarse_use_gpu=True,
    coarse_use_small=False,
    fine_use_gpu=True,
    fine_use_small=False,
    codec_use_gpu=True,
    force_reload=False
)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = load_codec_model(use_gpu=True if device == 'cuda' else False)

hubert_manager = HuBERTManager()
hubert_manager.make_sure_hubert_installed()
hubert_manager.make_sure_tokenizer_installed()

# Load the HuBERT model
hubert_model = CustomHubert(checkpoint_path='data/models/hubert/hubert.pt').to(device)

# Load the CustomTokenizer model
tokenizer = CustomTokenizer.load_from_checkpoint('data/models/hubert/tokenizer.pth', map_location=device).to(device)

"""# Inference"""

text_prompt = 'Hello! How are you?, I am Monster from Monster. I make AI Models for all of you here at Blocks and I am really excited about it. I make Generative AI accessible to all' #@param {type:"string"}
audio_filepath = r'/home/qblocks/Cloning/CA_AG_Kamala_Harris_2013_CADEM_Convention.webm' #@param {type:"string"}

def trim_and_convert_audio(input_path, output_path, target_duration_ms=30000):
    # Load the audio file
    print("Loading Audio File:", input_path)
    audio = AudioSegment.from_file(input_path)
    # Get the duration of the audio in milliseconds
    audio_duration = len(audio)
    # Trim the audio to the target duration
    if audio_duration > target_duration_ms:
        trimmed_audio = audio[:target_duration_ms]
    else:
        trimmed_audio = audio
    # Save the trimmed audio as a WAV file
    trimmed_audio.export(output_path, format="wav")
    print("Trimmed audio saved as:", output_path)

output_audio_path = "converted_audio.wav"  
trim_and_convert_audio(audio_filepath, output_audio_path)

if not os.path.isfile(output_audio_path):
    raise ValueError(f"Audio file does not exist ({output_audio_path})")

wav, sr = torchaudio.load(output_audio_path)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.to(device)

# Extract HuBERT features and quantize them into Bark semantic tokens
semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
semantic_tokens = tokenizer.get_token(semantic_vectors)

# Extract discrete codes from EnCodec
with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()

# move codes to cpu
codes = codes.cpu().numpy()
# move semantic tokens to cpu
semantic_tokens = semantic_tokens.cpu().numpy()

voice_filename = 'output3.npz'
current_path = os.getcwd()
voice_name = os.path.join(current_path, voice_filename)

# Save the voice prompt: all EnCodec codebooks as the fine prompt, the first two
# codebooks as the coarse prompt, and the quantized HuBERT tokens as the semantic prompt
np.savez(voice_name, fine_prompt=codes, coarse_prompt=codes[:2, :], semantic_prompt=semantic_tokens)

# simple generation
audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.8, waveform_temp=0.8)

# save audio
filepath = "out5.wav" # change this to your desired output path
write_wav(filepath, SAMPLE_RATE, audio_array)
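
For reference, the two sampling parameters the snippet above passes to generate_audio are text_temp and waveform_temp. A minimal sweep over lower values is sketched below; the value grid and output filenames are assumptions, not settings recommended by the repo.

# A small temperature sweep (assumed values, not from the notebook). Lower
# temperatures make sampling less random, which can keep the generated voice
# closer to the reference prompt.
for t_temp in (0.6, 0.7, 0.8):
    for w_temp in (0.6, 0.7, 0.8):
        audio_array = generate_audio(
            text_prompt,
            history_prompt=voice_name,
            text_temp=t_temp,
            waveform_temp=w_temp,
        )
        write_wav(f"out_t{t_temp}_w{w_temp}.wav", SAMPLE_RATE, audio_array)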

mathieu-duverne commented 10 months ago

Is your audio input 24-bit?

souvikqb commented 10 months ago

I passed in webm and mp3 files. How do I check this?

souvikqb commented 10 months ago

Is your audio input 24-bit?

I'm using this file: https://upload.wikimedia.org/wikipedia/commons/c/c5/CA_AG_Kamala_Harris_2013_CADEM_Convention.webm

Can you elaborate more?

BrasD99 commented 10 months ago

Can you elaborate more?

1. I wrote some code to isolate the WAV from your video:

    from pydub import AudioSegment

    def convert_webm_to_wav(input_file, output_file):
        audio = AudioSegment.from_file(input_file, format="webm")
        audio.export(output_file, format="wav")

    def crop_audio(input_file, output_file, seconds):
        audio = AudioSegment.from_wav(input_file)
        processed_audio = audio[:seconds * 1000]
        processed_audio.export(output_file, format="wav")

    Usage:

    input_webm = 'CA_AG_Kamala_Harris_2013_CADEM_Convention.webm'
    converted_wav = 'converted.wav'
    cropped_wav = 'cropped.wav'
    seconds_to_crop = 60

    convert_webm_to_wav(input_webm, converted_wav)
    crop_audio(converted_wav, cropped_wav, seconds_to_crop)


2. We can determine the bit depth of the audio in code (a short sketch follows this list), but I used a website instead (I was too lazy to write code ;)): https://www.advalify.io/audio-validator

![audio validator result](https://github.com/serp-ai/bark-with-voice-clone/assets/36342074/e1479ef6-4c57-43cf-9abd-c89f2136c79d)

3. Your audio is 32-bit:

![audio bit depth](https://github.com/serp-ai/bark-with-voice-clone/assets/36342074/a5ea7610-993c-4643-9fd2-1754ed047afd)

4. Use this to convert it to 24-bit: https://onlineaudioconverter.com/
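
For completeness, a minimal in-code check is sketched below (an assumption, not part of the notebook: it uses torchaudio, which the notebook already imports, and it expects a clip that has already been converted to WAV, since compressed formats like webm may not report a bit depth):

import torchaudio

# Inspect the header of the converted WAV, e.g. the file produced by convert_webm_to_wav above
meta = torchaudio.info("converted.wav")
print(meta.sample_rate, meta.num_channels, meta.bits_per_sample)
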
souvikqb commented 10 months ago

I see.

Thanks for taking the trouble.

But how should I use this to improve the video cloning performance?

BrasD99 commented 10 months ago

But how should I use this to improve the video cloning performance?

If I understand correctly, do you want to make a deepfake for a video with a voice change?

If yes, here is the code to convert to 24 bits (https://stackoverflow.com/questions/44812553/how-to-convert-a-24-bit-wav-file-to-16-or-32-bit-files-in-python3):

import soundfile

input_wav = 'input.wav' # Maybe 32 bit?
output_wav = 'output.wav'

data, samplerate = soundfile.read(input_wav)
soundfile.write(output_wav, data, samplerate, subtype='PCM_24')
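
A quick sanity check after the conversion (a small sketch; it just reads the header of the new file):

# Confirm the rewritten file really is 24-bit PCM before feeding it back into the cloning notebook
print(soundfile.info(output_wav).subtype)  # expected: 'PCM_24'
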
souvikqb commented 10 months ago

Yes, that's a possibility.

But for now I would just like a voice-cloned audio file.

Say, reading a normal speech but in a celebrity's or a user-defined speaker's voice.

Does converting it to 24-bit help the voice cloning process?

BrasD99 commented 10 months ago

@souvikqb In fact, I have run into difficulties cloning a voice myself. Unfortunately, my question has not been answered either, but the 24-bit conversion gives me a little hope of success. I will try it on my own data...

souvikqb commented 10 months ago

Thanks 👍

Do let me know if you find anything.

Also, can we tag the owner of this repository?

BrasD99 commented 10 months ago

@souvikqb I think we can tag Francis @francislabountyjr.

I'm also stuck on my issue ;( #49

BrasD99 commented 10 months ago

@souvikqb How can I contact you? I found another solution (from another project). I won't post it here because it does not apply to this project.

souvikqb commented 10 months ago

@souvikqb How can I contact you? I found another solution (from another project). I won't post it here because it does not apply to this project.

Please email me at autocar2060 @ gmail . com

rajashekhargithubb commented 10 months ago

@souvikqb @BrasD99, if you succeeded in generating better voice clones, could you please post your outputs here?

Shyk92 commented 8 months ago

@BrasD99 Is it simply the bit depth difference causing the issue? I'd love to hear if there are other factors one could use to improve the clone.

I'm having difficulty getting good results even with my own voice.

Literally only one time out of a handful did I hear my voice, and it was a single "umm" at the start before it switched back to someone who does not sound like me, haha.

platform-kit commented 7 months ago

@Shyk92 did you ever make progress on this? I'm in the same boat.

aryansid commented 2 weeks ago

I'm facing the same issue.