Open apresence opened 2 months ago
Here is an example with the new voice steering feature on, and one with it off.
It's a simple on/off setting. Other than that, you'd use Parler-TTS just like you normally would. I've also added the ability to save voices you like so you can reuse them later, even between program executions.
Again, each sentence is a separate generation. With steering on, the voice consistency is pretty good. With it off, it varies considerably.
The model, voice description, seed, etc. are all the same between the two examples, only the new steering feature was turned on or off.
Here's an updated voice clone example. I had used mini before because its output is more consistent. Although it doesn't sound as good, I was able to one-shot it.
Large takes a lot of wrangling to get it to behave, so it took a few passes. It could be that my source audio is not good enough (background hum, mic pops, echoes).
Anyway, this is pretty good for a quick POC!
I got Parler-TTS zero-shot crying now. Check it out here. 100% of this audio was generated by Parler-TTS, along with some light editing in Audacity.
I'm just having way too much fun with Parler. Did a radio DJ voice for a fake podcast I call Under the Covers with ImcE™. Check it out here. ImcE's voice was generated by Parler-TTS, even the parts where he fumbles his speech and does his vocal warm-up. The singing was generated with RVC. Audio clips came from the same YouTube interview I mentioned earlier.
PR #141 submitted. This is in preparation for the voice steering feature.
Could you provide information about how to implement voice consistency in audio?
suman819 Could you provide information about how to implement voice consistency in audio?
I'm working on code to do that. In the meantime, I submitted a PR that is required.
When I'm done there will be a working example to start from.
Soon!
Thank you for the support and update!
This is a working snippet to continue audio in the same style as a speaker. In the snippet, we require:
- init_audio_file: path to an audio file (init_audio.wav) containing the speaker's voice
- init_prompt: what the speaker has said
Currently, this is working with either of the following PRs:
Credit to @ylacombe (from https://github.com/huggingface/parler-tts/pull/110#discussion_r1745381980) for fleshing out this snippet - I've just manipulated it a bit for simplicity.
import soundfile as sf
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoTokenizer, set_seed
from parler_tts import ParlerTTSForConditionalGeneration
# TODO: Adapt the following as per your requirements
init_audio_file = "path/to/init_audio.wav"
init_prompt = "Here, write the transcript of the init audio"
description = (
"A man speaker speaks quickly with a low-pitched voice. "
"The recording is of very high quality, with the speaker's voice sounding clear and very close up."
)
prompt = "Is it really working ?"
# Load the Models
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "parler-tts/parler-tts-mini-v1"
model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
SAMPLING_RATE = model.config.sampling_rate
# Load the init audio
init_audio, init_sr = torchaudio.load(init_audio_file)
init_audio = torchaudio.functional.resample(init_audio, init_sr, SAMPLING_RATE)
init_audio = init_audio.mean(0) # Take the mean across the channel dim
# Encode the init audio using the feature extractor
input_values = feature_extractor(init_audio, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_values.to(device)
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
# NOTE: concatenate the init_prompt and prompt when passing into the model
prompt_input_ids = tokenizer(init_prompt + " " + prompt, return_tensors="pt").input_ids.to(device)
set_seed(2)
# Generate the audio
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, input_values=input_values)
# Save the audio
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, SAMPLING_RATE)
Thanks for sharing! Weirdly it was working for me with LibriSpeech dev-clean recordings (16k) but not my own (44k). But somehow, downsampling to 16k and then back up to 44k fixes this! Strange behaviour though, maybe something to do with how the data was preprocessed during training...
Effectively I added
init_audio = torchaudio.functional.resample(init_audio, init_sr, 16_000)
init_audio = torchaudio.functional.resample(init_audio, 16_000, SAMPLING_RATE)
to the snippet.
(Edit: by "not working", I mean generating <1s of audio without and speech, just a random sound effectively)
Entering the final lap here, I think, on releasing this code.
One issue I'm seeing is that I get the following from time to time when using voice cloning. I haven't had a chance to look into it yet as I'm focusing on getting the rest of it shored up. Any ideas?
2024-10-03 04:46:17,708 [Thread-11 (_] [ERROR] Exception during generation request f14d6f81-7b14-4736-b7ca-55471ebfc923 with {'prompt_input_ids': tensor([[ 6185, 13830, 1423, 24, 48, 19, 125, 3808, 20253, 7,
103, 12, 151, 6, 902, 887, 5, 101, 174, 12,
214, 230, 55, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
device='cuda:0'), 'prompt_attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0]], device='cuda:0'), 'input_ids': tensor([[ 71, 2335, 12192, 44, 46, 1348, 4974, 28, 46, 16822,
1929, 16, 3, 9, 182, 3, 24092, 1345, 53, 1164,
28, 964, 2931, 463, 5, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0]], device='cuda:0'), 'streamer': <parler_tts.streamer.ParlerTTSStreamer object at 0x783572757fd0>, 'min_new_tokens': 10, 'input_values': tensor([[[0.0008, 0.0030, 0.0032, ..., 0.0028, 0.0019, 0.0011]]],
device='cuda:0')}:
Traceback (most recent call last):
  File "/app/parts/cli/parcls.py", line 1594, in _generation_thread_fn
    _ = self.model_inst.generate(**gt.generation_kwargs)
  File "/home/appuser/miniconda3/envs/parts/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/app/parts/repo/parler_tts/modeling_parler_tts.py", line 3500, in generate
    output_ids = output_ids[mask].reshape(batch_size, self.decoder.num_codebooks, -1)
RuntimeError: shape '[1, 9, -1]' is invalid for input of size 4715
Hi, I tried to use it for voice consistency and ran it like this:
import soundfile as sf
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoTokenizer, set_seed
from parler_tts import ParlerTTSForConditionalGeneration
# TODO: Adapt the following as per your requirements
init_audio_file = "response_good_emotion.wav"
init_prompt = "Here, write the transcript of the init audio"
description = (
"A man speaker speaks quickly with a low-pitched voice. "
"The recording is of very high quality, with the speaker's voice sounding clear and very close up."
)
prompt = "Is it really working ?"
# Load the Models
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "parler-tts/parler-tts-mini-v1"
model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
SAMPLING_RATE = model.config.sampling_rate
# Load the init audio
init_audio, init_sr = torchaudio.load(init_audio_file)
init_audio = torchaudio.functional.resample(init_audio, init_sr, SAMPLING_RATE)
init_audio = init_audio.mean(0) # Take the mean across the channel dim
# Encode the init audio using the feature extractor
input_values = feature_extractor(init_audio, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_values.to(device)
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
# NOTE: concatenate the init_prompt and prompt when passing into the model
prompt_input_ids = tokenizer(init_prompt + " " + prompt, return_tensors="pt").input_ids.to(device)
set_seed(2)
# Generate the audio
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, input_values=input_values)
# Save the audio
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, SAMPLING_RATE)
but got the error: TypeError: DACModel.encode() got an unexpected keyword argument 'input_ids'
@lukaLLM which branch r u on? You need to be in the aforementioned branch to be able to run this code.
Okay, my bad, there was a dependency issue I didn't see and it was not updated properly, so I need to change my program. @Guppy16 I updated using pip install git+https://github.com/huggingface/parler-tts.git but still get the same error. Should I do it differently? When I check parler_tts.version I get version 0.2.
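For what it's worth, installing straight from the main repo won't pick up an unmerged PR; you would need to install the PR branch itself. A hypothetical example (using #141 here only because it's the PR named earlier in this thread; substitute whichever PR/branch actually applies):
pip install git+https://github.com/huggingface/parler-tts.git@refs/pull/141/head
pip can install any git ref this way, including GitHub pull-request heads.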
Thanks for sharing! Weirdly it was working for me with LibriSpeech dev-clean recordings (16k) but not my own (44k). But somehow, downsampling to 16k and then back up to 44k fixes this! Strange behaviour though, maybe something to do with how the data was preprocessed during training...
Effectively I added
init_audio = torchaudio.functional.resample(init_audio, init_sr, 16_000)
init_audio = torchaudio.functional.resample(init_audio, 16_000, SAMPLING_RATE)
to the snippet.
(Edit: by "not working", I mean it generates <1s of audio without any speech, just a random sound effectively)
I can confirm this. It still goes wonky sometimes, but certainly the unwanted artifacts/blank audio utterances are much less frequent. Of course, the audio quality isn't as good since you're getting 16/32kHz, just resampled to 44.1kHz.
FWIW, I tried 32kHz with similar results. I wonder if it's due to MusicGen (which Parler is based on) being designed for 32kHz? Or perhaps there were 16/32kHz samples in the dataset, thus there is a more diverse pool for the model to pull from. Maybe @ylacombe can comment on this?
@eustlb @ylacombe et al. --
Another little wrinkle with voice steering/cloning. It defeats compilation because the input_values don't support padding or an attention mask as-is. I'm sure it could be done, but I'm trying to focus on getting my code finished.
As an example that we've already covered, let's say I have padding set to 50 for text tokenization. Normally that would result in a guard failure because the cache_position is 1 the first pass, then 51 the second. I know to expect that now.
However, when we have the 50 padding and pass input_values, even more cache entries are created (189 in the following example). This results in another guard failure and another recompilation:
V1004 07:07:29.471000 124948412483136 torch/_dynamo/guards.py:2611] [0/2] [__recompiles] Recompiling function forward in /app/parts/repo/parler_tts/modeling_parler_tts.py:2576
V1004 07:07:29.471000 124948412483136 torch/_dynamo/guards.py:2611] [0/2] [__recompiles] triggered by the following guard failure(s):
V1004 07:07:29.471000 124948412483136 torch/_dynamo/guards.py:2611] [0/2] [__recompiles] - tensor 'L['cache_position']' size mismatch at index 0. expected 1, actual 189
V1004 07:07:29.471000 124948412483136 torch/_dynamo/guards.py:2611] [0/2] [__recompiles] - tensor 'L['cache_position']' size mismatch at index 0. expected 51, actual 189
This is tolerable as long as you are using the same input_values you compiled with, but as soon as you use different values, the cache size changes, resulting in a guard failure and another recompilation.
In other words, anyone using this would have to wait for compilation (which can take several minutes) any time they'd use a different voice with steering/cloning.
Any ideas?
OK, I retract my statement. It does not seem to trigger a recompile.
Thanks!
Scratch my scratch.
It recompiles when the length of input_values is longer than the previous length.
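One idea, untested and not something the library provides out of the box: since the recompiles are triggered by the length of input_values, padding or trimming the reference audio to one fixed duration before feature extraction would keep its shape (and therefore the encoded length) constant across voices. A minimal sketch, reusing the names from the snippet earlier in the thread (init_audio, SAMPLING_RATE, feature_extractor, device):
import torch

MAX_REF_SECONDS = 10  # pick one fixed reference length for every voice
max_samples = MAX_REF_SECONDS * SAMPLING_RATE

def fix_length(wav: torch.Tensor, n: int) -> torch.Tensor:
    """Truncate, or right-pad with silence, a mono waveform to exactly n samples."""
    if wav.shape[-1] >= n:
        return wav[..., :n]
    return torch.nn.functional.pad(wav, (0, n - wav.shape[-1]))

init_audio = fix_length(init_audio, max_samples)
input_values = feature_extractor(init_audio, sampling_rate=SAMPLING_RATE, return_tensors="pt").input_values.to(device)
Whether the trailing silence degrades the cloned voice would need to be checked, so treat this purely as an experiment.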
Thanks @Guppy16, my bad, you were right. I got it running without errors, but it works a bit differently. I tested both branches and created a fresh environment. Basically, it generates the same audio as in init_audio_file, and the text is the same as in the recorded audio (written out by sf.write("parler_tts_out.wav", audio_arr, SAMPLING_RATE)). So it doesn't use the prompt? I played with it in my code and generated different prompts, but then the voice was different every time, though a bit closer than before. Maybe I am doing something wrong there; could somebody explain? I just wanted to have a consistent voice for a continuous conversation.
I figured out the reshape error above. The output_ids look something like this:
tensor([[1025, 438, 438, ..., 1024, 1024, 1024],
[1025, 1025, 254, ..., 1024, 1024, 1024],
[1025, 1025, 1025, ..., 1024, 1024, 1024],
...,
[1025, 1025, 1025, ..., 1024, 1024, 1024],
[1025, 1025, 1025, ..., 417, 1024, 1024],
[1025, 1025, 1025, ..., 947, 720, 1024]], device='cuda:0')
Shape: [9, 1210]
What you're seeing there is the delay pattern mask. Look at the docstring for the function build_delay_pattern_mask for how that works. Anyway, when the mask is removed, which basically just removes the 1025 (BOS) tokens on the left and the 1024 (PAD) tokens on the right, you get a 1d tensor. Something like this:
tensor([438, 438, 698, ..., 741, 947, 720], device='cuda:0')
Shape: [10800]
Now, the next thing to do is to break it up into codebooks. As far as I can tell, num_codebooks is always 9. To do that, this code is executed:
output_ids = output_ids.reshape(batch_size, num_codebooks, -1)
And you end up with something like this:
tensor([[[438, 438, 698, ..., 698, 698, 438],
[254, 459, 954, ..., 232, 875, 937],
[689, 475, 106, ..., 612, 640, 30],
...,
[426, 522, 639, ..., 825, 116, 721],
[364, 520, 257, ..., 895, 236, 417],
[702, 223, 462, ..., 741, 947, 720]]], device='cuda:0')
That works all fine and dandy as long as the length of the 1d tensor is a multiple of 9. If it's not, you get that error about shape [1, 9, -1] being invalid. The 1 is the batch size, 9 is num_codebooks, and -1 picks up the length of the source tensor.
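A toy illustration of the constraint (arbitrary numbers, except the 4715 from the failing run above):
import torch

num_codebooks = 9
flat_ok = torch.arange(18)                          # 18 is a multiple of 9
print(flat_ok.reshape(1, num_codebooks, -1).shape)  # torch.Size([1, 9, 2])

flat_bad = torch.arange(4715)                       # 4715 % 9 == 8
# flat_bad.reshape(1, num_codebooks, -1)            # RuntimeError: shape '[1, 9, -1]' is invalid for input of size 4715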
So here's my proposed "fix", to be run after reverting the mask and before reshape:
num_codebooks = self.decoder.num_codebooks
rem_len = output_ids.size(0) % num_codebooks
if rem_len != 0:
# Calc how many pad tokens needed, then append them to the output_ids
pad_len = num_codebooks - rem_len
# Experimenting with different options for padding here, including just repeating the last token
pad_tok = output_ids[-1] # ... or generation_config.pad_token_id
output_ids = torch.cat((output_ids, pad_tok.expand(pad_len)), dim=0)
With that the error goes away, but the audio is always blank in the testing I've been doing. So there must be something else deeper going on ...
For all those folks who keep commenting about issues with voice steering/cloning: I can assure you, there are lots of little gotchas. I am working on code that takes care of all of them, and I will release it when the kinks are out. It makes a lot more sense for one person to do it and share than for 100 people to run into the same problems ;).
@apresence thanks for your good work, I like the project a lot, so I am rooting for you. It would be nice to contribute once I've learned a bit more.
Hey @lukaLLM, great to see that you have it working! Unfortunately, voice cloning is very difficult to get right with a "new" voice; this feature is better used to continue a voice that has already been generated. Here are some things you can try:
- use a voice the model was already trained on (e.g. Jenny or the default Parler-TTS voices)
- try a much longer enrolment
- keep the seed the same, e.g. by calling set_seed(42) before every generation
Hope this helps
FWIW, I tried 32kHz with similar results. I wonder if it's due to MusicGen (which Parler is based on) being designed for 32kHz? Or perhaps there were 16/32kHz samples in the dataset, thus there is a more diverse pool for the model to pull from.
@apresence I looked a bit more into this, and the training data seems to be 24kHz (LibriTTS-R) and 48kHz (MLS) so theoretically either of those should be fine. When training the model, both are resampled to 44kHz - so maybe the model just struggles when there is no resampling? I will have to try 48kHz input next.
Just wanted to update those who have been waiting: I have continued to actively work on this and hope to finish soon.
Thanks!
Many thanks to @Guppy16 and @apresence for your work on this!
If you're interested in working towards the next step, it'd be great to:
Would anyone be interested in contributing?
Yes, this is part of what I'm working on.
Thanks!
Hey, how is the progress on this so far? Any update?
I've personally been experimenting with both Parler-TTS and OpenVoice, and I've noticed that OpenVoice uses a separate step for its tone color converter, essentially taking in a base audio (generated by the TTS portion) and then changing it in a second step to the specific tone. Perhaps that is something you're already doing, or something that could be used in the meantime? I'm not a Python / AI developer whatsoever (I'm a Golang / TypeScript person who's very new to this), but I wonder if it would be possible to combine the two. I also wonder what kind of performance you would get.
Out of all the TTS engines I've tried, this one is by far the most performant and balanced overall. This is really quite amazing.
I've been tinkering a lot with this, and I think personally I might use RVC or something to change the voice to what I want it to be. Seems like a reliable option until this is fully working.
I got it working pretty well by picking the closest prompt in terms of speaker embedding from a random 1000-sample subset of parler-tts/libritts-r-filtered-speaker-descriptions.
A better way would be to use the exact procedure from huggingface/dataspeech for a single sample though. But I don't think there is an easy way to do this for just one sample.
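A rough sketch of that "closest description" idea (not the commenter's actual code; the speaker-embedding model is an arbitrary choice, here SpeechBrain's ECAPA, and the candidate pool below is a hypothetical list you would fill from whatever subset you sampled):
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier  # speechbrain >= 1.0; older versions: speechbrain.pretrained

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    # Mono, 16 kHz (what this ECAPA model expects), then one speaker embedding
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav.mean(0, keepdim=True), sr, 16_000)
    return encoder.encode_batch(wav).squeeze()

# Hypothetical candidate pool: (description text, path to a clip of that speaker)
candidates = [
    ("A female speaker with a slightly high-pitched voice delivers her words quite expressively.", "cand_0.wav"),
    ("A man speaks in a monotone voice, with a very close-sounding recording.", "cand_1.wav"),
]

target = embed("my_speaker.wav")  # the voice you want to approximate
best = max(candidates, key=lambda c: torch.nn.functional.cosine_similarity(target, embed(c[1]), dim=-1).item())
print("Closest description prompt:", best[0])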
Hey, it looks like this work should allow using Parler to generate longer text in chunks with consistency between each sample? If I'm following the thread properly, this was incorporated in #141, but I'm not seeing example code for how to actually do it.
See the snippet above, just use the previous chunk as init audio.
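A rough sketch of that chunk-chaining idea, reusing the names from the snippet above (model, tokenizer, feature_extractor, description, SAMPLING_RATE, device). This is only an illustration of the recipe described in this thread, not official example code; in particular, depending on the branch, the returned audio may repeat the init audio, in which case you would trim it before concatenating:
import numpy as np
import soundfile as sf
import torch

chunks = [
    "This is the first sentence of a longer passage.",
    "Here is the second sentence, which should keep the same voice.",
    "And a third one to round things off.",
]

description_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)

prev_audio, prev_text, outputs = None, "", []
for chunk in chunks:
    prompt_ids = tokenizer((prev_text + " " + chunk).strip(), return_tensors="pt").input_ids.to(device)
    gen_kwargs = {"input_ids": description_ids, "prompt_input_ids": prompt_ids}
    if prev_audio is not None:
        # The previous chunk's audio becomes the init audio for this generation
        gen_kwargs["input_values"] = feature_extractor(torch.tensor(prev_audio), sampling_rate=SAMPLING_RATE, return_tensors="pt").input_values.to(device)
    audio = model.generate(**gen_kwargs).cpu().numpy().squeeze()
    outputs.append(audio)  # trim the repeated init audio here if your branch includes it
    prev_audio, prev_text = audio, chunk

sf.write("parler_tts_long.wav", np.concatenate(outputs), SAMPLING_RATE)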
Did you manage to make it work using https://github.com/huggingface/parler-tts/pull/141? I was waiting for an example, but now, rereading the whole thread here, I'm confused. Could it be achieved using output generated from Parler with a predefined voice, for example using Will's voice and your example above, @Guppy16? I just aim to get a consistent male voice. Also, you use mini v1; isn't large supposed to be more consistent?
I've managed to get a POC with voice consistency working pretty well. Along the way, I've figured out how to do ok-ish zero-shot voice cloning, too. It took drawing on tidbits spread between several issues posted here, the HF repos, the various github sources linked here and there, and about two weeks of experimentation on my part to get going.
Here is an example of zero-shot voice cloning. Between each sentence, I alternate ground truth and Parler TTS audio between the left and right channels. I also lead the ground truth audio with an upwards tone, and Parler with a downwards tone. I did this primarily for my own purposes so I could compare them more closely myself.
The ground truth audio is from a YouTube interview found here.
Only a 5-second snippet of ground truth audio was required to do the clone. Each sentence in the audio sample is a new Parler-TTS generation using text from the audio transcript. As you can hear, the consistency is pretty good. It's even better for voices in the training dataset.
For comparison, here is an example comparing cloning vs non-cloning generation. All the settings are the same between the two, only the cloning feature being on or off differs.
Code, credits and further details forthcoming -- I have to clean up things and get rid of some bugs first for fear that the code-shamers will eat me alive. 😅