Hi @devinschumacher, @francislabountyjr, congrats on the repo!
I tried the `clone_voice.ipynb` notebook as-is, only changing the English phrase to something like `text_prompt = "Voice cloning is used to create synthetic speech that mimics a specific person's voice"`.
I wanted to try different accents, so I picked Korean and German using the embeddings provided in `./bark/assets`. The expected output audio is the `text_prompt` (in English) spoken with a Korean or German accent, respectively.
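For reference, model loading was just the notebook's preload step, roughly like this (a minimal sketch assuming the standard bark API; the notebook may pass extra options such as GPU flags, which I have omitted):

```python
# Download/load the bark models into memory before generation.
# preload_models() is the standard bark entry point; the notebook's
# exact arguments may differ (this is a sketch, not the notebook verbatim).
from bark import preload_models

preload_models()
```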
With the models loaded that way, I ran the following script:
```python
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio  # bark API, as set up in clone_voice.ipynb

text_prompt = "Voice cloning is used to create synthetic speech that mimics a specific person's voice"
language = 'de'  # or 'ko' for the Korean speakers

for i in range(10):
    voice_name = f"{language}_speaker_{i}"
    print("Using voice name:", voice_name)
    audio_array = generate_audio(text_prompt, history_prompt=voice_name,
                                 text_temp=0.7, waveform_temp=0.5)
    filepath = f"output/{language}_{i}_spk.wav"
    write_wav(filepath, SAMPLE_RATE, audio_array)
```
**Korean Results**
Out of the 10 Korean speakers in `./bark/assets`, only `ko_speaker_2` produces good results whose content matches the `text_prompt`. In the remaining outputs the content is unintelligible; it sounds vaguely Korean and is nowhere near the English content.
**German Results**
Here the results are consistent for the majority of speakers, but a couple of utterances were odd. For example, with `de_speaker_8` (apparently a woman's voice), the audio starts with a phrase unrelated to the `text_prompt`, something like "as you mentioned...", spoken by a man; it then switches to the actual `text_prompt` content in a female voice.
Any recommendations on these cases?
I am assuming this is mostly due to the quality of the speaker embeddings in `./bark/assets`. Are there any new or improved models you are using for this?
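In case it helps narrow things down, I plan to check whether the failures are just sampling variance by drawing several generations per speaker (a minimal sketch reusing the same `generate_audio` call as above; `n_tries` and the output naming are arbitrary choices of mine):

```python
# Since bark generation is sampled, draw several candidates per speaker
# to see whether the bad outputs are consistent or just unlucky draws.
# text_prompt is the same string defined in the script above.
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio

n_tries = 3  # arbitrary
for i in range(10):
    voice_name = f"ko_speaker_{i}"
    for t in range(n_tries):
        audio_array = generate_audio(text_prompt, history_prompt=voice_name,
                                     text_temp=0.7, waveform_temp=0.5)
        write_wav(f"output/ko_{i}_try{t}.wav", SAMPLE_RATE, audio_array)
```

If the good/bad split stays the same across draws, that would point at the embeddings themselves rather than sampling noise.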