ryoppippi opened this issue 1 year ago
Here is my audio file (sorry for my bad pronunciation :( )
Same issue for me, but with using a basic voice name as well.
I have found the solution. For the voice name you should specify the full path to the voice file, not only its name. For me:
`voice_name = 'bark/bark/assets/prompts/' + "karoly" + '.npz'`
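A minimal sketch of that fix, assuming the repo's bark checkout lives under `bark/bark/` as in the line above (the `"karoly"` prompt name is the commenter's own file; substitute your own):

```python
# Hypothetical helper illustrating the fix above: refer to a saved history
# prompt by its full path, not by its bare name. The directory below is the
# layout described in this thread, not a bark API guarantee.
PROMPTS_DIR = "bark/bark/assets/prompts"

def prompt_path(name: str) -> str:
    """Full path to a saved history-prompt .npz, given its bare name."""
    return f"{PROMPTS_DIR}/{name}.npz"

voice_name = prompt_path("karoly")
print(voice_name)  # bark/bark/assets/prompts/karoly.npz
```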
I have the same issue. The above changes nothing for me. Some files work while some throw the error, even though they're recorded the same way and the same length.
When it does work, the results are not very similar to the target at all.
The error likely stems from line 499 in generate.py:

`round(x_coarse_history.shape[-1] / len(x_semantic_history), 1) == round(semantic_to_coarse_ratio / N_COARSE_CODEBOOKS, 1)`
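For context, here is a self-contained sketch of what that assert enforces. The rate constants are my reading of bark's `generation.py` and should be treated as assumptions:

```python
# Sketch of the consistency check behind the assert on line 499.
# These constants are assumed from bark's generation.py:
COARSE_RATE_HZ = 75        # coarse tokens per second
SEMANTIC_RATE_HZ = 49.9    # semantic tokens per second
N_COARSE_CODEBOOKS = 2

semantic_to_coarse_ratio = COARSE_RATE_HZ / SEMANTIC_RATE_HZ * N_COARSE_CODEBOOKS

def history_ratio_ok(n_coarse: int, n_semantic: int) -> bool:
    """A custom history prompt must keep roughly 1.5 coarse tokens per
    semantic token (per codebook), or the assert fires."""
    return round(n_coarse / n_semantic, 1) == round(
        semantic_to_coarse_ratio / N_COARSE_CODEBOOKS, 1
    )

print(history_ratio_ok(300, 200))  # True: 1.5 coarse tokens per semantic token
print(history_ratio_ok(300, 300))  # False: ratio 1.0, assert would fire
```

So a cloned prompt whose coarse and semantic token counts drift out of this ratio (e.g. from a mismatched recording length) trips the assertion.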
@darkpanther99 My `.npz` file is successfully stored in `bark/assets/prompts/`, so the path specification doesn't solve my issue. But thanks for the suggestion.
I think this is bark's problem, because this repo only adds an interface on top of bark.
As a temporary fix, you can just yank the `assert` block. Clearly that's not ideal, since it means something is wonky, but at least things start working again.
I commented out the last `and` clause in the assert, and it seems to be working for now.
Adding 'MAN: ' or 'WOMAN: ' in the text transcript helped to fix the problem for me.
Thanks. It works when I comment out the assert block, but the result is, hmm...
I've run this without any changes (both from notebook as well as from a standalone python file) and it runs without errors. The voice file sounds nothing like me though ;) I've used the same source wav to train Tortoise-TTS which sounds amazing.
I assume something's going wrong, but there's no error output and all steps complete as they should.
> it works but the results are nothing like the target voice
Are you sure you're using the bark files included with this repo, not the standard files from `pip install`? You might need to replace the files by hand.
> it works but the results are nothing like the target voice
Agreed, it barely seems to be influenced by the input samples at all. Wonder if I'm missing something.
> Agreed, it barely seems to be influenced by the input samples at all. Wonder if I'm missing something.
I've seen ~20 people all saying the same thing. I hope we are missing something but I doubt it.
The default parameters for `generate_text_semantic()` might be a bit off. I tried raising the temperature from 0.7 to 0.95, which seemed to help. May need to experiment with `top_k` and `top_p` as well.
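To make the effect of those knobs concrete, here is a toy, self-contained illustration (this is not bark code; the logits are made up) of how temperature and `top_k` reshape a next-token distribution:

```python
import math

def softmax(logits, temp=1.0):
    """Temperature-scaled softmax: higher temp flattens the distribution."""
    exps = [math.exp(l / temp) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest logits, mask the rest to -inf (probability 0)."""
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [l if l >= cutoff else float("-inf") for l in logits]

logits = [2.0, 1.0, 0.5, -1.0]   # made-up next-token scores
p_cool = softmax(logits, temp=0.7)
p_warm = softmax(logits, temp=0.95)

# Raising temp from 0.7 to 0.95 moves mass away from the argmax,
# so sampling becomes more diverse:
print(p_cool[0] > p_warm[0])               # True
print(softmax(top_k_filter(logits, k=2)))  # only two nonzero probabilities
```

`top_p` (nucleus sampling) is the analogous truncation by cumulative probability rather than by rank.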
According to the README, it currently seems that this is not allowed unless the npz is one provided by Suno.
> According to the README, it currently seems that this is not allowed unless the npz is one provided by Suno.
This repo patches in voice cloning unofficially.
When I run it, the cell runs but nothing happens... I guess I'm missing something somewhere. Oh well, back to Tortoise.
Surprisingly I managed to get all of this up and running (have literally zero experience with all this). But the output audio of a voice clone has absolutely nothing to do with the input voice. Something is not quite right.
@troed Cool. Could you tell us how you did it?
@ryoppippi Getting it to run? I just opened the notebook in Jupyter and (1) ran it there, as well as (2) copied it out to its own .py and ran it from the CLI.
But just like everybody else, my results are... bad. The generated voice sounds nothing like me (my example). Now, granted, there's an update to the repo saying that results are better with very short samples (like 2-3 s). I haven't tried that, but I also can't understand how that would really work.
@troed Oh, I see. Then what you did is the same as I did.
I am Japanese, and I tried to generate speech using the Japanese model provided by Suno. The generated voice contained a lot of noise. We may not be able to expect results as good as we hope. Maybe there are some things we can contribute to the original implementation. We shall see.
I tried cloning from a 20-second audio file. In Tortoise this works great. Here I hit the assertion error and had to comment it out. Adding "MAN: " or whatever to the prompt didn't work either. I also had to turn the notebooks into scripts, because I'm not running this on Colab or another hosted service.
The 20-second voice file was 160k; the 8-second one was 55k, more in line with the other voices. Generating a wav took about 5 GB of VRAM and 3 minutes.
I cloned a male voice and got a woman. I did it twice; the first sample sounded like a female version of the person. Weird.
For me, the solution was to limit the recording to 4 seconds. However, as others have said, the cloning doesn't work at all; I tested both English and Hindi, and it fails miserably.
I think it's going to take finetuning to get a consistent voice clone out of the current models. We are working on that now!
I don't have enough memory... it's not possible to train with only 4 GB of VRAM. Is there a way?
> I think it's going to take finetuning to get a consistent voice clone out of the current models. We are working on that now!

@francislabountyjr hi! Is there any progress on this matter?
you need an audio/text pair of less than 7 seconds
I cloned the voice like this. Then I tried to generate my voice following the notebook, like this, and it throws an error. I'm not sure what's happening.