w-okada / voice-changer

Realtime Voice Changer

[ISSUE]: Voices don't sound as they should / low GPU usage #1401

Closed ElDukoYouKnow closed 1 week ago

ElDukoYouKnow commented 2 weeks ago

Voice Changer Version

MMVCServerSIO_win_onnxgpu-cuda_v.1.5.3.18a

Operational System

Windows 11 Pro

GPU

RTX 3090

Model Type

RVC

Issue Description

Hello. The voices sound bad: there is a kind of breathing or moaning noise even when I don't speak (I have a good microphone and there is no background noise). Worse, the voices sound nothing like the characters I am trying to imitate. They all sound alike, even for very different characters such as Sukuna or Homer Simpson (male character voices sound like other male characters, and likewise for female ones). I downloaded about 20 models from AI Hub and weights.gg, but none seem to work properly. On top of that, GPU usage is unstable: it never exceeds 35% and has abrupt drops to between 4% and 25%.

As for the voice models, I read somewhere that the problem might be solved by retraining those models on my own voice and intonation, or by fine-tuning them. I would appreciate a link to a tutorial, because I can't find one. Could that be the cause, or is it a problem with the program?

Can anyone help me?

Application Screenshot

Captura de pantalla 2024-11-10 212352

Logs on console

F:\rvc\MMVCServerSIO>MMVCServerSIO.exe -p 18888 --https false --content_vec_500 pretrain/checkpoint_best_legacy_500.pt --content_vec_500_onnx pretrain/content_vec_500.onnx --content_vec_500_onnx_on true --hubert_base pretrain/hubert_base.pt --hubert_base_jp pretrain/rinna_hubert_base_jp.pt --hubert_soft pretrain/hubert/hubert-soft-0d54a1f4.pt --nsf_hifigan pretrain/nsf_hifigan/model --crepe_onnx_full pretrain/crepe_onnx_full.onnx --crepe_onnx_tiny pretrain/crepe_onnx_tiny.onnx --rmvpe pretrain/rmvpe.pt --model_dir model_dir --samples samples.json
Booting PHASE :main
PYTHON:3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Activating the Voice Changer.
[Voice Changer] download sample catalog. samples_0004_t.json
[Voice Changer] download sample catalog. samples_0004_o.json
[Voice Changer] download sample catalog. samples_0004_d.json
[Voice Changer] model_dir is already exists. skip download samples.
Internal_Port:18888
protocol: HTTP


Please open the following URL in your browser.
http://<IP>:<PORT>/
In many cases, it will launch when you access any of the following URLs.
http://127.0.0.1:18888/

[VCClient] Access http://127.0.0.1:18888/
[VCClient] wait web server...0 http://127.0.0.1:18888/
gin_channels: 256 self.spk_embed_dim: 109
[Voice Changer] generate new embedder. (no embedder)
[Voice Changer] Loading index...
Try loading... model_dir\9\model.index
[VCClient] wait web server... done 200
[2024-11-10 21:27:00] connet sid : cEWNMlTZOm3VgbiuAAAB
[2024-11-10 21:27:00] connet sid : BnWM7tzO7iV_JYeZAAAD

Kuuko-fokkusugaru commented 2 weeks ago

The extra sounds, like moaning or noise, are just the software picking up whatever else is in the background, including breathing or accidentally touching the mic. I can see that you have the echo setting ticked. If there is no echo in your room, untick it and use sup 1 or sup 2 (or both) instead to suppress background noise. You can also increase S. Threshold so that it doesn't pick up sounds quieter than the threshold.
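
To illustrate why raising a silence threshold removes breathing and similar faint sounds, here is a minimal sketch of a generic level gate. It is not the voice changer's actual implementation; the function names, the frame layout, and the threshold value of 0.05 are all assumptions made for the example.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def gate(frames, threshold):
    """Replace every frame whose RMS level is below `threshold` with silence.

    Hypothetical helper for illustration only; a real application may
    measure level differently (for example in dB) and smooth transitions.
    """
    return [f if rms(f) >= threshold else [0.0] * len(f) for f in frames]

# Made-up sample data: one faint frame (breathing) and one loud frame (speech).
quiet_breath = [0.01, -0.02, 0.015, -0.01]
speech = [0.3, -0.4, 0.35, -0.25]

gated = gate([quiet_breath, speech], threshold=0.05)
# The faint frame is zeroed out; the speech frame passes through unchanged.
```

A higher threshold silences more of the quiet input, which is also why a threshold set too high can swallow soft words.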

About retraining the models with your voice: that's not how it works, and it's not something that can be done. The models are fine, but your tune setting seems wrong. While the quality of the voice depends directly on the quality of the model itself, you need to set up the tune properly. For female to male, use a value of approximately -12. For male to female, around 12. For male to male or female to female, use a value roughly between -4 and 4. The right value depends a lot on how deep or high your own voice is compared with the model's. The closer the tune brings the output pitch to the original voice, the more accurate and realistic the result will be. I also recommend using rmvpe instead of crepe.
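
The tune values above make more sense once you see what they do to pitch. The sketch below assumes the tune setting is measured in semitones, which the thread does not state explicitly; the function names and the 220 Hz example voice are made up for illustration. In equal temperament, a shift of n semitones multiplies frequency by 2^(n/12), so 12 is one octave up and -12 is one octave down.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift of `semitones` (equal temperament)."""
    return 2.0 ** (semitones / 12.0)

def shift_pitch_hz(f0_hz, semitones):
    """Fundamental frequency after shifting `f0_hz` by `semitones`."""
    return f0_hz * semitones_to_ratio(semitones)

# Hypothetical speaker with a 220 Hz fundamental.
octave_up = shift_pitch_hz(220.0, 12)     # doubles the frequency
octave_down = shift_pitch_hz(220.0, -12)  # halves the frequency
small_shift = shift_pitch_hz(220.0, 4)    # about 26% higher
```

This is why small values in the -4 to 4 range suit same-sex conversions: they nudge the pitch rather than moving it a full octave.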

Lastly, regarding GPU usage, that's quite normal. RVC doesn't need to use 100% of the GPU to work. Also, a chunk value of 128 should be enough for good quality without a huge delay. The extra setting can easily be maxed out for higher quality.
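
As a rough way to reason about the trade-off between chunk size and delay, the sketch below converts a chunk setting into buffered audio time. Both defaults are assumptions for illustration and are not confirmed anywhere in this thread: that one chunk unit equals 128 samples, and that audio runs at 48 kHz. Check the values your own setup reports before relying on the numbers.

```python
def buffer_ms(chunk, samples_per_chunk_unit=128, sample_rate_hz=48000):
    """Milliseconds of audio held in one buffer for a given chunk setting.

    `samples_per_chunk_unit` and `sample_rate_hz` are assumed values,
    not taken from the application's documentation.
    """
    return chunk * samples_per_chunk_unit / sample_rate_hz * 1000.0

delay_128 = buffer_ms(128)  # roughly a third of a second under these assumptions
delay_256 = buffer_ms(256)  # doubling the chunk doubles the buffered time
```

Whatever the real constants are, the relationship is linear: a larger chunk gives the model more audio to work with per pass at the cost of proportionally more delay.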

ElDukoYouKnow commented 2 weeks ago

Ty! With those settings the extra noise is gone, and RMVPE works a little better than Crepe. But I already knew about the tune settings; the models don't sound similar even with that, and they don't pick up my words very well either. Is there anything else I can change in the settings?

Kuuko-fokkusugaru commented 2 weeks ago

Well, that's a complex subject. Some models simply aren't well made: the source audio is bad, or there isn't enough audio for proper training. Usually you have to do your best to imitate the person the voice comes from in order to sound closer to them. Some voices also include an index file. In those cases, increasing or maxing the index value will also "copy" most of their accent. But ultimately it depends on the quality of the model and on your performance. If it can't pick up some of what you say, lower the S. Threshold if it is above the minimum. You can also check that your mic volume is at 100%, and you can increase the input level directly in RVC. If words still get cut off or aren't picked up, increase the chunk size. But always try to speak as clearly as possible for better results.

I'd suggest trying some other models to see whether they work fine, just to rule out that your issue is something other than the models' quality. Models of politicians and the like often give the best results because there is plenty of clean speech material to train on, so they are handy for testing whether RVC is working correctly.

ElDukoYouKnow commented 2 weeks ago

Thanks, now the program picks up the audio better. As for the models, I will try one of a president as you suggest, but I have already tried quite a few models, so I don't know if it will work out.

Kuuko-fokkusugaru commented 2 weeks ago

There is something to keep in mind that I forgot to mention. Models sound better in their native language. An English model speaking Spanish may sound a bit weird or have an accent (especially if you use the index; don't use the index when a model is speaking a language other than its native one). This is because some phonemes in one language may be missing from another. The richer a language is in phonemes, the more versatile the model will be and the more languages it may work in. Most English-speaking models that you can find freely available are badly trained and give pretty bad results. They often pick up phonemes in different ways and sound completely off.

Ultimately, if you feel those issues are really caused by the software and they don't go away no matter what you try, give version 2 of the software a shot.

ElDukoYouKnow commented 2 weeks ago

Yes, yes, I downloaded models in Spanish with exactly that in mind.

I think I'm going to try that version. Do I have to download a specific one for my system? Right now I'm downloading the one named vcclient_win_cuda_2.0.72-beta.zip, which is about 3 gigabytes.

Kuuko-fokkusugaru commented 2 weeks ago

The issue is that people aren't always experienced enough (I'm not either when it comes to training models; I never have), and they often train Spanish models using an English base for the phonemes. When the model's phonemes are in a different language from the expected one, the result can be inconsistent, weird accents and sounds. But again, I am not knowledgeable enough on the matter. I personally use Japanese models, which seem to work fine for Spanish and English.

And yes, that's the right version for you, since you will be using an NVIDIA GPU.

ElDukoYouKnow commented 1 week ago

Thank you very much for all the help. I haven't tried version 2 yet; I hope it doesn't give me many problems.