w-okada / voice-changer

Realtime Voice Changer

[ISSUE]: the voice sounds lispy and inauthentic #1301

Open awarenessaspie opened 1 month ago

awarenessaspie commented 1 month ago

Voice Changer Version

MMVCServerSIO_win_onnxgpu-cuda_v.1.5.3.18a.zip

Operational System

Windows 11

GPU

Nvidia Geforce RTX 4060

Read carefully and check the options

Model Type

MMVC

Issue Description

No response

Application Screenshot

No response

Logs on console

C:\Users\User\Desktop\MMVCServerSIO>MMVCServerSIO.exe -p 18888 --https false --content_vec_500 pretrain/checkpoint_best_legacy_500.pt --content_vec_500_onnx pretrain/content_vec_500.onnx --content_vec_500_onnx_on true --hubert_base pretrain/hubert_base.pt --hubert_base_jp pretrain/rinna_hubert_base_jp.pt --hubert_soft pretrain/hubert/hubert-soft-0d54a1f4.pt --nsf_hifigan pretrain/nsf_hifigan/model --crepe_onnx_full pretrain/crepe_onnx_full.onnx --crepe_onnx_tiny pretrain/crepe_onnx_tiny.onnx --rmvpe pretrain/rmvpe.pt --model_dir model_dir --samples samples.json
Booting PHASE :main
PYTHON:3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)]
Activating the Voice Changer.
[Voice Changer] download sample catalog. samples_0004_t.json
[Voice Changer] download sample catalog. samples_0004_o.json
[Voice Changer] download sample catalog. samples_0004_d.json
[Voice Changer] model_dir is already exists. skip download samples.
Internal_Port:18888
protocol: HTTP


Please open the following URL in your browser.
http://<IP>:<PORT>/
In many cases, it will launch when you access any of the following URLs.
http://127.0.0.1:18888/

[VCClient] Access http://127.0.0.1:18888/
[VCClient] wait web server...0 http://127.0.0.1:18888/
[Voice Changer] generate new embedder. (no embedder)
[Voice Changer] Loading index...
[Voice Changer] Index file is not found
[VCClient] wait web server... done 200
[INFO] [DSH] voice-changer-native-client.exe
[INFO] [DSH] Creating WndMsg Listener Window
[INFO] [DSH] Get number of capabilities
[INFO] [DSH] Get stream caps: 0
[INFO] [DSH] Get stream caps: 1
[INFO] [DSH] Get stream caps: 2
[INFO] [DSH] Get stream caps: 3
[INFO] [DSH] Get stream caps: 4
[INFO] [DSH] Get stream caps: 5
[INFO] [DSH] Get stream caps: 6
[INFO] [DSH] Get stream caps: 7
[INFO] [DSH] Get stream caps: 8
[INFO] [DSH] Get stream caps: 9
[INFO] [DSH] Get stream caps: 10
[INFO] [DSH] Get stream caps: 11
[INFO] [DSH] Get stream caps: 12
[INFO] [DSH] Get stream caps: 13
[INFO] [DSH] Get stream caps: 14
[INFO] [DSH] Get stream caps: 15
[INFO] [DSH] Get stream caps: 16
[INFO] [DSH] Get stream caps: 17
[INFO] [DSH] Get stream caps: 18
[INFO] [DSH] Get stream caps: 19
[INFO] [DSH] Get stream caps: 20
[INFO] [DSH] Get stream caps: 21
[INFO] [DSH] Get stream caps: 22
[INFO] [DSH] Get stream caps: 23
[INFO] [DSH] Get stream caps: 24
[INFO] [DSH] Get stream caps: 25
[INFO] [DSH] Get stream caps: 26
[INFO] [DSH] Get stream caps: 27
[INFO] [DSH] Get stream caps: 28
[INFO] [DSH] Get stream caps: 29
[INFO] [DSH] Get stream caps: 30
[INFO] [DSH] Get stream caps: 31
[INFO] [DSH] Get stream caps: 32
[INFO] [DSH] Destroying parent object
[INFO] [DSH] Destroying WndMsg Listener Window
[INFO] [DSH] Destroyed window
[INFO] [DSH] Unregistered window class

awarenessaspie commented 1 month ago

I forgot to describe the issue. My natural voice has no lisp and sounds authentic. But no matter which settings I use, the converted voice sounds lispy and inauthentic.

Kuuko-fokkusugaru commented 1 month ago

This will depend heavily on the voice model. Not all of them are well trained, and not all of them sound believable. Can I ask which model you used, and could you share a screenshot of all your settings in the software?

awarenessaspie commented 1 month ago

Of course! The language I use is Turkish, and there is also a Turkish model. Here are my settings, attached as a screenshot.

awarenessaspie commented 1 month ago

I should also mention that all models, including Turkish models, have a lisp and sound inauthentic.

Kuuko-fokkusugaru commented 1 month ago

Does increasing the CHUNK to something way bigger, even if you get a higher delay, fix the issue or does it remain?

awarenessaspie commented 1 month ago

No, it does not fix the problem; it remains.

awarenessaspie commented 1 month ago

I should point out that there is no problem with low-volume voice models such as ASMR, but with most models I still have the problems I mentioned.

Kuuko-fokkusugaru commented 1 month ago

Since the software itself is working, and assuming that your microphone is of good enough quality, my guess is that your issue is mostly a mismatch between your pronunciation and the model. Every language has a different set of phonemes: some have a lot of them, others not so many. Most models are trained on a Japanese or English base. That means that, in the training process, speech is mapped to the words and phonemes of that language. If your language is rich in phonemes and the model lacks some of them, the missing ones may get replaced by whatever the software considers the closest match, which is not always a good one. Often, models for other languages are still trained with English or Japanese datasets, which means that phonemes are mapped wrongly and the result is even worse. As an example, I have seen a lot of Japanese voices trained as if they were English, with awful results. So if your pronunciation is far from what the model "knows", the results can be disappointing. Your best case would be a model trained in Turkish on a Turkish voice.

Pronunciation and accent are also heavily affected by the index file. Using an English voice with an index file and a setting of 1.0 will result in a 100% English accent, which is often a problem if you are not speaking English at all. Setting the index to 0 eliminates the accent, so it is easier to use the voice in different languages, but there are still limitations. In my case, Japanese is very close in pronunciation to my main language, so it sounds perfect with my voice model, and there aren't any issues with English either. So try different language models and see if they give you better results. This is definitely hard to fix at the software level, but you can try different settings for the F0 detection, like crepe.

I have also noticed that the sound threshold is maxed. That means that you may need to speak quite loudly for the software to actually pick up your voice. This setting is useful if you don't want the software to pick up environmental noise, someone speaking nearby, a TV, etc., but you shouldn't need to max it. I'd recommend testing with it at minimum and using sup 1 to get rid of noise, adding sup 2 if sup 1 isn't enough. It could be that certain sounds fall below the threshold and are ignored by the software.

I also noticed that gain is 0.8 for both input and output. Only in very rare cases would you want those below 1. I doubt that your mic input is too loud at 1, and the same goes for the output, which you are more likely to need to increase. And maybe at 0.8 the mic is missing some information.
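To illustrate how a high threshold combined with a reduced input gain could swallow quiet sounds, here is a minimal sketch. It assumes the threshold works as a simple RMS gate on each audio chunk; how the software actually measures level is not stated in this thread. The function names, threshold values, and test signals are all hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square level of a chunk of samples in the range [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def passes_gate(samples, input_gain, threshold):
    """Return True if the chunk, after input gain, reaches the threshold.

    Assumption: the gate compares chunk RMS with the threshold.
    This is an illustration, not the software's actual logic.
    """
    scaled = [s * input_gain for s in samples]
    return rms(scaled) >= threshold

def sine_chunk(amplitude, n=480, cycles=5):
    """Synthetic test chunk: a sine wave with the given peak amplitude."""
    return [amplitude * math.sin(2 * math.pi * cycles * i / n) for i in range(n)]

# Hypothetical signals: a loud vowel and a quiet consonant such as "s".
loud_vowel = sine_chunk(0.30)
soft_consonant = sine_chunk(0.02)

# Hypothetical settings: (input gain, threshold).
for gain, threshold in [(0.8, 0.02), (1.0, 0.001)]:
    print(
        f"gain={gain} threshold={threshold}:",
        "vowel passes" if passes_gate(loud_vowel, gain, threshold) else "vowel dropped",
        "|",
        "consonant passes" if passes_gate(soft_consonant, gain, threshold) else "consonant dropped",
    )
```

With the higher threshold and the lower gain, only the loud chunk gets through, while the quiet one is dropped. Sounds like "s" are typically much quieter than vowels, so a gate of this kind would affect them first.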

Alternatively, if none of the above works, besides working on your pronunciation to make it a bit more "clear" for the software, you could also try v2 of the software. It's still in development and it could have a higher performance impact, but it also includes a new F0 detection method named fcpe that you can try. Maybe you will have a bit more luck with that one.

You can also try downloading v1 again and starting over: decompress it to a new folder and test with the default settings, changing only the GPU selection.

https://huggingface.co/wok000/vcclient000/tree/main

awarenessaspie commented 1 month ago

Very nice. I'm downloading v2 right away. I followed what you said, and I think I will have to use the Turkish model. When I trained my voice a little bit, I noticed an improvement. Can I provide feedback after downloading v2? After that, I will ask you to close the issue.

awarenessaspie commented 1 month ago

I just downloaded it (screenshot attached). It's better than before, but I can't hear my voice. What settings do you recommend?

Kuuko-fokkusugaru commented 1 month ago

You have set the input to 0.7. This is the volume of your mic, and 1.0 is usually fine. You have also set the output to 0.1, which means the output volume is at its minimum. Set it to 1.0 to start.

You have set the output and the monitor to the same device. You can't do that. The output device should be the one you want to send the converted voice to; usually you would put a virtual audio cable here. The monitor is ONLY for choosing a device on which you want to hear the result, so you can test whether it's working properly. Usually you set your headset or speakers here so you can listen to yourself. This setting has no effect on the output; it's only for preview, so you would most likely set it to none when you don't want to hear yourself.

The gain sliders for output and monitor are just their volumes. Setting the monitor to the same volume as the output gives you an exact idea of how loud the other person will hear you. They have independent volume controls because, if you leave the monitor active, you may want it to be quiet, 0.1 - 0.5 for example, so you don't hear yourself too loudly, which could be distracting. In your screenshot, the monitor and the output are set to the same device, which doesn't make sense. And since one is at minimum volume, I guess that's why you can't hear anything.

So, to recap: set input and output gain to 1.0. Set the volume of your speakers or headset to something normal, neither too loud nor too low. Now click the pass-through button to hear your own voice and speak normally. If it sounds clear, leave it like that. If it sounds too quiet, either increase the mic volume in the Windows settings until it sounds right or use the input gain to increase it. Once your mic volume is fine, click the pass-through button again to hear your converted voice. Set output to your speakers or headset but set monitor to none. Speak and adjust the output gain to your desired volume, the volume at which you want others to hear you. Once you finish this, set output to your virtual audio cable, and from now on use monitor only when you want to hear yourself. In your recording or communication software, you should use the other end of the virtual audio cable. And that's it. That should work.
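The checks described in this comment can be sketched as a small settings validator. This is only an illustration: `AudioSettings`, `check_settings`, the device names, and the monitor gain value are hypothetical and are not part of the software. Only the 0.7 input gain, the 0.1 output gain, and the shared output/monitor device come from the thread.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AudioSettings:
    """Hypothetical model of the audio settings discussed above."""
    input_gain: float
    output_gain: float
    monitor_gain: float
    output_device: str
    monitor_device: Optional[str]  # None means monitoring is turned off

def check_settings(s: AudioSettings) -> List[str]:
    """Return warnings for the problems pointed out in the comment above."""
    warnings = []
    if s.monitor_device is not None and s.monitor_device == s.output_device:
        warnings.append("output and monitor use the same device")
    if s.output_gain <= 0.1:
        warnings.append("output gain is at or near its minimum")
    if s.input_gain < 1.0:
        warnings.append("input gain is below 1.0; the mic may be too quiet")
    return warnings

# Settings resembling the ones reported (device names and monitor gain are made up).
reported = AudioSettings(0.7, 0.1, 1.0, "Headphones", "Headphones")
for warning in check_settings(reported):
    print("warning:", warning)

# Settings following the recap: gains at 1.0, output to a virtual cable,
# monitor on the headset at a low volume.
suggested = AudioSettings(1.0, 1.0, 0.3, "Virtual Audio Cable", "Headphones")
print("suggested settings warnings:", check_settings(suggested))
```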

awarenessaspie commented 1 month ago

Thank you so much for your concern and patience! It worked and the issue can be locked. Thank you very much!

Kuuko-fokkusugaru commented 1 month ago

Now that you have fixed your issue with v2, does using similar settings in v1 fix those issues too? Have you tried using the same voice model, similar chunk and extra sizes, etc.? I'm asking because you can use very similar settings in both, and I would like to know whether, even with the same settings, the same F0 detection, and the same voice model, it still sounds better in v2 than in v1. I'm just curious how much of a difference v2 makes. And you're welcome. My pleasure to help you ☺

awarenessaspie commented 1 month ago

Of course, I would like to share my experience with v2. The voice sounds more authentic than in v1, although v2 doesn't completely solve the problem when I try the same voice model. The sound is still a little bit lispy at times, but it works better than v1. In short, v2 provides more realistic voice changing than v1. I want to thank you again <3

Kuuko-fokkusugaru commented 1 month ago

Thank you for the feedback, it's really interesting.

blackvikingx commented 1 month ago

What is v2?

Kuuko-fokkusugaru commented 1 month ago

All the versions that start with a 2 instead of a 1. The latest ones are 2.0.56. Then you pick the Windows or Mac version, and std for standard (CPU, Intel, AMD) or CUDA for Nvidia cards.
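As a small illustration of the "starts with a 2" rule, the sketch below pulls the major version out of a name. The helper names `major_version` and `is_v2` are hypothetical. The only inputs used are the v1 file name from the original report and the version number 2.0.56 mentioned above; the naming of the actual v2 download files is not assumed.

```python
import re
from typing import Optional

def major_version(name: str) -> Optional[int]:
    """Return the first number of the first dotted version found in name."""
    match = re.search(r"(\d+)(?:\.\d+)+", name)
    return int(match.group(1)) if match else None

def is_v2(name: str) -> bool:
    """True if the version found in name starts with a 2."""
    return major_version(name) == 2

for name in ["MMVCServerSIO_win_onnxgpu-cuda_v.1.5.3.18a.zip", "2.0.56"]:
    print(name, "-> v2" if is_v2(name) else "-> not v2")
```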