Closed CalVulpes closed 1 month ago
How was it configured in v1?
chunk, extras, Crossfade, Trancate, SilenceFront, Protect, index
The robotic voice sounds very similar to the bug that v2 had before. I do not know if it's the same though. The last "good" version that I tried was 2.0.58. Does it happen on that version too or only in the latest 2.0.60? The differences in your comparison are huge. I would like to know which F0 detection option you used and if the models were PTH or ONNX because, in the past, I had issues with ONNX models sounding robotic.
This was done in 2.0.58 (didn't realize 2.0.60 literally released then and there but I am definitely going to try it there as well).
In V1 it was configured as follows:
Index: 0.6
Chunk: 40 (106.7ms, 5120)
Extra: 131072
Sound Threshold: 0
Protocol: sio
Overlap: 4096
Start: 0.1
End: 0.9
Trancate: 100
SilenceFront: on
Protect: 0.5
In V2 it was configured as follows:
Index: 0.6
Chunk: 4800
Extra: 131040
Protocol: sio
Noise Gate: -120
(no pass filters)
Both used no Echo/Sup1/Sup2 and used rmvpe_onnx. The file is a PTH file made using the merge lab of V1.
I figured it would be crossfade or one of V1's advanced settings since when I set the crossfade chunk overlap lower I achieved different but nonetheless similar results as follows (overlap @ 128): https://github.com/user-attachments/assets/bb1660c3-5fd4-430f-972f-3bb3233d8e02
EDIT: fixed some grammer :p
Just tried 2.0.60 out! I don't know why but it somehow got rid of a lot of the artifacts? It might have just been a freak accident since I have V1 and V2 on the same computer, or possibly something with my session. Reinstalled 2.0.58 and it works better now as well! Sorry for that -- probably should have just restarted my computer then and there and tried it out again.
However, there is still a noticeable difference between V1 and V2 from what I think is the advanced settings customization currently missing with V2, especially crossfade.
I recorded some more samples with the same voice and the exact same settings as before. The added V1 "reduced crossfade" versions were set to 1024 overlap to better match V2's artifacts.
V1 Samples
Sample Text with Sigh (Crossfade 4096) https://github.com/user-attachments/assets/fac1486e-53f2-45e4-88a8-5111c79adedb
Sample Text with Sigh (Crossfade 1024) https://github.com/user-attachments/assets/ef70b998-9bf3-493e-86d0-6333337c267c
Sample Longer Sounds (Crossfade 4096) https://github.com/user-attachments/assets/bd5b040c-69db-44f6-9a34-250100010e3a
Sample Longer Sounds (Crossfade 1024) https://github.com/user-attachments/assets/a0d9b766-31e5-45d0-9b19-ea7090e3a60d
V2 Samples
Sample Text with Sigh https://github.com/user-attachments/assets/77713431-a849-4248-b0d6-13f8f61519fc
Sample Longer Sounds https://github.com/user-attachments/assets/0b4ebcce-aecd-40c5-a599-4840b36d1213
Of note, you hear the differences much clearer in longer sounds and "breathing". I'm guessing this might be due to the chunks being much easier to hear when pitch/voice is "stable", which the crossfade might help hide.
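To illustrate why a longer crossfade could hide the chunk boundaries on held sounds, here is a minimal sketch. It assumes a plain linear fade and two made-up chunks of a steady signal that disagree slightly in level; the voice changer's actual window and chunk handling may differ, and `crossfade` and `largest_step` are hypothetical helper names.

```python
def crossfade(prev_tail, next_head):
    """Linearly blend the tail of one chunk into the head of the next.

    A linear fade is an assumption made for illustration only.
    """
    n = len(prev_tail)
    if n != len(next_head):
        raise ValueError("overlap regions must have the same length")
    out = []
    for i in range(n):
        t = (i + 0.5) / n  # position inside the overlap, between 0 and 1
        out.append(prev_tail[i] * (1.0 - t) + next_head[i] * t)
    return out


def largest_step(samples):
    """Largest jump between neighbouring samples (a rough 'click' measure)."""
    return max(abs(b - a) for a, b in zip(samples, samples[1:]))


# Two hypothetical chunks of a held note; chunk_b starts `overlap`
# samples before chunk_a ends and comes out at a slightly lower level.
overlap = 64
chunk_a = [0.50] * 256
chunk_b = [0.30] * 256

# No crossfade: switch from one chunk to the other at the boundary.
hard_cut = chunk_a + chunk_b[overlap:]

# Crossfade: blend the overlapping region instead of switching.
smoothed = (
    chunk_a[:-overlap]
    + crossfade(chunk_a[-overlap:], chunk_b[:overlap])
    + chunk_b[overlap:]
)

print("largest step without crossfade:", largest_step(hard_cut))
print("largest step with crossfade:   ", largest_step(smoothed))
```

In this toy case the level mismatch is spread over the whole overlap instead of landing on a single sample, which is consistent with the boundaries being easier to hear at overlap 1024 than at 4096.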
EDIT: Forgot to clarify, the V2 samples were done in 2.0.60!
Did you use an audio file for testing or did you use the mic live?
I used a live mic and Voicemeeter Potato to record it as a .wav then converted to .mp4. In hindsight, probably should have just used an audio file and outputted to the client. Even so, I think the artifact differences between V1, V2, and the lower crossfade settings are clear.
The reason I asked is that we never speak exactly the same way each time. For reliable test results, record a wav straight from your mic. You can select "file" in the input field in RVC, which opens a file browser window to look for the wav file. You will get a little time bar and even a pause/play button. This way the input is always the same, so you can better compare the output results.
With that in mind, just prepared two sets of samples with identical audio files! Same settings as the previous one.
V1/1.5.3.18 (Crossfade = 4096)
Sample Text https://github.com/user-attachments/assets/9dafaf48-bd47-4b97-b859-8cdcb6939964
Longer Noises https://github.com/user-attachments/assets/79c5a647-3549-4a5b-bdc3-434ccfc32f54
V1/1.5.3.18 (Crossfade = 1024)
Sample Text https://github.com/user-attachments/assets/5c333608-5c99-491a-af2f-29f4a7629693
Longer Noises https://github.com/user-attachments/assets/fafe0c4b-ec2e-4c75-8b86-34810c98f78b
V2/2.0.60
I am sorry for bothering you, but have you tried different F0 detection methods? For example, since your model is PTH, you could try the new (old) rmvpe (reintroduced in 2.0.60). You could also try converting your PTH model to ONNX and using rmvpe_onnx.
Yes, v2 is worse than v1 with the same chunk and extra.
@Jeffrey-deng
How has it gotten worse? As @Kuuko-fokkusugaru mentioned, the audio quality of v2 has significantly improved compared to the initial version. It seems that CalVulpes' issue was resolved by restarting the PC. It might be an issue dependent on the PC.
@CalVulpes Thank you for the information.
You mentioned that there are differences between v2 and v1, but since random values are used to generate the audio, it's not guaranteed to produce the same result even with the same input.
Having listened to the provided audio, I can't perceive a significant difference between V1/1.5.3.18 (Crossfade = 4096) and V2/2.0.60. Perhaps my ears aren't that sensitive, and that's why I feel this way. Do you think there is a clear difference in quality?
Additionally, if you want to change the crossfade in the current version, you can modify the vc_conf.json file in the settings folder before launching the application. (However, this method is not convenient, so if it seems to be effective, we will restore the GUI in the next version.)
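For anyone who prefers to script the change, a minimal sketch of editing a value in a JSON settings file follows. The key name `crossfade_sec` is hypothetical and the key is assumed to sit at the top level of the file; open vc_conf.json and check the real name and layout first. Make the change while the application is closed.

```python
import json
from pathlib import Path


def set_crossfade(conf_path, value, key="crossfade_sec"):
    """Set `key` to `value` in a JSON config file and return the old value.

    `key` is a hypothetical name used for illustration; the real
    vc_conf.json may use a different name or nest it deeper.
    """
    path = Path(conf_path)
    conf = json.loads(path.read_text(encoding="utf-8"))
    if key not in conf:
        raise KeyError(f"{key!r} not found in {path}; check the real key name")
    old = conf[key]
    conf[key] = value
    path.write_text(json.dumps(conf, indent=2), encoding="utf-8")
    return old


# Example call (path relative to the application folder, key name assumed):
# set_crossfade("settings/vc_conf.json", 0.1)
```

Raising an error when the key is missing avoids silently adding a setting the application would never read.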
After the restart, when comparing V1 and V2, V2 is ever so slightly noisier, but I also notice the lower pitches are more audible. Both are likely from the new volume tune options (haven't experimented much with those yet) and are probably fixable with the new low/high-pass filters and by adjusting the volume tune a bit.
However, especially with longer noises the difference between chunks is a lot more noticeable. For both it sounds like I'm almost speaking into a fan, but for V1 the "chunk cut-off sound" is a lot smoother and less emphasized compared to V2.
Still, I'll definitely mess with the vc_conf.json file and the F0 methods later and see if I can resolve the differences that way! Also, thank you for looking into the request :D
Just doubled the crossfade from 0.05 to 0.1 in vc_conf.json (presumably this matches up with the 4096 setting on V1), and it pretty much covers up the "chunk cut-off sound" to about the same extent as my settings on V1 did! Here is the sample with "Longer Noises" on V2 with the new crossfade setting...
https://github.com/user-attachments/assets/4cff5d1b-590d-43ee-9615-cfb45f199491
Thank you for pointing that file out!
EDIT: fixed some spelling errors
Great. I see, that seems effective. In the next version, we will make it possible to set the Crossfade value via the GUI.
Are there any drawbacks to setting it to 0.1 instead of 0.05? Does this value really match the v1 value?
I was just assuming this goes by chunk size (4096 ~ 0.1, probably closer to 0.9) based on the chunk size options in the voice changer itself. I'm not entirely sure though, just working off assumptions there. I probably should have clarified.
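As a rough check of that assumption, the V1 chunk label "40 (106.7ms, 5120)" implies a 48 kHz sample rate (5120 samples / 0.1067 s). The sketch below converts between samples and seconds at that rate. It assumes the vc_conf.json crossfade value is in seconds and that V2 runs at the same rate, neither of which is confirmed in this thread.

```python
SAMPLE_RATE = 48_000  # inferred from V1's "40 (106.7ms, 5120)" chunk label


def samples_to_seconds(samples, sample_rate=SAMPLE_RATE):
    """Duration in seconds of a given number of samples."""
    return samples / sample_rate


def seconds_to_samples(seconds, sample_rate=SAMPLE_RATE):
    """Nearest whole number of samples for a duration in seconds."""
    return round(seconds * sample_rate)


# V1 overlap settings used in the samples above, expressed in seconds.
for overlap in (1024, 4096):
    print(f"overlap {overlap} samples = {samples_to_seconds(overlap):.4f} s")

# V2 crossfade values tried in vc_conf.json, expressed in samples
# (only meaningful if the value really is in seconds).
for sec in (0.05, 0.1):
    print(f"crossfade {sec} s = {seconds_to_samples(sec)} samples")
```

Under those assumptions, 4096 samples is about 0.085 s, so 0.1 would be a slightly longer crossfade than the V1 setting, and 0.05 would sit between V1's 1024 and 4096 overlaps.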
The way the voice chunks cut off still sounds different, but it has the same "smoothness" as it did on my V1 settings. After increasing the crossfade, the remaining difference sounds like it's just the lower notes being more prevalent than in V1, which isn't necessarily a bad thing (plus it's something fixable by lowering bass or upping treble in an external program). I usually change the crossfade for certain voices since different values work better or worse with different voices :p
I re-implemented the crossfade setting under the advanced settings in v.2.0.61-alpha.
Please try.
No update. Closing.
In a few words, describe your idea
V1 Features such as Crossfade, Trancate, SilenceFront, and Protect
More information
V2 does wonders for CPU/GPU usage and latency; however, it has significantly worse and glitchier output quality, likely as a result of either missing these features or being unable to customize them. This is especially noticeable at smaller chunk sizes and when directly compared with V1's output quality.
A quick comparison of the two with close-enough chunk, extras, and noise suppression settings:
V1: https://github.com/user-attachments/assets/be2e2202-e2e7-488e-b0ca-733adfd1b65a V2: https://github.com/user-attachments/assets/53aac78d-74a2-4120-8c4c-9b3fffdb7d10
I understand that this might be planned; however, I couldn't find anything stating this directly, so I've submitted this as a feature request here :)