Closed grimoire-vc closed 1 year ago
Tested locally and I can reproduce this issue as well on the newest version.
I looked at the part that seems related to VCClient, but there appears to be no problem. https://github.com/w-okada/voice-changer/blob/b4555a6bebf822e9c20edd541d06428605392dc9/server/voice_changer/RVC/RVC.py#L131
When RVC creates training data, silence above a certain length is trimmed before learning. This may be a problem specific to the RVC model, so it would help identify the cause if you insert 1 second of silence before the audio you want to convert in RVC-webUI and check whether the same phenomenon occurs.
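The suggested experiment can be sketched in a few lines. The helper below is hypothetical (it is not part of voice-changer or RVC); it simply pads a 16-bit WAV with one second of digital silence using only the standard library, so you can compare converted output with and without leading silence:

```python
# Hypothetical helper for the experiment above: prepend N seconds of digital
# silence to a 16-bit PCM WAV before converting it in RVC-webUI.
import wave

def prepend_silence(src_path, dst_path, seconds=1.0):
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    # One frame = sampwidth bytes per channel; zero bytes are digital silence.
    pad_frames = int(params.framerate * seconds)
    pad = b"\x00" * (pad_frames * params.sampwidth * params.nchannels)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # nframes is fixed up automatically on close
        dst.writeframes(pad + frames)
```

Running the RVC-webUI conversion on the padded file and comparing it with the unpadded result should show whether the quiet-start effect tracks the leading silence.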
The models I create are ones where I truncate all silence in the file to < 200ms of silence, to ensure there's no stretches of silence. Yet even with those models I get this issue.
I gave that a shot and the volume seems consistent. I did the same thing as before, taking a clip of myself saying "123" and placing it three times, spaced out. The recording has 1.5s of silence before the first speech, another 1.5s before the second, and about 9.5s of silence before the third, inferred in the RVC 7-17-23 beta web-ui.
https://vocaroo.com/1l2W1jnWsQEL
Are there any specific settings in the UI that might have a similar effect to extra data length? I could try that. This is the model I'm using for these.
In advanced settings you can adjust the initial silence duration; you could try lowering that at the cost of some performance (the defaults are 0.8 start and 1.0 end). Also, some default microphone enhancements add silence themselves (noise-canceling enhancements mostly cause this).
@grimoire-vc So, can I consider this problem resolved?
@w-okada I'm still able to replicate the issue consistently with different voices, and it seems to happen only in voice-changer; I haven't been able to replicate the effect in RVC. I just tried @ChinatsuHS's suggestions and they didn't resolve the issue for me. I adjusted several values in advanced settings, including setting the crossfade start and end to different values (0.1 and 0.1, among other combinations), toggling SilenceFront off and on, and setting Trancate to 1, but the issue still occurred. I also went into the Windows control panel and disabled audio enhancements for the devices being used.
Are you able to replicate the issue? It seems consistent for me on PyTorch models at 131k extra data length.
How many chunks are you setting?
64 is what I used for the provided samples, but I see the same effect on higher chunks. I've tried up to 512.
An interesting thing I've started noticing related to this is that it can also pop up at 65k extra data length, but it seems less pronounced than at 131k.
Is it something you've been able to reproduce @w-okada?
I have also noticed it occasionally with 65k recently. It seems less pronounced and less consistent when it happens, but it does happen from time to time. I tried to reproduce it to share here, but I wasn't able to record it happening. It's 100% consistent with 131k, but seems rarer with 65k.
Just heard from someone that using extra inference time in the RVC realtime GUI also causes the quieting issue. I'll test it more soon to confirm and see if it's the same thing, but if that's right it seems like it's not a specifically voice-changer thing. I'll update when I've tested it.
I had trouble getting the realtime GUI to work, so I can't confirm that, though I think it's accurate. It turns out, however, that 32kHz models do not have this problem either. Interesting outcome. With those two facts I'm fairly convinced that it's not specific to voice-changer and is just an outcome of realtime use of at least 40kHz models (not sure about 48kHz).
Issue Type: Bug Report
vc client version number: MMVCServerSIO_win_onnxgpu-cuda_v.1.5.3.10b
OS: Windows 11 (build 22631)
GPU: RTX 3080 Ti
Clear setting: yes
Sample model: yes
Input chunk num: yes
Wait for a while: The GUI successfully launched.
read tutorial: yes
Extract files to a new folder: yes
Voice Changer type: RVC v2
Model type: pyTorch f0
Situation
For me and several other users, selecting 131k Extra Data Length with a pyTorch model causes the first 200ms or so of inference to be quieter than normal. I can consistently reproduce this issue by speaking, then muting my input device for 10 seconds, and then speaking again. If I speak again sooner than that, it usually has normal volume. Choosing 65k extra data length or using an onnx model prevents this issue.
I'll link two samples to demonstrate what I mean. Both were produced by taking a recording of myself saying "123" and running it through voice-changer three times. In the 131k recording, the first time has the quieted effect, the second time is after about 1.5s and has normal volume, and the third time is after about 10s and has the quieted effect again. In the 65k recording, the volume is normal all three times.
You'll notice it being quieter when I say "1" the first and third time in the 131k recording.
[Audio samples attached: "131k Extra Data Length", "65k Extra Data Length"]
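A test clip with the same silence layout as described above can be synthesized with the standard library, for anyone who wants to reproduce this without recording themselves. This is a sketch under the assumption that a plain 440 Hz tone is a good enough stand-in for speech (it may not trigger the effect the same way a voiced input does):

```python
# Sketch: build a test clip like the one described above -- three short bursts
# separated by ~1.5 s and ~9.5 s of silence -- using only the standard library.
import math
import struct
import wave

RATE = 48000  # assumed output rate; any common rate should work

def tone(seconds, freq=440.0, amp=0.5):
    """A sine burst standing in for the spoken '123'."""
    n = int(RATE * seconds)
    return [amp * math.sin(2 * math.pi * freq * i / RATE) for i in range(n)]

def silence(seconds):
    return [0.0] * int(RATE * seconds)

# 1.5 s gap, burst, 1.5 s gap, burst, 9.5 s gap, burst
samples = (silence(1.5) + tone(1.0) + silence(1.5) + tone(1.0)
           + silence(9.5) + tone(1.0))

with wave.open("gap_test.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(RATE)
    w.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))
```

Feeding gap_test.wav through voice-changer at 131k versus 65k extra data length should show whether the first and third bursts come out quieter.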
Here are labeled images of the waveforms from Audacity: [waveform screenshots attached]
You can see the difference in volume in the first and third set of waves, compared to the second set of waves where the waveforms are more similar in size.
My guess is that it's just caused by 131k using more of the preceding silence data for inference, but I'm not sure.
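If that guess is right, rough arithmetic supports it. Assuming "extra data length" is a count of input samples at the model's sample rate (an assumption on my part, not something confirmed from the code):

```python
# Rough sketch, assuming "extra data length" is a sample count at the model's
# sample rate (not confirmed; purely illustrative).
def extra_context_seconds(extra_samples: int, sample_rate: int) -> float:
    """Seconds of preceding audio pulled in as inference context."""
    return extra_samples / sample_rate

# At 40 kHz, 131072 extra samples span ~3.28 s of preceding audio, while
# 65536 span ~1.64 s, so after a ~10 s pause the 131k context window is
# entirely silence and the 65k window half as much.
print(round(extra_context_seconds(131072, 40000), 2))  # 3.28
print(round(extra_context_seconds(65536, 40000), 2))   # 1.64
```

That would be consistent with the effect being weaker and less reliable at 65k, since the silent share of the context window is smaller.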