Closed grimoire-vc closed 1 year ago
Tested locally and I can reproduce this issue as well on the newest version.
I looked at the part that seems related to VCClient, but there appears to be no problem. https://github.com/w-okada/voice-changer/blob/b4555a6bebf822e9c20edd541d06428605392dc9/server/voice_changer/RVC/RVC.py#L131
When RVC creates training data, silence above a certain length is trimmed before learning. This may be a problem specific to the RVC model, so it would help identify the cause if you insert 1 second of silence before the audio you want to convert in RVC-webUI and check whether the same phenomenon occurs.
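The suggested experiment can be sketched in a few lines. The helper below is hypothetical (it is not part of voice-changer or RVC); it simply pads a 16-bit WAV with one second of digital silence using only the standard library, so you can compare converted output with and without leading silence:

```python
# Hypothetical helper for the experiment above: prepend N seconds of digital
# silence to a 16-bit PCM WAV before converting it in RVC-webUI.
import wave

def prepend_silence(src_path, dst_path, seconds=1.0):
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    # One frame = sampwidth bytes per channel; zero bytes are digital silence.
    pad_frames = int(params.framerate * seconds)
    pad = b"\x00" * (pad_frames * params.sampwidth * params.nchannels)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # nframes is fixed up automatically on close
        dst.writeframes(pad + frames)
```

Running the RVC-webUI conversion on the padded file and comparing it with the unpadded result should show whether the quiet-start effect tracks the leading silence.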
The models I create are ones where I truncate all silence in the file to < 200ms of silence, to ensure there's no stretches of silence. Yet even with those models I get this issue.
I gave that a shot and the volume seems consistent. I did the same thing as before, taking a clip of myself saying "123" and placing it three times, spaced out. The recording has 1.5s of silence before the first speech, another 1.5s before the second, and about 9.5s of silence before the third, inferred in the RVC 7-17-23 beta web-ui.
https://vocaroo.com/1l2W1jnWsQEL
Are there any specific settings in the UI that might have a similar effect to extra data length? I could try that. This is the model I'm using for these.
In advanced settings you can adjust the initial silence duration; you could try lowering that at the cost of some performance (the defaults are 0.8 start and 1.0 end). Also, some default microphone enhancements add silence themselves (noise-canceling enhancements mostly cause this).
@grimoire-vc So, can I consider this problem resolved?
@w-okada I'm still able to replicate the issue consistently with different voices, and it seems to happen only in voice-changer; I haven't been able to replicate the effect in RVC. I just tried @ChinatsuHS's suggestions and they didn't resolve the issue for me. I adjusted several values in advanced settings, including setting the crossfade start and end to different values (0.1 and 0.1, among other combinations), toggling SilenceFront off and on, and setting Trancate to 1, but the issue still occurred. I also went into the Windows control panel and disabled audio enhancements for the devices being used.
Are you able to replicate the issue? It seems consistent for me on PyTorch models at 131k extra data length.
How many chunks are you setting?
64 is what I used for the provided samples, but I see the same effect on higher chunks. I've tried up to 512.
An interesting thing I've started noticing related to this is that it can also pop up at 65k extra data length, but it seems less pronounced than at 131k.
Is it something you've been able to reproduce @w-okada?
I have also noticed it occasionally with 65k recently. It seems less pronounced and less consistent when it happens, but it does happen from time to time. I tried to reproduce it to share here, but I wasn't able to record it happening. It's 100% consistent with 131k, but seems rarer with 65k.
Just heard from someone that using extra inference time in the RVC realtime GUI also causes the quieting issue. I'll test it more soon to confirm and see if it's the same thing, but if that's right it seems like it's not a specifically voice-changer thing. I'll update when I've tested it.
I had trouble getting the realtime GUI to work, so I can't confirm that, though I think it's accurate. It turns out, however, that 32kHz models do not have this problem either. Interesting outcome. With those two facts I'm fairly convinced that it's not specific to voice-changer and is just an outcome of realtime use of at least 40kHz models (not sure about 48kHz).
Issue Type: Bug Report
vc client version number: MMVCServerSIO_win_onnxgpu-cuda_v.1.5.3.10b
OS: Windows 11 (build 22631)
GPU: RTX 3080 Ti
Clear setting: yes
Sample model: yes
Input chunk num: yes
Wait for a while: The GUI successfully launched.
read tutorial: yes
Extract files to a new folder: yes
Voice Changer type: RVC v2
Model type: pyTorch f0
Situation
For me and several other users, selecting 131k Extra Data Length with a pyTorch model causes the first 200ms or so of inference to be quieter than normal. I can consistently reproduce this issue by speaking, then muting my input device for 10 seconds, and then speaking again. If I speak again sooner than that, it usually has normal volume. Choosing 65k extra data length or using an onnx model prevents this issue.
I'll link two samples to demonstrate what I mean. Both were produced by taking a recording of myself saying "123" and running it through voice-changer three times. In the 131k recording, the first time has the quieted effect, the second time is after about 1.5s and has normal volume, and the third time is after about 10s and has the quieted effect again. In the 65k recording, the volume is normal all three times.
You'll notice it being quieter when I say "1" the first and third time in the 131k recording.
[Audio samples attached: "131k Extra Data Length", "65k Extra Data Length"]
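A test clip with the same silence layout as described above can be synthesized with the standard library, for anyone who wants to reproduce this without recording themselves. This is a sketch under the assumption that a plain 440 Hz tone is a good enough stand-in for speech (it may not trigger the effect the same way a voiced input does):

```python
# Sketch: build a test clip like the one described above -- three short bursts
# separated by ~1.5 s and ~9.5 s of silence -- using only the standard library.
import math
import struct
import wave

RATE = 48000  # assumed output rate; any common rate should work

def tone(seconds, freq=440.0, amp=0.5):
    """A sine burst standing in for the spoken '123'."""
    n = int(RATE * seconds)
    return [amp * math.sin(2 * math.pi * freq * i / RATE) for i in range(n)]

def silence(seconds):
    return [0.0] * int(RATE * seconds)

# 1.5 s gap, burst, 1.5 s gap, burst, 9.5 s gap, burst
samples = (silence(1.5) + tone(1.0) + silence(1.5) + tone(1.0)
           + silence(9.5) + tone(1.0))

with wave.open("gap_test.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(RATE)
    w.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))
```

Feeding gap_test.wav through voice-changer at 131k versus 65k extra data length should show whether the first and third bursts come out quieter.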
Here are labeled images of the waveforms from Audacity: [waveform screenshots attached]
You can see the difference in volume in the first and third set of waves, compared to the second set of waves where the waveforms are more similar in size.
My guess is that it's just caused by 131k using more of the preceding silence data for inference, but I'm not sure.
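If that guess is right, rough arithmetic supports it. Assuming "extra data length" is a count of input samples at the model's sample rate (an assumption on my part, not something confirmed from the code):

```python
# Rough sketch, assuming "extra data length" is a sample count at the model's
# sample rate (not confirmed; purely illustrative).
def extra_context_seconds(extra_samples: int, sample_rate: int) -> float:
    """Seconds of preceding audio pulled in as inference context."""
    return extra_samples / sample_rate

# At 40 kHz, 131072 extra samples span ~3.28 s of preceding audio, while
# 65536 span ~1.64 s, so after a ~10 s pause the 131k context window is
# entirely silence and the 65k window half as much.
print(round(extra_context_seconds(131072, 40000), 2))  # 3.28
print(round(extra_context_seconds(65536, 40000), 2))   # 1.64
```

That would be consistent with the effect being weaker and less reliable at 65k, since the silent share of the context window is smaller.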