gitmylo / audio-webui

A webui for different audio related Neural Networks
MIT License

[BUG REPORT] 1.wav contains NaN #90

Closed. kamileqq closed this issue 1 year ago.

kamileqq commented 1 year ago

When I click "extract pitches", it shows "1.wav contains NaN", and the same for 59 other files; the files do have sound in them. I tried the pm and harvest extraction methods.

Expected behavior: the webui should extract pitches successfully.

Screenshots: (two images attached)

gitmylo commented 1 year ago

Convert the sounds back to wav with ffmpeg. Also, don't use pm for training; use harvest or any crepe variant.
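
For example, a minimal sketch of batch re-encoding with ffmpeg from Python (folder names are placeholders; assumes ffmpeg is on PATH):

```python
import pathlib
import subprocess

# Decode whatever format the files are in and write clean 16-bit PCM wav.
src = pathlib.Path("dataset_in")   # placeholder input folder
dst = pathlib.Path("dataset_out")  # placeholder output folder
dst.mkdir(exist_ok=True)
for f in sorted(src.iterdir()):
    out = dst / (f.stem + ".wav")
    subprocess.run(["ffmpeg", "-y", "-i", str(f), "-c:a", "pcm_s16le", str(out)],
                   check=True)
```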

kamileqq commented 1 year ago

Hello, now it has started throwing "Exception: Each element of kernel_size should be odd. Skipping" many times. (screenshots attached)

gitmylo commented 1 year ago

that might be an issue with your filter radius

kamileqq commented 1 year ago

Yeah, setting the filter radius to 0 or 1 fixed the "kernel size should be odd" issue, but I still get the NaN.
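
For reference, the error text matches scipy's `medfilt`, which presumably receives the filter radius as its kernel size here; a tiny sketch (assuming scipy) of why even values fail:

```python
import numpy as np
from scipy.signal import medfilt

f0 = np.random.rand(100)               # stand-in for an extracted pitch curve
print(medfilt(f0, kernel_size=3)[:3])  # fine: odd kernel
try:
    medfilt(f0, kernel_size=2)         # even kernel, like a filter radius of 2
except ValueError as e:
    print(e)                           # "Each element of kernel_size should be odd."
```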

galagaygay commented 1 year ago

I also have this problem; no matter what is set in the train interface, a NaN error is reported.

Windows 10, conda, Python 3.8.10

gitmylo commented 1 year ago

I can't really reproduce it. Do you have ffmpeg installed? A missing ffmpeg install could cause this. Also, can you check the files in the data/training/RVC/Test/0_16k/ folder? These should be playable wav files; if not, something went wrong earlier in the pipeline.
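
A minimal sketch of such a check (hypothetical helper; assumes `soundfile` is available, but any wav reader would do):

```python
import os
import numpy as np
import soundfile as sf

folder = "data/training/RVC/Test/0_16k/"
for name in sorted(os.listdir(folder)):
    if not name.endswith(".wav"):
        continue
    try:
        data, sr = sf.read(os.path.join(folder, name))  # decode the wav
    except Exception as e:
        print(f"{name}: unreadable ({e})")
        continue
    if len(data) == 0 or np.isnan(data).any():
        print(f"{name}: empty or contains NaN")
```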

kamileqq commented 1 year ago

(screenshots attached)

diogoalmeida1991 commented 1 year ago

Hello! The bug persists here too!

(screenshots attached)

Is there a debug mode or CLI mode to test with?

gitmylo commented 1 year ago

There is no debug mode or CLI mode. The error is printed here: https://github.com/gitmylo/audio-webui/blob/559bd7ebfab0162a353e585188a3bae9d309320b/webui/ui/tabs/training/training/rvc_workspace.py#L302 This implies that at least one of the features contains NaN.
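
A hedged sketch of the kind of guard that message implies (the actual code at that line may differ):

```python
import numpy as np

# Hypothetical helper: reject a file's extracted features if any value is NaN,
# printing the "<name> contains NaN" message seen in the screenshots.
def features_ok(name: str, features: np.ndarray) -> bool:
    if np.isnan(features).any():
        print(f"{name} contains NaN")
        return False
    return True
```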

You could try a different dataset, see if that makes a difference.

diogoalmeida1991 commented 1 year ago

I converted the files to 16000 Hz WAV in Audacity, but the problem persists.

The line which triggers the bug is `logits = model.extract_features(**inputs)`.

Below are the variables before and after this line:

The `model` variable:

```
HubertModel(
  (feature_extractor): ConvFeatureExtractionModel(
    (conv_layers): ModuleList(
      (0): Sequential(
        (0): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): Fp32GroupNorm(512, 512, eps=1e-05, affine=True)
        (3): GELU(approximate='none')
      )
      (1-4): 4 x Sequential(
        (0): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): GELU(approximate='none')
      )
      (5-6): 2 x Sequential(
        (0): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
        (1): Dropout(p=0.0, inplace=False)
        (2): GELU(approximate='none')
      )
    )
  )
  (post_extract_proj): Linear(in_features=512, out_features=768, bias=True)
  (dropout_input): Dropout(p=0.1, inplace=False)
  (dropout_features): Dropout(p=0.1, inplace=False)
  (encoder): TransformerEncoder(
    (pos_conv): Sequential(
      (0): Conv1d(768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16)
      (1): SamePad()
      (2): GELU(approximate='none')
    )
    (layers): ModuleList(
      (0-11): 12 x TransformerSentenceEncoderLayer(
        (self_attn): MultiheadAttention(
          (dropout_module): FairseqDropout()
          (k_proj): Linear(in_features=768, out_features=768, bias=True)
          (v_proj): Linear(in_features=768, out_features=768, bias=True)
          (q_proj): Linear(in_features=768, out_features=768, bias=True)
          (out_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.0, inplace=False)
        (dropout3): Dropout(p=0.1, inplace=False)
        (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (fc2): Linear(in_features=3072, out_features=768, bias=True)
        (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
    )
    (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (final_proj): Linear(in_features=768, out_features=256, bias=True)
)
```

The `inputs` variable:

```
{'source': tensor([[ 1.3542e-04,  3.9577e-04,  8.7976e-05,  ..., -2.2034e-02,
         -3.8208e-02, -4.8920e-02]], device='cuda:0', dtype=torch.float16),
 'padding_mask': tensor([[False, False, False,  ..., False, False, False]],
        device='cuda:0'),
 'output_layer': 12}
```

The resulting `logits` (all NaN; the second tensor is the padding mask, all False):

```
(tensor([[[nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          ...,
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan],
          [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0',
        dtype=torch.float16),
 tensor([[False, False, False,  ..., False, False, False]], device='cuda:0'))
```
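
As a generic illustration (not necessarily the exact failure mode on this GPU), fp16's narrow range can push intermediate values to inf, after which operations like inf - inf produce exactly this kind of all-NaN output:

```python
import torch

x = torch.tensor([300.0], dtype=torch.float16)
y = x * x    # 90000 overflows fp16's ~65504 max -> inf
print(y)     # tensor([inf], dtype=torch.float16)
print(y - y) # inf - inf -> tensor([nan], dtype=torch.float16)
```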

gitmylo commented 1 year ago

I'm still not sure if it's a hardware/software difference, or something with the audio files.

diogoalmeida1991 commented 1 year ago

I found the bug! In `def pitch_extract():`, change line 263 `model = model.half()` to `model = model.float()`, and line 285 `"source": features.half().to(device)` to `"source": features.float().to(device)`. This occurs on 16xx-series GPUs.

Source: https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/issues/204
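
A hedged sketch of the two changes, with stand-in objects instead of the real HuBERT model and features (line numbers per the comment above; the surrounding code in rvc_workspace.py is simplified here):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(4, 4)  # stand-in for the HuBERT model
features = torch.randn(1, 4)   # stand-in for the extracted features

# line 263: was `model = model.half()`
model = model.float().to(device)
# line 285: was `"source": features.half().to(device)`
inputs = {"source": features.float().to(device)}

print(model(inputs["source"]))  # fp32 end to end, avoiding the fp16 NaN path
```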

gitmylo commented 1 year ago

> I found the bug! In `def pitch_extract():`, change line 263 `model = model.half()` to `model = model.float()`, and line 285 `"source": features.half().to(device)` to `"source": features.float().to(device)`. This occurs on 16xx-series GPUs.
>
> Source: RVC-Project/Retrieval-based-Voice-Conversion-WebUI#204

Thank you so much, I'll fix this immediately.

gitmylo commented 1 year ago

It will use a bit more memory, but training uses more than that anyway, so it won't be a problem.
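
Back-of-envelope, assuming the HuBERT-base checkpoint (~95M parameters, 2 bytes per weight in fp16 vs 4 in fp32):

```python
params = 95_000_000  # assumed parameter count for HuBERT-base
print(f"fp16 weights: ~{params * 2 / 2**20:.0f} MiB")  # ~181 MiB
print(f"fp32 weights: ~{params * 4 / 2**20:.0f} MiB")  # ~362 MiB
```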