gitmylo / audio-webui

A webui for different audio related Neural Networks
MIT License

[BUG REPORT] Tensor Size error while training RVC model #112

Closed athu16 closed 1 year ago

athu16 commented 1 year ago

Describe the bug
After creating an RVC workspace and preprocessing the data, this error occurs when I click the Train button.

To Reproduce
Steps to reproduce the behavior:

  1. Create or load a workspace in the "Train" tab.
  2. Assuming you have extracted pitches, go to the "train" sub-tab and click the blue Train button.
  3. Check the log.
  4. See error.

Expected behavior
It should start training.

Log

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
v2 48k
gin_channels: 256 self.spk_embed_dim: 109
<All keys matched successfully>
<All keys matched successfully>
D:\Github\audio-webui\venv\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ..\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
torch.Size([4, 128, 28]) torch.Size([4, 128, 33])
D:\Github\audio-webui\webui\ui\tabs\training\training\rvc_workspace.py:737: UserWarning: Using a target size (torch.Size([4, 128, 33])) that is different to the input size (torch.Size([4, 128, 28])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss_mel = F.l1_loss(y_mel, y_hat_mel) * HParams.c_mel
Traceback (most recent call last):
  File "D:\Github\audio-webui\venv\lib\site-packages\gradio\routes.py", line 437, in run_predict
    output = await app.get_blocks().process_api(
  File "D:\Github\audio-webui\venv\lib\site-packages\gradio\blocks.py", line 1352, in process_api
    result = await self.call_function(
  File "D:\Github\audio-webui\venv\lib\site-packages\gradio\blocks.py", line 1093, in call_function
    prediction = await utils.async_iteration(iterator)
  File "D:\Github\audio-webui\venv\lib\site-packages\gradio\utils.py", line 341, in async_iteration
    return await iterator.__anext__()
  File "D:\Github\audio-webui\venv\lib\site-packages\gradio\utils.py", line 334, in __anext__
    return await anyio.to_thread.run_sync(
  File "D:\Github\audio-webui\venv\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "D:\Github\audio-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "D:\Github\audio-webui\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "D:\Github\audio-webui\venv\lib\site-packages\gradio\utils.py", line 317, in run_sync_iterator_async
    return next(iterator)
  File "D:\Github\audio-webui\webui\ui\tabs\training\training\rvc_workspace.py", line 737, in train_model
    loss_mel = F.l1_loss(y_mel, y_hat_mel) * HParams.c_mel
  File "D:\Github\audio-webui\venv\lib\site-packages\torch\nn\functional.py", line 3263, in l1_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "D:\Github\audio-webui\venv\lib\site-packages\torch\functional.py", line 74, in broadcast_tensors
    return _VF.broadcast_tensors(tensors)  # type: ignore[attr-defined]
RuntimeError: The size of tensor a (28) must match the size of tensor b (33) at non-singleton dimension 2
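The RuntimeError follows directly from PyTorch's broadcasting rules: two shapes are compatible only if, comparing dimensions from the trailing end, each pair is equal or one of them is 1. A minimal plain-Python sketch of that check, using the shapes printed in the log above:

```python
# Pure-Python sketch of PyTorch/NumPy broadcast compatibility:
# shapes are compared from the trailing dimension backwards, and each
# pair of sizes must be equal or contain a 1.
def broadcastable(shape_a, shape_b):
    for x, y in zip(reversed(shape_a), reversed(shape_b)):
        if x != y and x != 1 and y != 1:
            return False
    return True

# The mel shapes from the log above: 28 vs 33 at dimension 2 cannot
# broadcast, which is exactly the RuntimeError F.l1_loss raises.
print(broadcastable((4, 128, 28), (4, 128, 33)))  # False
print(broadcastable((4, 128, 28), (4, 1, 28)))    # True (size-1 dim broadcasts)
```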
gitmylo commented 1 year ago

Training 48k RVC models is not implemented yet, although the options still show up in the UI. https://github.com/gitmylo/audio-webui/issues/46
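As an illustration of how a half-implemented 48k path can produce the 28-vs-33 mismatch above: framing the same audio segment with two different hop lengths yields different mel frame counts. The hop values of 480 (48 kHz) and 400 (40 kHz) and the segment length are assumptions for this sketch, not values read from the repository:

```python
# Illustration only: why the target and predicted mels can disagree in
# frame count. Hop lengths of 480 (48k config) and 400 (40k config) are
# assumed values for this sketch, not taken from the audio-webui source.

def mel_frames(num_samples: int, hop_length: int) -> int:
    """Frames produced by a centered STFT (PyTorch's default)."""
    return num_samples // hop_length + 1

segment = 12960  # hypothetical training segment, in samples
print(mel_frames(segment, 480))  # 28 frames with the 48k hop
print(mel_frames(segment, 400))  # 33 frames with the 40k hop
```

If one mel is computed with the 48k settings and the other with the 40k settings, the L1 loss sees exactly the kind of size mismatch shown in the log.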

athu16 commented 1 year ago

Ah ok. Does v2 40k work?

gitmylo commented 1 year ago

Yep, both 40k modes work.

athu16 commented 1 year ago

Also, I found the text-to-speech menu a bit confusing at first. The Model menu lets you select between Coqui TTS and Bark. The Bark part is easy to understand, but after selecting Coqui TTS, the TTS model dropdown presents a huge list of models from Tortoise, Tacotron2, VITS, and others. I think it would help if the Model menu were expanded to let you select other backends like Tortoise and VITS, and if the TTS model dropdown showed both the model name and the backend/method used.

gitmylo commented 1 year ago

Coqui is a library used for text-to-speech. It handles Tortoise, Tacotron2, and VITS. It is listed as a model because the dropdown it appears in is meant for models.

If you want a separate implementation of Tortoise, Tacotron2, VITS, or another backend, you can create an extension that adds it to the dropdown.

Again, for clarity: Bark and Coqui are two separate implementations, but the models inside of Coqui are not. I only provide source-level extension points for specifically implemented models, not for individual Coqui models; extending those is far more likely to break in an update.