k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Implementation of VITS-2 #1508

Closed ezerhouni closed 8 months ago

ezerhouni commented 9 months ago

Hello, I am trying to implement VITS2, but I am getting the following error:

  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 38, in piecewise_rational_quadratic_transform
    outputs, logabsdet = spline_fn(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 85, in unconstrained_rational_quadratic_spline
    ) = rational_quadratic_spline(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 118, in rational_quadratic_spline
    if torch.min(inputs) < left or torch.max(inputs) > right:
RuntimeError: min(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

Do you have an idea where it might come from? I know that without code it is difficult to tell; I will open a PR with the implementation later this week. Thank you.

csukuangfj commented 9 months ago

Can you try torch.min(inputs, dim=None)?

The error shows you need to specify the dim argument for torch.min(), though your code looks correct to me.

nshmyrev commented 9 months ago

Same issue https://github.com/coqui-ai/TTS/issues/2555

It comes from the bad data file which doesn't align properly.
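If the root cause is a misaligned sample, a small pre-flight check on the data can catch it before the spline transform ever sees an empty tensor. A minimal sketch with illustrative names (this is not icefall code):

```python
import torch

def check_nonempty(inputs: torch.Tensor, mask: torch.Tensor) -> bool:
    # A bad or misaligned sample can produce an all-zero mask; selecting
    # with it yields an empty tensor, which later makes torch.min() raise
    # "Expected reduction dim to be specified for input.numel() == 0".
    selected = inputs[mask.bool()]
    return selected.numel() > 0

x = torch.randn(4, 10)
good_mask = torch.ones(4, 10)
bad_mask = torch.zeros(4, 10)  # simulates a sample with no aligned frames
```

Filtering out such samples at dataset-preparation time is usually cleaner than patching the spline code itself.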

csukuangfj commented 9 months ago

@ezerhouni

I suggest that you use https://github.com/rhasspy/piper-phonemize to convert text to tokens.

Otherwise, it may be difficult, if not impossible, to deploy the trained model with C++.

You can find pre-built wheels for Linux and Windows at https://github.com/csukuangfj/piper-phonemize/releases/tag/2023.12.5


@yaozengwei

Do you have any code to share about using piper-phonemizer to convert text to tokens?

ezerhouni commented 9 months ago

@csukuangfj Let me try torch.min(inputs, dim=None). I am trying the LJSpeech recipe for the moment with VITS-2.

csukuangfj commented 9 months ago

I am trying the LJSpeech recipe for the moment with VITS-2

Ok, but we are switching to piper-phonemize for converting text to tokens.

Hope that @yaozengwei can push the new tokenizer soon.

yaozengwei commented 9 months ago

I am trying the LJSpeech recipe for the moment with VITS-2

Ok, but we are switching to piper-phonemize for converting text to tokens.

Hope that @yaozengwei can push the new tokenizer soon.

I just uploaded the code here https://github.com/k2-fsa/icefall/pull/1511.

ezerhouni commented 9 months ago

@csukuangfj Now I am getting:

  File "/vits2/egs/ljspeech/TTS/vits2/duration_predictor.py", line 191, in forward
    z = flow(z, x_mask, g=x, inverse=inverse)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vits2/egs/ljspeech/TTS/vits2/flow.py", line 297, in forward
    xb, logdet_abs = piecewise_rational_quadratic_transform(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 38, in piecewise_rational_quadratic_transform
    outputs, logabsdet = spline_fn(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 85, in unconstrained_rational_quadratic_spline
    ) = rational_quadratic_spline(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 175, in rational_quadratic_spline
    assert (discriminant >= 0).all()
AssertionError

I will try with the new tokenizer to see if it fixes the issue
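For context, the failing assertion guards the discriminant of the quadratic that is solved when inverting the rational quadratic spline. Tiny negative values can appear from floating-point rounding even when the math is sound; a common mitigation (which does not fix genuinely invalid inputs, only rounding noise) is clamping before the square root. A sketch with made-up coefficient values:

```python
import torch

# Illustrative only: a, b, c stand for the quadratic coefficients built
# inside rational_quadratic_spline; the values here are made up.
a = torch.tensor([1.0, 2.0])
b = torch.tensor([5.0, 3.0])
c = torch.tensor([1.0, 1.0])

discriminant = b.pow(2) - 4 * a * c
# Clamp tiny negatives caused by rounding. A large negative discriminant
# indicates invalid inputs and should still be investigated upstream.
discriminant = torch.clamp(discriminant, min=0.0)

# Numerically stable form of the quadratic root used by spline flows.
root = (2 * c) / (-b - torch.sqrt(discriminant))
```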

csukuangfj commented 9 months ago

@yaozengwei Could you have a look at the above error?

yaozengwei commented 9 months ago

Hello, I am trying to implement VITS2, but I am getting the following error:

  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 38, in piecewise_rational_quadratic_transform
    outputs, logabsdet = spline_fn(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 85, in unconstrained_rational_quadratic_spline
    ) = rational_quadratic_spline(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 118, in rational_quadratic_spline
    if torch.min(inputs) < left or torch.max(inputs) > right:
RuntimeError: min(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

Do you have an idea where it might come from? I know that without code it is difficult to tell; I will open a PR with the implementation later this week. Thank you.

It seems the tensor passed to torch.min is empty.

csukuangfj commented 9 months ago

  >>> import torch
  >>> a = torch.empty((0,))
  >>> torch.min(a)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  RuntimeError: min(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

An empty tensor will indeed throw the same error.
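The check at transform.py line 118 could be made defensive against this case. A sketch (not necessarily the right fix, since an empty tensor usually signals a data problem upstream; the function name is illustrative):

```python
import torch

def inside_interval(inputs: torch.Tensor, left: float, right: float) -> bool:
    # torch.min()/torch.max() raise on numel() == 0, so handle the empty
    # case explicitly; an empty tensor has no element outside the interval.
    if inputs.numel() == 0:
        return True
    return bool(inputs.min() >= left and inputs.max() <= right)
```

With this guard, the original condition `torch.min(inputs) < left or torch.max(inputs) > right` becomes `not inside_interval(inputs, left, right)` and no longer crashes on empty input.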

ezerhouni commented 9 months ago

@csukuangfj I might have some good news but it needs a bit more testing. I will let you know next week

ezerhouni commented 9 months ago

Unrelated to VITS-2 (please tell me if you would prefer me to open a separate issue): in the VITS recipes, the input spectrogram is computed with Wav2Spec while the loss is computed with Wav2LogFilterBank. Is that on purpose?

JinZr commented 8 months ago

Hmm, I think we didn't choose this setup on purpose. @yaozengwei, am I right?

yaozengwei commented 8 months ago

Unrelated to VITS-2 (please tell me if you would prefer me to open a separate issue): in the VITS recipes, the input spectrogram is computed with Wav2Spec while the loss is computed with Wav2LogFilterBank. Is that on purpose?

We just follow the VITS paper (https://arxiv.org/pdf/2106.06103.pdf), which uses the linear spectrogram as input to the posterior encoder (Sec. 2.1.3 and Fig. 1) and mel-scale spectrograms to compute the reconstruction loss (Sec. 2.1.2).
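The split the paper describes (linear spectrogram into the posterior encoder, log-mel for the reconstruction loss) can be sketched in a few lines. The mel matrix below is a random placeholder purely to show the shapes; a real one would come from a mel filterbank:

```python
import torch

n_fft, hop = 1024, 256
n_freqs = n_fft // 2 + 1   # 513 linear-frequency bins
n_mels = 80

wav = torch.randn(16000)   # 1 s of fake audio at 16 kHz

# Linear spectrogram: posterior-encoder input in VITS (cf. Wav2Spec).
spec = torch.stft(
    wav, n_fft=n_fft, hop_length=hop,
    window=torch.hann_window(n_fft),
    return_complex=True,
).abs()                                    # (n_freqs, frames)

# Log mel features: reconstruction-loss target (cf. Wav2LogFilterBank).
# Random matrix stands in for the real (n_mels, n_freqs) mel filterbank.
mel_fb = torch.rand(n_mels, n_freqs)
log_mel = torch.log(mel_fb @ spec + 1e-6)  # (n_mels, frames)
```

So the two feature types share the same frame grid but differ in the frequency axis, which is why they can coexist in one recipe without conflict.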

ezerhouni commented 8 months ago

@yaozengwei Yes, my bad, I misunderstood part of the code.