ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

Manual Control of Phoneme Durations #89

Open · hypnaceae opened this issue 3 years ago

hypnaceae commented 3 years ago

I'd like to supply the synthesiser with custom phoneme durations (i.e. the start and end time of each phoneme), in other words bypassing phoneme duration prediction and replacing it with my own parameters. Is it possible to do this in this implementation?

leminhnguyen commented 3 years ago

Yes, you can!

https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/model/fastspeech2.py#L43-L58

Set d_targets to your custom durations (the default value is None, in which case the model will predict the durations itself).
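For reference, a minimal sketch of such a call; the argument names follow the linked forward() signature, and the preprocessing that produces speakers, texts, and src_lens is elided:

```python
# Sketch only: `my_durations` is a hypothetical tensor of per-phoneme
# frame counts aligned with `texts`; the other arguments are the usual
# batched inputs prepared by synthesize.py.
output = model(
    speakers,
    texts,
    src_lens,
    max_src_len,
    d_targets=my_durations,  # bypasses the duration predictor
)
```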

hypnaceae commented 3 years ago

Great, thanks! Can I also ask what data and type this variable accepts? Just a list of phoneme durations? I have tried a variety of types (for example, d_targets = [0.5, 0.1, 0.25] as durations for the phonemes "K AE1 T", a dict mapping phoneme to duration, etc.), but none have worked. What's the exact usage here? Thanks again.

leminhnguyen commented 3 years ago

@hypnaceae

The training data is a great example for this. During training, the ground-truth duration, pitch, and energy values are passed as d_targets, p_targets, and e_targets, so please inspect the preprocessed files (those ending in .npy) for more details.

d_targets must be an integer array in which each element gives the length (the number of mel-spectrogram frames) of the corresponding phoneme, e.g. d_targets = [3, 4, 5] for the phoneme sequence "K AE1 T".
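For example, a quick way to see the expected format (the file path here is hypothetical, following the preprocessor's {speaker}-duration-{basename}.npy naming scheme):

```python
import numpy as np

# Hypothetical example path; the preprocessor writes one such array per
# utterance under the duration/ subdirectory of the preprocessed data.
dur = np.load("preprocessed_data/LJSpeech/duration/LJSpeech-duration-LJ001-0001.npy")
print(dur)        # integer frame counts, one per phoneme
print(dur.dtype)  # an integer dtype, e.g. int64
```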

hypnaceae commented 3 years ago

Thanks. I set d_targets (model/fastspeech2.py, line 54) to your example, and I'm getting the following traceback.

>>synthesize.py --text "cat" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

Removing weight norm...
Raw Text Sequence: cat
Phoneme Sequence: {K AE1 T}
Traceback (most recent call last):
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\synthesize.py", line 214, in <module>
    synthesize(model, args.restore_step, configs, vocoder, batchs, control_values)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\synthesize.py", line 99, in synthesize
    d_control=duration_control
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\fastspeech2.py", line 91, in forward
    d_control,
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 129, in forward
    x, mel_len = self.length_regulator(x, duration_target, max_len)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 194, in forward
    output, mel_len = self.LR(x, duration, max_len)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 171, in LR
    expanded = self.expand(batch, expand_target)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 186, in expand
    expand_size = predicted[i].item()
TypeError: 'int' object is not subscriptable

It looks like predicted is taking the first value of the d_targets array, in this case 3.
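My guess from the LR() and expand() code is that duration gets indexed once per batch item and then again per phoneme, so a flat Python list can't work and d_targets presumably needs a batch dimension, something like this (untested):

```python
import torch

# Untested guess: wrap the per-phoneme frame counts in a batch dimension
# so that duration[i] yields a 1-D sequence instead of a bare int.
d_targets = torch.LongTensor([[3, 4, 5]])  # shape (1, num_phonemes)
```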

To clarify: I want to specify the number of mel-spectrogram frames on a per-phoneme basis at synthesis time. I'm also not training my own models (just using the pretrained LJSpeech model), so I don't have any .npy files to inspect.

Thanks again, you've been a big help thus far.

debasishaimonk commented 1 year ago

Has anyone made any progress on this since?