insunhwang89 / StyleVC

MIT License
29 stars 3 forks

About the speech rate of generated voice #2

Open Charlottecuc opened 2 years ago

Charlottecuc commented 2 years ago

Hi. I tested the model with the inference Jupyter notebook you provided. It's amazing that the model can still generate good voice even when a Mandarin source file is fed as input. However, I noticed that if the speech rate of the source is slow while the speech rate of the target is very fast, the speech rate of the generated voice will also be fast. I was wondering whether it is possible to tune the speech rate so that the generated voice has the same speech rate as the source. Or is the different speech rate caused by the mismatch of source language (Mandarin vs. English, for the pretrained ASR model)?

Also, I noticed that if I run inference with a noisy source file (e.g. with air-conditioning background noise), there will also be noise in the generated voice. Is there a way to remove the noise? Could you give any advice on noise-robust training/inference?

Thank you very much~ :)

insunhwang89 commented 2 years ago

Thank you for your interest in our research. You asked about two things.

  1. Can the speech rate be adjusted to match the speed of the given target speech?
    • Our model cannot control rhythm. Rhythm is the property related to the speed of speech, so the speech generated by our model follows the speed of the source speech. From the target speech, only the speaker's style features are extracted and used.
  2. When a noisy input is used, noise appears in the generated speech.
    • We did not conduct a separate experiment on noise, but let me share our experience. Since noisy speech data was not used when training the vocoder, it is inevitably vulnerable to noise. To solve this problem, the vocoder should be fine-tuned on noisy data, or a vocoder model that is robust to noise should be used instead.
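As a rough post-hoc workaround (not part of StyleVC itself), one could time-stretch the converted output back to the source's duration. The sketch below uses plain numpy linear interpolation, which is simple but also shifts pitch; a phase-vocoder stretcher (e.g. `librosa.effects.time_stretch`) would preserve pitch. The waveforms here are random placeholders, not real audio.

```python
import numpy as np

def naive_time_stretch(wav: np.ndarray, rate: float) -> np.ndarray:
    """Naively resample a waveform to change its duration.

    rate > 1 shortens (speeds up), rate < 1 lengthens (slows down).
    Note: simple interpolation also shifts pitch; use a phase vocoder
    if pitch must be preserved.
    """
    n_out = int(round(len(wav) / rate))
    src_idx = np.linspace(0, len(wav) - 1, n_out)
    return np.interp(src_idx, np.arange(len(wav)), wav)

# Match the converted speech's duration back to the source's duration.
sr = 16000
source = np.random.randn(sr * 3)     # 3 s source (placeholder audio)
converted = np.random.randn(sr * 2)  # 2 s converted output (placeholder)
rate = len(converted) / len(source)  # < 1 here, so the output is slowed down
matched = naive_time_stretch(converted, rate)
```

This only equalizes overall duration; it cannot recover per-phoneme timing, which is what rhythm-modeling approaches address.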
Charlottecuc commented 2 years ago

Thank you~

skol101 commented 2 years ago

@Charlottecuc @intory89 Then does it make sense to introduce audiomentations during vocoder training?

There was a suggestion in https://github.com/yl4579/StarGANv2-VC/issues/21 that it's the VC model that should be fed corrupted inputs, not the vocoder.
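One common way to corrupt training inputs, whether for the VC model or the vocoder, is to mix in noise at a controlled signal-to-noise ratio. A minimal numpy sketch follows; the function name is illustrative and not from StyleVC or audiomentations (which provides ready-made transforms like `AddGaussianNoise` for this purpose).

```python
import numpy as np

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into a clean waveform at a chosen SNR in dB."""
    noise = np.resize(noise, clean.shape)            # tile/trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12            # avoid division by zero
    # Scale the noise so that 10*log10(p_clean / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
noise = rng.standard_normal(16000)
noisy = add_noise_at_snr(clean, noise, snr_db=10.0)
```

During training one would typically randomize `snr_db` (e.g. uniform over 5–25 dB) so the model sees a range of corruption levels.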

skol101 commented 1 year ago

@Charlottecuc is right here -- the model doesn't follow the speed of the source speech for UNSEEN speakers.

insunhwang89 commented 1 year ago

Our model did not consider rhythm among the speaker's characteristics. Please refer to SpeechSplit for related research.

Superman-Valencia commented 1 year ago


Hi, I also tried to train the model on a Mandarin dataset, so I have some questions. Did you revise any part of the model? Does the dataset need a corresponding transcript?
