Open Charlottecuc opened 2 years ago
Thank you for your interest in our research. You asked about two things.
Thank you~
@Charlottecuc @intory89 Then does it make sense to introduce audiomentations during vocoder training?
There was a suggestion in https://github.com/yl4579/StarGANv2-VC/issues/21 that it is the VC model that should be fed corrupted inputs, not the vocoder.
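If augmentation at the VC-model input is the route taken, a minimal sketch of SNR-controlled noise injection is below. This roughly mimics what the audiomentations `AddGaussianSNR` transform does; the function name and SNR range are illustrative, not from the repo:

```python
import numpy as np

def add_noise_at_snr(clean, rng, snr_db_range=(5.0, 20.0)):
    """Corrupt a waveform with white noise at a random SNR (illustrative sketch).

    The corrupted signal would be fed to the VC encoder, while the clean
    signal is kept as the reconstruction target.
    """
    snr_db = rng.uniform(*snr_db_range)           # pick a random SNR in dB
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise
```

In practice one would apply such a transform on the fly in the training data loader, with some probability per sample, so the model also sees clean inputs.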
@Charlottecuc is right here -- the model doesn't follow the speed of the source speech for UNSEEN speakers.
Our model does not consider rhythm among the speaker's characteristics. Please refer to SpeechSplit for related research.
Hi. I tested the model with the inference Jupyter notebook you provided. It's amazing that the model can still generate a good voice even when a Mandarin source file is fed as input. However, I noticed that if the speech rate of the source is slow while the speech rate of the target is very fast, the speech rate of the generated voice will also be fast. I was wondering whether it is possible to tune the speech rate so that the generated voice has the same speech rate as the source. Or is the different speech rate caused by the language mismatch of the source (Mandarin vs. English, for the pretrained ASR model)?

Also, I noticed that if I run inference with a noisy source file (e.g. with air conditioning in the background), there is also noise in the generated voice. Is there a way to remove the noise? Or could you give any advice on noise-robust training/inference?
Thank you very much~ :)
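On the speech-rate point, one workaround outside the model is to time-stretch the converted waveform back toward the source duration. Below is a crude overlap-add (OLA) sketch; the function and its defaults are illustrative only, and a phase-vocoder stretch such as `librosa.effects.time_stretch` would give better quality on tonal content:

```python
import numpy as np

def ola_time_stretch(x, rate, frame=1024, hop=256):
    """Naive overlap-add time stretch: rate > 1 shortens the signal,
    rate < 1 lengthens it. Phase is not aligned, so expect artifacts;
    this is a sketch of the idea, not a production resampler."""
    win = np.hanning(frame)
    analysis_hop = int(round(hop * rate))     # step through the input faster/slower
    n_frames = max(1, 1 + (len(x) - frame) // analysis_hop)
    out_len = frame + (n_frames - 1) * hop
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i in range(n_frames):
        a = i * analysis_hop                  # analysis position in the input
        s = i * hop                           # synthesis position in the output
        out[s:s + frame] += x[a:a + frame] * win
        norm[s:s + frame] += win
    norm[norm < 1e-8] = 1.0                   # avoid division by zero at the edges
    return out / norm
```

The stretch ratio would be chosen as `converted_duration / source_duration` so the output matches the source's pacing.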
Hi, I also tried training the model on a Mandarin dataset, so I have some questions. Did you revise any part of the model? Does the dataset need to have the corresponding transcripts?