Hi Folks,
I am at the first step of AdaSpeech training as described in the paper: source model training. I used the LibriTTS dataset, but reduced it to half to speed up the experiment. It has 1140 speakers for training. There was a slight mismatch between the preprocessing parameters in the AdaSpeech paper and the default values provided in the code. We went with the values from the code. We trained the model for 300k steps on Colab. I am providing a screenshot of my loss profile from TensorBoard.
Please ignore the multiple colors in the graphs. While training on Colab I had to resume training several times, which produced separate log files. The more fluctuating line is the training loss, and the smoother line is the validation loss. I am also attaching output generated with inference.py for speaker ID 107 on an out-of-sample test sentence at 160k, 170k, and 210k steps. Since I cannot attach .wav/.mp3 files here (or maybe I just don't know how), I am sharing a Drive link where they are hosted. The reference audio for speaker 107 will give you an idea of how the speaker sounds.
https://drive.google.com/drive/folders/19Og2t4h2quygmrJ87xEMPoTQ7yTz9Q_e?usp=sharing
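In case a single curve is easier to read than the multi-colored one, here is a minimal sketch of how the scalars from the resumed sessions could be stitched together. It assumes each session's loss has already been exported as (step, value) pairs, for example through TensorBoard's CSV download. The function name and the numbers are made up for illustration and are not my actual losses.

```python
from typing import Iterable, List, Tuple

Point = Tuple[int, float]  # (global step, loss value)


def merge_resumed_runs(runs: Iterable[Iterable[Point]]) -> List[Point]:
    """Merge scalar series from resumed training sessions into one curve.

    Runs must be passed in the order they were started. If two runs logged
    the same step (a resume can replay steps after the last checkpoint),
    the value from the later run is kept.
    """
    merged = {}
    for run in runs:
        for step, value in run:
            merged[step] = value  # later runs overwrite earlier ones
    # Return the points ordered by global step.
    return sorted(merged.items())


# Hypothetical example data, not real measurements.
run_a = [(1000, 2.0), (2000, 1.5), (3000, 1.2)]
run_b = [(3000, 1.25), (4000, 1.1)]  # session resumed from an earlier checkpoint

curve = merge_resumed_runs([run_a, run_b])
print(curve)
```

Keeping the later value on overlapping steps is just one possible choice; averaging the duplicates would also be reasonable.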
My output is a little metallic and grainy, has slight reverberation, and the pitch needs to improve. I want to understand along which dimensions it needs to improve. Also, what can I do better in training to get there?