jungtaekyung1 opened this issue 1 year ago
Hello @jungtaekyung1, thanks for opening this issue.
> epoch?
It's been a while since I last ran these experiments, and I currently don't have access to the server on which I ran these models. However, from model.json, it appears that the model was trained for around 100 epochs. But it's also possible I trained for longer; sorry that I don't have the exact details at the moment.
> Can you provide information about train, val ratio
As for the split ratio, I think I used the 50 Korean songs from CSD, of which 40 went to training and the rest to validation and evaluation. You can find the 8 songs we used for evaluation here. FYI, I didn't use any of the English songs.
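If it helps reproduction, the song-level split was conceptually like the sketch below; the song ID pattern and seed are placeholders, not the exact script I used.

```python
import random

# Hypothetical IDs for the 50 Korean CSD songs; the real file naming
# follows the CSD release, this is only an illustration.
song_ids = [f"kr{i:03d}" for i in range(1, 51)]

random.seed(0)  # assumed seed, not necessarily what was used
random.shuffle(song_ids)

train_songs = song_ids[:40]    # 40 songs for training
heldout_songs = song_ids[40:]  # remaining 10 for validation/evaluation
```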
> I know this noise is because I didn't train the vocoder and used the provided one.
The provided vocoder, IIRC, is just the default HiFi-GAN pretrained on LJSpeech, so you will hear some artifacts. I'm not sure why inference would speed things up by 2x, though.
> This model reproduces the songs it was trained on well, but songs it was not trained on, such as the CSD songs, come out as pure noise. I consider this a separate problem from the CSD-based model producing songs at twice the speed.
Given my hyperparameters, the model may have been trained for too long. Do you know if it is overfitting? Can you run inference with intermediate checkpoints?
> Given my hyperparameters, the model may have been trained for too long. Do you know if it is overfitting? Can you run inference with intermediate checkpoints?
The loss of the model trained on my own data is close to 1, and the songs it was trained on come out well, so I judge it to be overfitting.
I can run inference from each checkpoint, but from the inferred wav files I can only check how phoneme accuracy differs with the number of epochs.
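The per-checkpoint sweep I'm running is essentially this; `load_checkpoint`, `synthesize`, and `save_wav` are placeholders for this repo's actual inference functions, not its real API:

```python
from pathlib import Path

def sweep_checkpoints(ckpt_dir, inputs, out_dir):
    """Run inference once per saved checkpoint so outputs can be compared."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for ckpt in sorted(Path(ckpt_dir).glob("*.pt")):
        model = load_checkpoint(ckpt)            # hypothetical: repo's checkpoint loader
        wav = synthesize(model, inputs)          # hypothetical: repo's inference routine
        save_wav(out / f"{ckpt.stem}.wav", wav)  # hypothetical: wav writer
```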
I think training with 40 songs effectively means training on 80 items, since each CSD song is split into an a and a b part.
The batch sizes written in model.json are train: 384 and val: 368. Shouldn't these be smaller than the number of songs?
Or does that batch size apply to the arrays of segments produced when the wav, txt, and mid files are read in for training?
I preprocessed the training wav files to a 22050 Hz sampling rate, and changed the sampling rate in configs/preprocess.json and hifi-gan/config.json to 22050.
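For reference, my resampling step was along these lines (the directory layout is just how I store things; librosa resamples on load):

```python
import librosa
import soundfile as sf
from pathlib import Path

TARGET_SR = 22050  # must match configs/preprocess.json and hifi-gan/config.json

for wav_path in Path("data/wavs").glob("*.wav"):     # my directory layout
    audio, _ = librosa.load(wav_path, sr=TARGET_SR)  # resamples to 22050 Hz on load
    sf.write(wav_path, audio, TARGET_SR)             # overwrite in place
```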
However, songs that were not seen during training, like the one mentioned at the beginning of this issue, still come out twice as fast at inference.
Is there a sampling-rate-related hyperparameter I haven't considered?
> The batch sizes written in model.json are train: 384 and val: 368. Shouldn't these be smaller than the number of songs?
> Or does that batch size apply to the arrays of segments produced when the wav, txt, and mid files are read in for training?
Each song is at least a few seconds long, and the model cannot be trained on a whole song sequence. Instead, we sample a partial segment from each song to use for training, so there are definitely more than 384 such segments in the entire training set.
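To make that concrete, segment sampling is conceptually like the sketch below; the segment length is illustrative, the real value comes from the config:

```python
import random

SEGMENT_LEN = 1000  # frames per training segment; illustrative value only

def sample_segment(song_features):
    """Crop one random fixed-length segment from a whole song's feature sequence."""
    assert len(song_features) >= SEGMENT_LEN  # songs are longer than one segment
    start = random.randint(0, len(song_features) - SEGMENT_LEN)
    return song_features[start:start + SEGMENT_LEN]

# A batch of 384 draws 384 such segments, possibly several from the same
# song, so the batch size can exceed the number of songs.
```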
> From the litle_star demo you uploaded, I figured out that the sampling rate of the CSD data you trained on was 44100.
The CSD originals might have been 44K, but I downsampled them to 22K for training.
> However, songs that were not seen during training, like the one mentioned at the beginning of this issue, still come out twice as fast at inference.
> Is there a sampling-rate-related hyperparameter I haven't considered?
Are the unseen songs you're running inference on 44K? How have you preprocessed the MIDI files to produce the model inputs? I'd look at the inputs you are feeding into the model at inference time and try doubling them. It's not clear to me why this step would be necessary, though.
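If some step in the MIDI preprocessing still assumes 44.1 kHz while the model runs at 22.05 kHz, every note duration in frames would come out half as long; the doubling I mean is just this (a guess, not a confirmed fix):

```python
def rescale_durations(durations, src_sr=44100, tgt_sr=22050):
    """Rescale per-note frame durations if they were computed at the wrong rate."""
    scale = src_sr / tgt_sr  # 2.0 for 44100 -> 22050
    return [int(round(d * scale)) for d in durations]
```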
I know this noise is because I didn't train the vocoder and used the provided one.
Also, this song comes out at a normal tempo, the same speed as the mid file used for training.
However, inference using the lyrics and mid file of a song not used for training is sped up by a factor of 2.
Number of songs: 30; epochs: 740
This model reproduces the songs it was trained on well, but songs it was not trained on, such as the CSD songs, come out as pure noise. I consider this a separate problem from the CSD-based model producing songs at twice the speed.
I suspect the problem is the hyperparameters.
Can you provide information about the train/val ratio and the number of epochs? I know the number of songs I have is low, but I want to compare it against the epoch count you used.
++ Also, I tried transfer learning to make up for my lack of data. Training on the CSD data for 100 epochs and then on my 30 songs for 60 epochs on top of it gave a similar result to training only on my own songs.
This time, while waiting for your answer, I will try training on a mix of the CSD data and my own data, and also continuing training on that mix from a CSD-only checkpoint.
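Concretely, I plan to warm-start from the CSD-only checkpoint roughly like this; the checkpoint filename, the `build_model` helper, and the state-dict key are my guesses, not this repo's confirmed format:

```python
import torch

state = torch.load("ckpt_csd_only.pt", map_location="cpu")  # assumed filename
model = build_model(config)            # hypothetical: repo's model constructor
model.load_state_dict(state["model"])  # "model" key is an assumption
# ...then resume the normal training loop on CSD + my 30 songs.
```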