jxzhanggg / nonparaSeq2seqVC_code

Implementation code of non-parallel sequence-to-sequence VC
MIT License

Speaker and linguistic embedding visualizations do not look as good as in the paper #29

Closed. huukim136 closed this issue 4 years ago

huukim136 commented 4 years ago

Hi @jxzhanggg ,

I trained your model and the converted speeches sound promising (I also attached some samples below). Then I tried to visualize the speaker and linguistic embeddings. However, they do not overlap as cleanly as in the paper, and there are still some outliers lying where they should not be (you can see this in the figures below). So I'm wondering whether this is due to poorly chosen parameters for the t-SNE visualization (e.g. perplexity, number of iterations, learning rate, etc.) or something else.

Could you give me some comments on this? Thank you!

samples.zip

jxzhanggg commented 4 years ago

Hi, it looks good! For your reference, I used a perplexity of 12 and 1,000 iterations. I think the quality of the clustering is affected both by the t-SNE hyperparameters and by the randomness of the model training process. Try lowering the learning rate towards the end of training; I think the network will converge better.
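For anyone trying to reproduce the plots, a minimal sketch of the visualization with these settings (perplexity 12, 1,000 iterations) might look like the following; the file and variable names are only illustrative, not from the repo:

```python
# Hypothetical t-SNE visualization sketch (names are illustrative):
# assumes speaker embeddings were dumped to an (N, D) array with one
# integer speaker label per row.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.load("speaker_embeddings.npy")  # (N, D), hypothetical dump
labels = np.load("speaker_labels.npy")          # (N,), hypothetical dump

tsne = TSNE(n_components=2, perplexity=12, n_iter=1000, init="pca", random_state=0)
points = tsne.fit_transform(embeddings)         # (N, 2) projected points

plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap="tab20")
plt.title("Speaker embeddings (t-SNE, perplexity=12)")
plt.savefig("speaker_embedding_tsne.png")
```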

huukim136 commented 4 years ago

Thank you so much!

youngsuenXMLY commented 4 years ago

Hi, it seems you have reproduced the results. What other preprocessing did you do? @huukim136

huukim136 commented 4 years ago

What other preprocessing did you do

Hi @youngsuenXMLY, I did nothing except normalize the mel features as the author recommended. In addition, remember to reduce the learning rate gradually as you train the model, and you'll get good results.

ivancarapinha commented 4 years ago

What other preprocessing did you do

Hi @youngsuenXMLY, I did nothing except normalize the mel features as the author recommended. In addition, remember to reduce the learning rate gradually as you train the model, and you'll get good results.

Hello @huukim136, @jxzhanggg. At what pace should the learning rate be reduced? Also, when you say "normalizing the mel features", are you referring to the normalization of the mel-spectrograms in extract_features.py, by setting norm=1? https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/e2fe19592b8c3a8189b609f890f1c8870b1ca0ed/pre-train/reader/extract_features.py#L26

Thank you very much

youngsuenXMLY commented 4 years ago

@ivancarapinha The normalization is (x - x_mean) / x_std, where x_mean is the global mean and x_std is the global standard deviation. For the learning rate, I reduce it by a factor alpha = 0.95, i.e. lr = lr * alpha whenever training_steps % 1000 == 0.
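A minimal sketch of that decay rule in PyTorch (the model and optimizer below are placeholders, not the repo's training code) could be:

```python
# Sketch of the schedule described above: multiply the learning rate by
# alpha = 0.95 every 1000 training steps.
import torch

model = torch.nn.Linear(80, 80)                            # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed starting lr
alpha = 0.95

for step in range(1, 20001):
    # ... forward pass, loss.backward(), optimizer.step() would go here ...
    if step % 1000 == 0:
        for group in optimizer.param_groups:
            group["lr"] *= alpha
```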

ivancarapinha commented 4 years ago

@youngsuenXMLY, what data did you use to compute the global mean and standard deviation? Did you use all mel-spectrograms / spectrograms from the 99 speakers, or only the ones in the training set? Is it necessary to trim leading and trailing silence?

Thank you.

youngsuenXMLY commented 4 years ago

@ivancarapinha

  1. I use all data (all 99 speakers' data) to compute the global mean and standard deviation.
  2. I use librosa to trim the silence parts.
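Putting those two answers together, a rough sketch of the feature preprocessing could look like the following; details such as the sample rate, top_db, mel settings, and file layout are assumptions, not the repo's exact extract_features.py:

```python
# Hypothetical preprocessing sketch: trim silence with librosa, then
# compute a global mean/standard deviation over all speakers' mel
# features and normalize every utterance with those statistics.
import glob
import librosa
import numpy as np

mels = []
for path in glob.glob("wavs/**/*.wav", recursive=True):   # hypothetical layout
    y, sr = librosa.load(path, sr=16000)                  # assumed sample rate
    y, _ = librosa.effects.trim(y, top_db=30)             # remove leading/trailing silence
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
    mels.append(np.log(mel + 1e-6).T)                     # (frames, 80) log-mel

stacked = np.concatenate(mels, axis=0)
mel_mean = stacked.mean(axis=0)                           # global mean
mel_std = stacked.std(axis=0)                             # global standard deviation

normalized = [(m - mel_mean) / mel_std for m in mels]     # (x - mean) / std per utterance
```
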
odcowl commented 4 years ago

Hi @huukim136, I am trying to visualize the speaker and linguistic embeddings as you did. For the linguistic embeddings, we want to use text_hidden and audio_seq2seq_hidden as input; is that what you used for your second figure? If so, each sentence has a different number of phonemes, so the outputs have different sizes. Since t-SNE expects inputs of a uniform size, did you do some kind of normalization for this as well?

Thank you!

huukim136 commented 4 years ago

the outputs have different sizes

Yes, exactly. In the second figure I used audio_seq2seq_hidden as input. For example, audio_seq2seq_hidden has a shape of (L, 512); I calculate the mean over all L steps to obtain a single (1, 512) vector.
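In other words, each utterance is mean-pooled over time before t-SNE. A minimal sketch (variable names are illustrative, not from the repo):

```python
# Average the per-step hidden states (L, 512) into one fixed-size
# vector per utterance, so every sentence becomes a single t-SNE point.
import numpy as np

def pool_hidden(hidden):
    # hidden: (L, 512) array of audio_seq2seq_hidden states for one utterance
    return hidden.mean(axis=0)  # -> (512,)

# Stack one pooled vector per utterance before running t-SNE:
# tsne_input = np.stack([pool_hidden(h) for h in all_hidden_states])  # (N, 512)
```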