lucidrains / spear-tts-pytorch

Implementation of Spear-TTS - multi-speaker text-to-speech attention network, in Pytorch
MIT License

Questions about code #2

Closed - shanhaidexiamo closed this issue 1 year ago

shanhaidexiamo commented 1 year ago

Thanks for your contribution to spear-tts. I have some questions about the code.

  1. When training the pre-training model, the input should be the corrupted semantic tokens and the label the original semantic tokens from wav2vec, but it seems that your input is the tokens kept by the negated delete mask while the label is only the tokens selected by the delete mask, i.e. the deleted tokens rather than the full original sequence (see the sketch after this list):

     source = rearrange(x[~delete_mask], '(b n) -> b n', b = batch)
     target = rearrange(x[delete_mask], '(b n) -> b n', b = batch)

  2. What does the code on line 393 mean? to_text_logit.weight = to_text_logit.weight
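A minimal sketch of what those two lines compute, with made-up shapes; it assumes every row has the same number of deleted positions, which the rearrange back to a rectangular tensor requires:

```python
# minimal sketch (not the repo's code) of the masking / rearrange above
import torch
from einops import rearrange

batch, seq_len, num_deleted = 2, 10, 6

x = torch.randint(0, 100, (batch, seq_len))             # semantic token ids

# build a delete mask with exactly `num_deleted` True entries per row
delete_mask = torch.zeros(batch, seq_len, dtype = torch.bool)
for i in range(batch):
    idx = torch.randperm(seq_len)[:num_deleted]
    delete_mask[i, idx] = True

# kept tokens form the corrupted source, deleted tokens form the target
source = rearrange(x[~delete_mask], '(b n) -> b n', b = batch)   # shape (2, 4)
target = rearrange(x[delete_mask], '(b n) -> b n', b = batch)    # shape (2, 6)
```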

Thank you

lucidrains commented 1 year ago

@shanhaidexiamo

re: 2 - oops, line 393 was an error, thanks!

re: 1 - i actually wasn't too sure what the deletion pre-training task was. is it as you described (reconstruct the full original sequence), or is it to predict only the tokens that were deleted from the source sequence?

lucidrains commented 1 year ago

i thought it was, for a sentence like "cat chase a mouse", "cat mouse" -> "chase a". but you are saying it should be "cat mouse" -> "cat chase a mouse"?

shanhaidexiamo commented 1 year ago

> i thought it was, for a sentence like "cat chase a mouse", "cat mouse" -> "chase a". but you are saying it should be "cat mouse" -> "cat chase a mouse"?

Yes. But I'm not sure which method is more useful; it seems that the latter can fill the scattered remaining tokens back into complete sentence tokens.
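To make the two candidate targets concrete, a toy sketch (not the repo's code) on the "cat chase a mouse" example:

```python
# toy comparison of the two deletion-pretraining targets discussed above
tokens = ['cat', 'chase', 'a', 'mouse']
delete_mask = [False, True, True, False]        # 'chase' and 'a' are deleted

source = [t for t, d in zip(tokens, delete_mask) if not d]       # corrupted input

# option 1: predict only the deleted tokens
target_deleted_only = [t for t, d in zip(tokens, delete_mask) if d]

# option 2: reconstruct the full original sequence
target_reconstruct = list(tokens)

print(source)               # ['cat', 'mouse']
print(target_deleted_only)  # ['chase', 'a']
print(target_reconstruct)   # ['cat', 'chase', 'a', 'mouse']
```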

lucidrains commented 1 year ago

@shanhaidexiamo i'm not caught up enough on pretraining literature to know

maybe i can allow for both options?

lucidrains commented 1 year ago

@shanhaidexiamo ok, just made it an option https://github.com/lucidrains/spear-tts-pytorch/commit/ae9cad1309d698d0f83547ce2fa48a6a7fded1d5

shanhaidexiamo commented 1 year ago

> @shanhaidexiamo ok, just made it an option ae9cad1

Actually I got this from the model figure: the pre-training model uses <corrupt(speech), speech> as the input and label, so I think the authors just corrupt the input.

lucidrains commented 1 year ago

yeah they did, but there are many ways of corrupting

they apparently chose the 'deletion' technique, with 60% of tokens deleted. or rather, they said that worked the most effectively
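One way to sample such a mask, as a sketch under the assumption that a fixed 60% of positions is deleted per row so all corrupted rows keep the same length (not necessarily how the paper or this repo does it):

```python
# sketch: delete a fixed fraction (60%) of positions per row
import torch

def sample_delete_mask(batch, seq_len, delete_prob = 0.6):
    num_deleted = int(seq_len * delete_prob)
    # random score per position; the top-k scores per row mark deleted tokens
    scores = torch.rand(batch, seq_len)
    deleted_indices = scores.topk(num_deleted, dim = -1).indices
    mask = torch.zeros(batch, seq_len, dtype = torch.bool)
    rows = torch.arange(batch).unsqueeze(-1)
    mask[rows, deleted_indices] = True
    return mask

mask = sample_delete_mask(2, 10)
print(mask.sum(dim = -1))   # tensor([6, 6]) -> 60% of positions deleted per row
```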

shanhaidexiamo commented 1 year ago

> yeah they did, but there are many ways of corrupting
>
> they apparently chose the 'deletion' technique, with 60% of tokens deleted. or rather, they said that worked the most effectively

Alright. I will try both methods, since I am not sure which one works better for training text to semantic tokens.

lucidrains commented 1 year ago

yea, we should figure out how the deletion pretraining is actually done so we can test their claim. anyhow, let's start with these two options