keonlee9420 / Parallel-Tacotron2

PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Can we train with this yet? #10

Open EmElleE opened 3 years ago

EmElleE commented 3 years ago

Just wondering if we can train on LJSpeech with this implementation. Thanks!

keonlee9420 commented 3 years ago

Hi @EmElleE, yes, you can, but you'd need to tune the hparams for the residual encoder; it's really close.

ArEnSc commented 3 years ago

@keonlee9420 Quick question: do you have the LJS model? I would like to fine-tune on this. Do you know how much data is required for fine-tuning? Also, is the quality close to Tacotron 2? These days people seem to use Tacotron 2 because it works well for cloning voices. Do you think Parallel Tacotron 2 is similar or as capable?

keonlee9420 commented 3 years ago

Hi @ArEnSc, I don't have it yet, but I'll share it when I do. Please note that the results would be much worse than expected, since the maximum batch size is much smaller than in the original paper.

huypl53 commented 3 years ago

Take a look at this:

speaker_embedding_m = speaker_embedding.unsqueeze(1).expand(
    -1, max_mel_len, -1
)

position_enc = self.position_enc[
    :, :max_mel_len, :
].expand(batch_size, -1, -1)

enc_input = torch.cat([position_enc, speaker_embedding_m, mel], dim=-1)

speaker_embedding_m and mel both have size max_mel_len in dim 1, but position_enc has max_seq_len + 1 there, which is different. Therefore torch.cat will raise an exception. Am I right?
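
For reference, here is a minimal standalone sketch of the mismatch described above. The shapes are made up for illustration and are not the repo's actual config values; only the table length of max_seq_len + 1 is taken from this discussion.

import torch

# Hypothetical shapes for illustration only
batch_size, max_seq_len, max_mel_len, d_model, n_mels = 2, 1000, 1200, 256, 80

# Precomputed positional table of length max_seq_len + 1, standing in for self.position_enc
position_table = torch.randn(1, max_seq_len + 1, d_model)

speaker_embedding = torch.randn(batch_size, d_model)
mel = torch.randn(batch_size, max_mel_len, n_mels)

speaker_embedding_m = speaker_embedding.unsqueeze(1).expand(-1, max_mel_len, -1)

# Slicing past the end of the table silently keeps only max_seq_len + 1 positions
position_enc = position_table[:, :max_mel_len, :].expand(batch_size, -1, -1)
print(position_enc.shape)         # torch.Size([2, 1001, 256])
print(speaker_embedding_m.shape)  # torch.Size([2, 1200, 256])

# All non-concatenated dims must match, so this raises a RuntimeError (1001 vs 1200 in dim 1)
enc_input = torch.cat([position_enc, speaker_embedding_m, mel], dim=-1)

Running it prints the two mismatched shapes and then fails on the torch.cat call.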

keonlee9420 commented 3 years ago

Hi @phamlehuy53 , position_enc also has max_seq_len in that dimension.

huypl53 commented 3 years ago

> Hi @phamlehuy53, position_enc also has max_seq_len in that dimension.

But you notice that speaker_embedding_m and mel have max_mel_len there instead, don't you?

keonlee9420 commented 3 years ago

oh, sorry I mistyped. position_enc has max_mel_len, not max_seq_len.

position_enc = self.position_enc[
    :, :max_mel_len, :
].expand(batch_size, -1, -1)

huypl53 commented 3 years ago

> oh, sorry I mistyped. position_enc has max_mel_len, not max_seq_len.
>
> position_enc = self.position_enc[
>     :, :max_mel_len, :
> ].expand(batch_size, -1, -1)

Yep, but when max_mel_len is higher than max_seq_len, dim 1 of position_enc is still only max_seq_len + 1 in length. That causes the dimension mismatch among torch.cat's arguments. Sorry for missing this info in the first question. Thanks!
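
For what it's worth, one possible workaround, sketched here with a hypothetical sinusoid_table helper rather than the repo's own positional-encoding code, is to rebuild a longer table on the fly whenever max_mel_len exceeds the precomputed length, instead of letting the slice silently truncate:

import torch

def sinusoid_table(n_position, d_model):
    # Hypothetical helper: standard sinusoidal positional encodings, shape (1, n_position, d_model)
    position = torch.arange(n_position, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float)
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )
    table = torch.zeros(n_position, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table.unsqueeze(0)

# Illustrative values only
batch_size, max_seq_len, max_mel_len, d_model = 2, 1000, 1200, 256

position_table = sinusoid_table(max_seq_len + 1, d_model)  # what self.position_enc would hold

if max_mel_len > position_table.shape[1]:
    # Rebuild a longer table instead of silently truncating the slice below
    position_table = sinusoid_table(max_mel_len, d_model).to(position_table.device)

position_enc = position_table[:, :max_mel_len, :].expand(batch_size, -1, -1)
print(position_enc.shape)  # torch.Size([2, 1200, 256]), now matches max_mel_len

Alternatively, setting the configured max_seq_len to at least the longest mel length in the dataset avoids the mismatch without any code change.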