EmElleE opened this issue 3 years ago

Just wondering if we can train with LJS on this implementation. Thanks!
Hi @EmElleE, yes, you could, but you would need to tune the hyperparameters for the residual encoder, and it is really close to that already.
@keonlee9420 Quick question: do you have the LJS model? I would like to fine-tune on it. Do you know how much data is required for fine-tuning? Also, is the quality close to Tacotron 2? It seems like these days people use Tacotron 2 because it works well for cloning voices. Do you think Parallel-Tacotron2 is similar, or capable of that?
Hi @ArEnSc, I don't have it yet, but I'll share it when I get it. Please note, though, that the result would be much worse than expected, since the maximum batch size is too small compared to the original paper.
Take a look at this:

```python
speaker_embedding_m = speaker_embedding.unsqueeze(1).expand(
    -1, max_mel_len, -1
)
position_enc = self.position_enc[
    :, :max_mel_len, :
].expand(batch_size, -1, -1)
enc_input = torch.cat([position_enc, speaker_embedding_m, mel], dim=-1)
```

`speaker_embedding_m` and `mel` both have `max_mel_len` in dimension 1, but `position_enc` has `max_seq_len + 1`, which is different. Therefore `torch.cat` will raise an exception. Am I right?
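For concreteness, here is a minimal, self-contained sketch of the mismatch being described; the batch size, lengths, and channel widths below are made up for illustration and are not taken from the repo:

```python
# Minimal repro sketch: torch.cat along dim=-1 requires every other dimension
# to match, so a different time length (dim 1) for position_enc raises an error.
import torch

batch_size, max_mel_len, max_seq_len = 2, 100, 80  # hypothetical sizes
d_pos, d_spk, n_mels = 64, 64, 80                  # hypothetical channel widths

position_enc = torch.randn(batch_size, max_seq_len + 1, d_pos)    # time dim != max_mel_len
speaker_embedding_m = torch.randn(batch_size, max_mel_len, d_spk)
mel = torch.randn(batch_size, max_mel_len, n_mels)

try:
    enc_input = torch.cat([position_enc, speaker_embedding_m, mel], dim=-1)
except RuntimeError as e:
    print(e)  # "Sizes of tensors must match except in dimension ..." (wording varies by version)
```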
Hi @phamlehuy53, `position_enc` also has `max_seq_len` in that dimension.
> Hi @phamlehuy53, `position_enc` also has `max_seq_len` in that dimension.
But you notice that `speaker_embedding_m` and `mel` have `max_mel_len` instead, don't you?
Oh, sorry, I mistyped. `position_enc` has `max_mel_len`, not `max_seq_len`:

```python
position_enc = self.position_enc[
    :, :max_mel_len, :
].expand(batch_size, -1, -1)
```
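A small sketch of why that slice lines up, assuming `self.position_enc` is a precomputed positional table of shape `(1, max_seq_len + 1, d_model)` (an assumption for illustration, not taken from this repo): slicing with `:max_mel_len` returns exactly `max_mel_len` positions whenever `max_mel_len <= max_seq_len`.

```python
# Sketch with hypothetical shapes: the positional table is longer than the mel
# sequence, so the slice returns exactly max_mel_len rows and the cat succeeds.
import torch

batch_size, max_seq_len, max_mel_len, d_model = 2, 1000, 100, 256
position_table = torch.randn(1, max_seq_len + 1, d_model)  # stand-in for self.position_enc

position_enc = position_table[:, :max_mel_len, :].expand(batch_size, -1, -1)
print(position_enc.shape)  # torch.Size([2, 100, 256]) -- matches max_mel_len
```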
> Oh, sorry, I mistyped. `position_enc` has `max_mel_len`, not `max_seq_len`:
> `position_enc = self.position_enc[:, :max_mel_len, :].expand(batch_size, -1, -1)`
Yep, but when `max_mel_len` is greater than `max_seq_len`, dimension 1 of `position_enc` still has length `max_seq_len`, which causes the dimension mismatch in `torch.cat`'s arguments. I'm sorry for missing this info in the first question. Thanks!
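To illustrate the edge case being pointed out: slicing past the end of the table does not raise an error, it silently returns only the rows that exist, so the mismatch only surfaces later inside `torch.cat`. The shapes and the guard below are an illustrative sketch, not this repo's actual code or fix:

```python
# Edge case sketch: an utterance longer than the precomputed positional table.
import math
import torch

def sinusoid_table(n_position, d_model):
    # Standard sinusoidal positional encoding, used here only to illustrate
    # regenerating a longer table; not necessarily how this repo builds it.
    position = torch.arange(n_position, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model)
    )
    table = torch.zeros(n_position, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table.unsqueeze(0)  # (1, n_position, d_model)

batch_size, max_seq_len, d_model = 2, 1000, 256
max_mel_len = 1200  # longer than the precomputed table

position_table = sinusoid_table(max_seq_len + 1, d_model)  # stand-in for self.position_enc

# Slicing past the end does not raise -- it just returns all 1001 rows.
short = position_table[:, :max_mel_len, :].expand(batch_size, -1, -1)
print(short.shape)  # torch.Size([2, 1001, 256]), no longer equal to max_mel_len

# One possible guard: regenerate a long-enough table when the input exceeds it.
if position_table.size(1) < max_mel_len:
    position_table = sinusoid_table(max_mel_len, d_model)
position_enc = position_table[:, :max_mel_len, :].expand(batch_size, -1, -1)
print(position_enc.shape)  # torch.Size([2, 1200, 256])
```

Regenerating the table on the fly or capping mel length in preprocessing are two common ways to avoid this; which one fits best depends on the rest of the pipeline.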