Closed: Tera2Space closed this issue 8 months ago
I left it in for curious people to try.
Yep, I tried, but the loss gets stuck at around 6. I think it's because we give it the full output of the encoder (mu) instead of only the output of text_base_encoder (as in NaturalSpeech). Could that be the reason?
I am not sure I understand your question. If you paste the code here, I might understand better.
We calculate the alignment:
aln_hard, aln_soft, aln_log, aln_mask = self.aligner(mu_x.transpose(1,2), x_mask, y, y_mask)
mu_x is the output of the encoder, with information about the speech prompt mixed into it:
mu_x, logw, x_mask = self.encoder(x, x_lengths, prompt_slice)
so I think using x_emb from the text encoder may work better as the aligner input:
x_emb = self.text_base_encoder(x_emb, x_emb_mask)
If MAS can align it with mu_x, the AlignerNet should be able to do it too. It will just take longer to converge.
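For context, the soft alignment such aligners compute is typically a softmax over negative pairwise L2 distances between text embeddings and mel frames, as in "One TTS Alignment To Rule Them All" (the basis of this Aligner). A minimal sketch, assuming both inputs are already projected to the same channel dimension (the function name and temperature value here are illustrative, not from this repo):

```python
import torch
import torch.nn.functional as F

def soft_alignment(text_emb, mel, temperature=0.0005):
    """Soft text-to-mel alignment from pairwise squared L2 distances.

    text_emb: (B, T_text, C) -- e.g. x_emb from the text encoder
    mel:      (B, T_mel, C)  -- mel frames projected to the same dim
    Returns attn: (B, T_mel, T_text); each row sums to 1 over the text axis.
    """
    # Negative squared distance acts as the attention energy.
    dist_sq = torch.cdist(mel, text_emb) ** 2        # (B, T_mel, T_text)
    attn = F.softmax(-temperature * dist_sq, dim=-1)
    return attn

# Toy shapes: batch 2, 7 tokens, 30 mel frames, 16 channels.
attn = soft_alignment(torch.randn(2, 7, 16), torch.randn(2, 30, 16))
```

Swapping x_emb in for mu_x only changes which embedding the distances are computed against; the rest of the pipeline (ForwardSumLoss on the log-alignment, binarization into aln_hard) stays the same.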
And one last question: for AlignerNet to work, I only need to uncomment:
# self.aligner = Aligner(
# dim_in=encoder.encoder_params.n_feats,
# dim_hidden=encoder.encoder_params.n_feats,
# attn_channels=encoder.encoder_params.n_feats,
# )
# self.aligner_loss = ForwardSumLoss()
# self.bin_loss = BinLoss()
# self.aligner_bin_loss_weight = 0.0
and
# aln_hard, aln_soft, aln_log, aln_mask = self.aligner(
# mu_x.transpose(1,2), x_mask, y, y_mask
# )
# attn = aln_mask.transpose(1,2).unsqueeze(1)
# align_loss = self.aligner_loss(aln_log, x_lengths, y_lengths)
# if self.aligner_bin_loss_weight > 0.:
# align_bin_loss = self.bin_loss(aln_mask, aln_log, x_lengths) * self.aligner_bin_loss_weight
# align_loss = align_loss + align_bin_loss
# dur_loss = F.l1_loss(logw, attn.sum(2))
# dur_loss = dur_loss + align_loss
and comment out the MAS usage, right?
Yes correct.
Great, thanks a lot for the answers!
So I noticed the commented-out aligner in the pflow code. Did you comment it out because it didn't work, or because it didn't improve quality much?