lifeiteng / vall-e

PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech). Reproduced demo: https://lifeiteng.github.io/valle/index.html
Apache License 2.0
1.99k stars, 320 forks

Question about order of operations: nar_audio_prenet and nar_audio_position #180

Closed. Misha24-10 closed this issue 5 months ago

Misha24-10 commented 7 months ago

I am working with the code in the following snippet (https://github.com/lifeiteng/vall-e/blob/9c69096d603ce13174fb5cb025f185e2e9b36ac7/valle/models/valle.py#L1193C1-L1194C53)

y_pos = self.nar_audio_position(y_emb)
y_pos = self.nar_audio_prenet(y_pos)

I am uncertain about the order of operations here. Should nar_audio_prenet be applied before nar_audio_position in this context?

lifeiteng commented 5 months ago

You can try a different order, but the recommended configuration here is nn.Identity():

self.nar_audio_prenet = nn.Identity()
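
With nn.Identity() as the prenet, both orderings compute the same thing, so the question only matters for a non-identity prenet. The sketch below is one way to check this. It is not code from the repository: AddPosition is a hypothetical stand-in for nar_audio_position that simply adds a per-step offset, order_matters is a hypothetical helper, and the Linear + ReLU prenet is an assumed example of a non-identity prenet.

```python
import torch
import torch.nn as nn


class AddPosition(nn.Module):
    """Hypothetical stand-in for nar_audio_position: adds a fixed per-step offset."""

    def __init__(self, max_len: int, dim: int):
        super().__init__()
        # Deterministic table: row t holds the value t in every channel.
        pe = torch.arange(max_len, dtype=torch.float32).unsqueeze(1).expand(max_len, dim)
        self.register_buffer("pe", pe.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, time, dim); the table is broadcast over the batch.
        return x + self.pe[: x.size(1)]


def order_matters(prenet: nn.Module, position: nn.Module, y_emb: torch.Tensor) -> bool:
    """Return True if swapping prenet and position changes the output."""
    with torch.no_grad():
        position_then_prenet = prenet(position(y_emb))  # order used in the issue snippet
        prenet_then_position = position(prenet(y_emb))  # order asked about in the issue
    return not torch.allclose(position_then_prenet, prenet_then_position)


torch.manual_seed(0)
y_emb = torch.randn(2, 5, 8)  # assumed shape: (batch, time, dim)
position = AddPosition(max_len=16, dim=8)

identity_prenet = nn.Identity()  # the recommended configuration
nonlinear_prenet = nn.Sequential(nn.Linear(8, 8), nn.ReLU())  # assumed non-identity prenet

print("Identity prenet, order matters:", order_matters(identity_prenet, position, y_emb))
print("Linear+ReLU prenet, order matters:", order_matters(nonlinear_prenet, position, y_emb))
```

For the identity prenet the two outputs are identical by construction. For the assumed Linear + ReLU prenet they generally differ, because the prenet transforms the positional offset in one ordering and leaves it untouched in the other.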