1) Don't mask the noised speech (aka flow step) on input
2) Provide the masked speech conditioning separately
3) Embed the noised speech, masked speech conditioning, and text separately and combine them
To be honest, this doesn't change the training dynamics of the network very much, but aligning the implementation with the paper seems useful.
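
For reviewers, here is a minimal pure-Python sketch of the three changes listed above. It is illustrative only and is not the code in this PR. The names `build_inputs`, `combine`, and `linear` are hypothetical. It assumes the noised speech is the linear interpolation `(1 - t) * noise + t * speech`, that masked conditioning frames are zeroed, that the text is already aligned to the frame length, and that "combine" means summing the three projections.

```python
import random


def linear(frames, weight):
    """Project each frame with a weight matrix laid out as [d_in][d_out]."""
    return [
        [sum(x * w for x, w in zip(frame, column)) for column in zip(*weight)]
        for frame in frames
    ]


def build_inputs(speech, noise, t, mask):
    """Return (noised speech, masked speech conditioning) as separate inputs."""
    # 1) The noised speech is interpolated over every frame and is NOT masked.
    #    Assumption: x_t = (1 - t) * noise + t * speech.
    noised = [
        [(1 - t) * n + t * s for n, s in zip(noise_frame, speech_frame)]
        for noise_frame, speech_frame in zip(noise, speech)
    ]
    # 2) The conditioning is the clean speech with masked frames zeroed out
    #    (assumption: zero-fill), provided separately from the noised speech.
    cond = [
        [0.0] * len(frame) if is_masked else list(frame)
        for frame, is_masked in zip(speech, mask)
    ]
    return noised, cond


def combine(noised, cond, text, w_noised, w_cond, w_text):
    """3) Embed each stream with its own projection, then combine them."""
    streams = [linear(noised, w_noised), linear(cond, w_cond), linear(text, w_text)]
    # Assumption: the embeddings are combined by element-wise summation.
    return [[sum(values) for values in zip(*frames)] for frames in zip(*streams)]


def rand_matrix(rows, cols):
    """Random stand-in for speech frames, text embeddings, or learned weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, D_MEL, D_TEXT, D_MODEL = 4, 3, 2, 5  # toy sizes, chosen for illustration

speech = rand_matrix(T, D_MEL)
noise = rand_matrix(T, D_MEL)
text = rand_matrix(T, D_TEXT)       # assumed already aligned to T frames
mask = [False, True, True, False]   # True = frame the model must infill

noised, cond = build_inputs(speech, noise, t=0.5, mask=mask)
hidden = combine(
    noised, cond, text,
    rand_matrix(D_MEL, D_MODEL),
    rand_matrix(D_MEL, D_MODEL),
    rand_matrix(D_TEXT, D_MODEL),
)
```

In this sketch the mask only affects `cond`. The noised speech reaches the network intact on every frame, which is the point of change 1.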