Closed: Dorniwang closed this issue 2 years ago.

To ensure that audio emotion and speech content are disentangled, you design a Cross-Reconstructed Emotion Disentanglement module in the paper. In my opinion, the emotion encoder and the content encoder should be frozen once the disentanglement training is finished. However, I found that the two pretrained models you provide for two different subjects have totally different weights in both the emotion encoder and the content encoder. So I guess that you fine-tune these two encoders together with the other parts when you train your audio2lm module, but how can you guarantee the disentanglement once you fine-tune these two encoders?
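For reference, by "freeze" I mean something like the following minimal PyTorch sketch; the GRU encoders below are placeholders, not the actual modules from this repo:

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Detach a pretrained module from further training."""
    module.eval()  # also fixes Dropout/BatchNorm behavior
    for p in module.parameters():
        p.requires_grad = False  # exclude from optimizer updates

# Placeholder encoders standing in for the pretrained modules.
emotion_encoder = nn.GRU(input_size=80, hidden_size=128, batch_first=True)
content_encoder = nn.GRU(input_size=80, hidden_size=128, batch_first=True)

freeze(emotion_encoder)
freeze(content_encoder)
```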
Btw, I sent the question to your email but didn't get any response :)
Sorry for the delayed response. The weights differ between the two subjects because we train the Cross-Reconstructed Emotion Disentanglement module separately for each subject. We previously tried training this part on data from more subjects, but found that the disentanglement gets worse as more identities are added.
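In case it helps, the idea behind the cross-reconstruction objective can be sketched roughly as follows. This is an illustrative outline, not the exact training code: `Ec`, `Ee`, and `D` stand for the content encoder, emotion encoder, and decoder, and `x_i`, `x_j` are assumed to be paired clips that share speech content but differ in emotion, with targets `y_i`, `y_j`:

```python
import torch.nn.functional as F

def cross_reconstruction_loss(Ec, Ee, D, x_i, x_j, y_i, y_j):
    """Sketch of a cross-reconstruction disentanglement objective.

    x_i and x_j share speech content but carry different emotions;
    y_i / y_j are their respective reconstruction targets.
    """
    c_i, c_j = Ec(x_i), Ec(x_j)  # content codes
    e_i, e_j = Ee(x_i), Ee(x_j)  # emotion codes

    # Self-reconstruction: own content + own emotion.
    self_loss = F.l1_loss(D(c_i, e_i), y_i) + F.l1_loss(D(c_j, e_j), y_j)

    # Cross-reconstruction: one clip's content combined with the other
    # clip's emotion must reproduce the target matching that emotion,
    # which forces the two codes to carry separate information.
    cross_loss = F.l1_loss(D(c_i, e_j), y_j) + F.l1_loss(D(c_j, e_i), y_i)

    return self_loss + cross_loss
```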
Ok, got it, thanks :)