jixinya / EVP

Code for paper 'Audio-Driven Emotional Video Portraits'.

The effectiveness on "Cross-Reconstructed Emotion Disentanglement" module #14

Closed Dorniwang closed 2 years ago

Dorniwang commented 2 years ago

To ensure audio emotion and speech content are disentangled, you design a Cross-Reconstructed Emotion Disentanglement module in the paper. In my opinion, the emotion encoder and content encoder should be frozen once the disentanglement training is finished. But I found that the two pretrained models you provide for the two different subjects have totally different weights in the emotion encoder and content encoder. So I guess that you finetune these two encoders together with the other parts when you train your audio2lm module, but how can you guarantee the disentanglement once you finetune these two encoders?
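
For reference, this is roughly what I mean by freezing the two encoders after the disentanglement stage (a PyTorch-style sketch; the module names and shapes are placeholders, not the actual EVP code):

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module):
    """Stop gradient updates and fix norm/dropout behaviour."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

# Stand-ins for the pretrained encoders from the disentanglement stage
# (illustrative modules only, not the repository's actual architecture).
emotion_encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
content_encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
decoder = nn.Linear(128, 68 * 2)  # e.g. predicting 68 2D landmarks

freeze(emotion_encoder)
freeze(content_encoder)

# Only the unfrozen parts (here, the decoder) would be optimised
# in the audio-to-landmark stage.
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)
```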

Dorniwang commented 2 years ago

btw, I sent the question to your email, but didn't get any response : )

jixinya commented 2 years ago

Sorry for the delayed response. The weights are different for the two subjects because we train the Cross-Reconstructed Emotion Disentanglement module separately for each subject. We tried training this part using data from more subjects before, but found that the disentanglement gets worse with more identities.
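
At a high level, the cross-reconstruction idea is the following (a simplified sketch, not the released training code; the tensor shapes and loss are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative encoders/decoder, not the actual EVP modules.
content_enc = nn.Linear(128, 64)   # audio feature -> content embedding
emotion_enc = nn.Linear(128, 64)   # audio feature -> emotion embedding
decoder     = nn.Linear(128, 128)  # (content, emotion) -> reconstruction

# Two clips paired so that swapping the emotion code still has a valid
# ground-truth target: (content i, emotion m) and (content j, emotion n).
x_a = torch.randn(8, 128)
x_b = torch.randn(8, 128)
target_ab = torch.randn(8, 128)  # content i spoken with emotion n
target_ba = torch.randn(8, 128)  # content j spoken with emotion m

c_a, e_a = content_enc(x_a), emotion_enc(x_a)
c_b, e_b = content_enc(x_b), emotion_enc(x_b)

# Swap emotion embeddings across the pair and reconstruct both combinations.
recon_ab = decoder(torch.cat([c_a, e_b], dim=1))
recon_ba = decoder(torch.cat([c_b, e_a], dim=1))

loss = F.l1_loss(recon_ab, target_ab) + F.l1_loss(recon_ba, target_ba)
```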

Dorniwang commented 2 years ago

> Sorry for the delayed response. The weights are different for the two subjects because we train the Cross-Reconstructed Emotion Disentanglement module separately for each subject. We tried training this part using data from more subjects before, but found that the disentanglement gets worse with more identities.

ok, got it, thanks : )