mistycheney opened this issue 1 year ago
Hi, thanks for your interest in our work.
We optimize an upper bound by taking the sum of the KL divergences between the individual experts and the prior distribution. For the sampling, we subsample the latent samples per batch according to the mixture weights. If we assume equal weights for all experts, this boils down to reconstructing (batch size / #experts) samples from every expert. I hope this clarifies things for you. Otherwise, feel free to reach out again.
Best regards, Thomas
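For concreteness, here is a minimal sketch of that bound-plus-subsampling scheme, assuming Gaussian experts with equal mixture weights. The function names and tensor shapes below are illustrative assumptions, not the actual moe_fusion / mixture_component_selection implementation. The bound follows from convexity of the KL divergence: KL(sum_i w_i q_i || p) <= sum_i w_i KL(q_i || p), so summing the per-expert KL terms upper-bounds the mixture KL.

```python
# Illustrative sketch (PyTorch); function names and shapes here are assumptions,
# not the repository's actual moe_fusion / mixture_component_selection code.
import torch


def kl_upper_bound(mus, logvars, weights=None):
    """Upper bound on KL(mixture-of-experts || N(0, I)):
    sum_i w_i * KL(q_i || prior), using the closed-form Gaussian KL per expert."""
    num_experts = len(mus)
    if weights is None:
        weights = [1.0 / num_experts] * num_experts  # equal mixture weights
    bound = 0.0
    for w, mu, logvar in zip(weights, mus, logvars):
        # KL(N(mu, diag(exp(logvar))) || N(0, I)) in closed form
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        bound = bound + w * kl.mean()
    return bound


def subsample_mixture(mus, logvars):
    """Equal-weight mixture sampling: each expert reparameterizes its own
    (batch_size // num_experts) share of the batch, as described above."""
    num_experts = len(mus)
    batch_size = mus[0].shape[0]
    share = batch_size // num_experts
    chunks = []
    for k, (mu, logvar) in enumerate(zip(mus, logvars)):
        lo = k * share
        hi = batch_size if k == num_experts - 1 else (k + 1) * share
        std = torch.exp(0.5 * logvar[lo:hi])
        chunks.append(mu[lo:hi] + std * torch.randn_like(std))
    return torch.cat(chunks, dim=0)  # [batch_size, latent_dim]
```

With equal weights, the reconstruction term would then be computed on the concatenated samples, so each element of the batch is decoded from the latents of exactly one expert.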
Thanks for producing this interesting paper. I'm confused about the MoE part and was wondering if you could clarify it.
In the paper, Eq. (6) is the objective to be optimized, which involves KL(sum_i expert_i || prior). Suppose each expert_i is Gaussian: how do you compute the KL divergence between a mixture of Gaussians and the prior? I don't think this has a closed form. I tried to find the answer in the code and ended up at moe_fusion and mixture_component_selection, which seem to be performing some sort of sampling. Is this the same as the importance sampling in MMVAE (Shi et al., 2019)?
Any clarification would be much appreciated. Thank you.
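For comparison, a generic Monte Carlo estimator of the mixture-vs-prior KL term would look like the sketch below. This is illustrative only and, per the reply above, not what the repository optimizes; the sum-of-KLs upper bound sidesteps this term entirely. The helper name and tensor shapes are assumptions.

```python
# Illustrative only -- a generic Monte Carlo estimator of KL(mixture || prior),
# not the repository's moe_fusion / mixture_component_selection code.
import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal


def mc_kl_mixture_vs_prior(mus, logvars, num_samples=128):
    """Estimate KL(sum_i (1/M) * N(mu_i, diag(exp(logvar_i))) || N(0, I)) by sampling.
    mus, logvars: tensors of shape [num_experts, batch_size, latent_dim]."""
    num_experts, batch_size, latent_dim = mus.shape
    stds = torch.exp(0.5 * logvars)
    # Equal-weight categorical over experts, one mixture per batch element
    mix = Categorical(probs=torch.full((batch_size, num_experts), 1.0 / num_experts))
    comp = Independent(Normal(mus.permute(1, 0, 2), stds.permute(1, 0, 2)), 1)
    mixture = MixtureSameFamily(mix, comp)
    prior = Independent(Normal(torch.zeros(batch_size, latent_dim),
                               torch.ones(batch_size, latent_dim)), 1)
    z = mixture.sample((num_samples,))  # [num_samples, batch_size, latent_dim]
    # Average log-density ratio over samples: per-example KL estimate, [batch_size]
    return (mixture.log_prob(z) - prior.log_prob(z)).mean(0)
```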