Hello, and first of all thank you for this great work. I have a question about the action decoder: here you use a mixture of 10 logistic distributions, and linear layers on top of the RNN output predict the means, logit_scales, and logit_probs of the components. I am wondering how exactly these distributions are trained.
I understand that the best-fitting component is the one with the highest weight, but what does that mean for the other components in terms of mean, logit_scale, and logit_probs? What effect does backpropagation have here?
Naively, shouldn't all 10 distributions converge to the same optimum?
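For concreteness, here is how I understand the training objective: the loss is the negative log-likelihood of the ground-truth action under the full mixture (via logsumexp over components), so backprop weights each component's gradient by its posterior responsibility for the sample. This is a minimal NumPy sketch of that idea; the function and variable names (`mu`, `log_s`, `logit_pi`) are mine for illustration, not taken from the repo:

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log-sum-exp."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def logistic_logpdf(x, mu, log_s):
    """Elementwise log-density of logistic distributions with means mu
    and log-scales log_s, evaluated at scalar x."""
    z = (x - mu) / np.exp(log_s)
    return -z - log_s - 2.0 * np.log1p(np.exp(-z))

def mixture_nll(x, mu, log_s, logit_pi):
    """Negative log-likelihood of x under the full mixture."""
    log_pi = logit_pi - logsumexp(logit_pi)           # log-softmax of the weights
    comp = log_pi + logistic_logpdf(x, mu, log_s)     # per-component joint log-prob
    return -logsumexp(comp)                           # single scalar loss over all components

def responsibilities(x, mu, log_s, logit_pi):
    """Posterior weight of each component for this x; these factors scale
    each component's gradient under the mixture NLL."""
    log_pi = logit_pi - logsumexp(logit_pi)
    comp = log_pi + logistic_logpdf(x, mu, log_s)
    return np.exp(comp - logsumexp(comp))

# Toy check: with a target near component 0, component 0 takes nearly all
# of the responsibility, so its mean receives a much larger gradient than
# component 1's (estimated here by finite differences).
mu, log_s, logit_pi = np.array([0.0, 5.0]), np.zeros(2), np.zeros(2)
x, eps = 0.1, 1e-6
r = responsibilities(x, mu, log_s, logit_pi)
grads = []
for k in range(2):
    mu_p = mu.copy()
    mu_p[k] += eps
    grads.append((mixture_nll(x, mu_p, log_s, logit_pi)
                  - mixture_nll(x, mu, log_s, logit_pi)) / eps)
print("responsibilities:", r)
print("dNLL/dmu_k:", grads)
```

If this is right, it would answer my own naive question in part: components do not all collapse onto one mode, because a component with low responsibility for a given sample receives only a small gradient from it, which lets different components specialize on different modes of the action distribution. Please correct me if the actual loss differs.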
Thank you for your time and help.