Hello,
I noticed that in your implementation of the PANTHER model, the forward function calls the map_em method, which performs both the Expectation and Maximization (EM) steps of the algorithm. Specifically, the forward function estimates the parameters pi, mu, and Sigma, and then stacks them. However, traditionally in GMM, the workflow separates the fit (EM algorithm on training data) and predict (using the learned parameters on test data) steps.
Since the map_em method performs both the Expectation and Maximization steps, and the Maximization step is typically associated with fitting the model, it seems unusual to include it in the forward pass. Could you explain why PANTHER doesn't follow the traditional GMM structure with separate fit and predict steps, and why the Maximization step runs during the forward pass? How does this design choice affect the model's training and inference processes? Thanks.
Traditional EM, introduced in the seminal 1977 paper by Dempster, Laird, and Rubin and developed in many subsequent works, has primarily been used for maximum likelihood estimation. In this setting, the goal is to find the set of parameters that best explains the training data (the criterion being maximizing the likelihood), so it is natural to run both the E-step and the M-step on the training data in order to fit the best model.
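As a concrete illustration (a minimal NumPy sketch, not PANTHER's actual implementation), here is EM for a diagonal-covariance GMM fit by maximum likelihood: the E-step computes responsibilities under the current parameters, and the M-step re-estimates pi, mu, and the variances from those responsibilities.

```python
import numpy as np

def em_gmm(X, K, n_iter=50, eps=1e-6):
    """Minimal EM for a diagonal-covariance GMM, fit by maximum likelihood."""
    N, D = X.shape
    rng = np.random.default_rng(0)
    pi = np.full(K, 1.0 / K)                      # mixture weights
    mu = X[rng.choice(N, K, replace=False)]       # init means from data points
    var = np.tile(X.var(axis=0) + eps, (K, 1))    # diagonal variances

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, var_k)
        log_p = (np.log(pi)[None, :]
                 - 0.5 * np.sum(np.log(2 * np.pi * var)[None, :, :]
                                + (X[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :],
                                axis=-1))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: re-estimate pi, mu, var from the responsibilities
        Nk = gamma.sum(axis=0) + eps
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + eps
    return pi, mu, var
```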
In GMM, and more broadly in clustering, the goal is first to identify an optimal set of GMM parameters or cluster centroids on the given training dataset. If a new test point comes along, we can then "predict" its cluster or mixture assignment using the fitted model. In the pathology/PANTHER setting, this would be akin to having a new test patch (or patches) in the WSI whose mixture/cluster assignment is yet unassigned (this is the "traditional" fit/predict workflow you describe).
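This traditional split is exactly what, for example, scikit-learn's GaussianMixture exposes (the data here is random, just to show the API): fit runs EM on the training data only, while predict performs only the E-step assignment with the learned parameters frozen.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 16))   # e.g., training patch embeddings
X_test = rng.normal(size=(100, 16))    # e.g., new test patch embeddings

gmm = GaussianMixture(n_components=8).fit(X_train)  # fit: EM on training data
labels = gmm.predict(X_test)                        # predict: E-step only, parameters frozen
```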
However, this is not how PANTHER is intended to work. In our setting, the new "test" data is a new WSI, and for each WSI we fit an entirely new GMM. Simply think of the GMM as a data-compression algorithm for each corresponding WSI: the fitted parameters become that slide's representation.
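To make this concrete, here is a hedged sketch of the idea (the module name PerWSIGMM, the iteration count, and the diagonal-covariance choice are my illustration, not the actual map_em code): each forward call runs EM from scratch on one WSI's patch embeddings and returns the stacked parameters pi, mu, Sigma as that slide's fixed-length representation.

```python
import torch
import torch.nn as nn

class PerWSIGMM(nn.Module):
    """Sketch (not the actual PANTHER code): each forward call fits a fresh
    GMM to one WSI's patch embeddings and returns the stacked parameters."""

    def __init__(self, n_components: int, dim: int, n_em_iter: int = 10):
        super().__init__()
        self.K, self.D, self.T = n_components, dim, n_em_iter

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (N, D) embeddings of all patches from a single WSI
        N = patches.shape[0]
        pi = patches.new_full((self.K,), 1.0 / self.K)
        mu = patches[torch.randperm(N)[: self.K]].clone()
        var = patches.var(dim=0, keepdim=True).expand(self.K, self.D).clone() + 1e-6

        for _ in range(self.T):
            # E-step: responsibilities of each patch under each component
            log_p = (pi.log()[None, :]
                     - 0.5 * ((patches[:, None, :] - mu[None]) ** 2 / var[None]
                              + (2 * torch.pi * var[None]).log()).sum(-1))
            gamma = log_p.softmax(dim=1)            # (N, K)
            # M-step: re-estimate parameters for THIS slide only
            Nk = gamma.sum(0) + 1e-6
            pi = Nk / N
            mu = (gamma.T @ patches) / Nk[:, None]
            var = (gamma.T @ patches.pow(2)) / Nk[:, None] - mu.pow(2) + 1e-6

        # "Compress" the WSI: stack the fitted parameters into one vector
        return torch.cat([pi, mu.flatten(), var.flatten()])
```

Because every slide gets its own freshly fitted mixture, there is no fit/predict split to preserve: the M-step must run inside forward, since the parameters it produces are the output, not a reusable model.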
For additional references on EM within the neural-network setting, you can refer to the following reference ref, which motivated PANTHER.