Deprecation of GMM functions in scikit-learn

castillohair commented 7 years ago

Apparently two of the functions that we use from scikit-learn will be deprecated soon. The following appears when running FlowCal 1.1.1 in linux with scikit-learn 0.18.1:

/home/sebastian/src/FlowCal-pypi/venv/local/lib/python2.7/site-packages/sklearn/utils/deprecation.py:52: DeprecationWarning: Class GMM is deprecated; The class GMM is deprecated in 0.18 and will be  removed in 0.20. Use class GaussianMixture instead.
  warnings.warn(msg, category=DeprecationWarning)
/home/sebastian/src/FlowCal-pypi/venv/local/lib/python2.7/site-packages/sklearn/utils/deprecation.py:70: DeprecationWarning: Function log_multivariate_normal_density is deprecated; The function log_multivariate_normal_density is deprecated in 0.18 and will be removed in 0.20.
  warnings.warn(msg, category=DeprecationWarning)

castillohair commented 7 years ago

The whole GMM class is going to be dropped in favor of a new GaussianMixture class. Hopefully it is not too hard to replace. http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

castillohair commented 7 years ago

Initial attempt was made in 8a6ff27d883e8dc9a39b64771181d31d7ac59fb3. Note that there is no one-to-one mapping between all the parameters in GMM and GaussianMixture, as far as I can tell. Also, it might be worth revising the initialization procedure since it may be completely ignored in the new class, which apparently forces the use of kmeans or random initialization. Also, it worries me that I haven't found a way to force the class to not update the weights, which is what we'be been doing so far.

Accepting a solution to this issue will require EXTENSIVE testing to make sure that the performance of the new clustering method is at least as good as the old one. The new method seems to work well on the example files, but this is only one data point.

Also, documentation will need to be updated in several places.

castillohair commented 7 years ago

As of f81142733e73abb3707595c27c6c15bf2fa2bf72, the parameter initialization before calling GaussianMixture has been updated. One complication was that the new GaussianMixture class uses "precisions" instead of covariances, which are the inverse of the covariance matrices. So I have to take matrix inverses (which is always fun). The other complication was that we can no longer fix the weights. Previously, the weights were fixed so that each gaussian population had the same prior probability, and the GMM class had an option to not let these be updated by the optimization algorithm. with GaussianMixture, this is no longer possible. Nevertheless, it seems that with a good choice of initial parameters, the populations are identified successfully.

For testing, I used the following files. These include a bunch of bead files that I got from several people from the Tabor lab (which is the same file set that I used to calibrate GMM's parameters). Also, I included a beads file from the SEA's flow cytometer, and a file that I got from the Smolke's lab. This gives a total of 58 beads files.

There is also an ipython notebook. Inside, I compare three methods: the old one based on GMM, a new method based on GaussianMixture but without our previous initialization method (which I refer to as "New 1"), and the same plus our initialization method ("New 2"). As a reminder, the initialization method assigns events to a cluster based on their distance to the origin, and calculates the mean and standard deviation of each of these clusters as the initial estimates given to the Gaussian Mixture method.

The most important number that I look at is the matching between labels for each event, for each one of the new methods vs. the old one. If the result is the exact same for two clustering methods, the labels will be 100% the same for all events. Note that there is some degeneracy when two clusters are too close to one another, as events in the middle are randomly assigned to one cluster or the other.

The results in general are pretty bad for method "New 1", but pretty good for "New 2". The following shows the results table sorted by label matching for method "New 2" (worst to best):

Every file that is not shown has a label matching of 99% or better. If we look at the top worst files, we see that their low scores are due to how close some of the clusters are to one another, which results in the kind of degeneracy described above. The very "worst" is this one:

clusteringb0028

You can see that the clusters in methods "Old" and "New 2" are visually indistinguishable. The other bad performers have a similar situation.

At this point, I'm as confident in the new method as I was of the old method at the time we released it to the lab. I'll pull request now, but I'll be happy to throw more files into this testing pipeline.

JS3xton commented 7 years ago

Wait, so for the "worst" one (B0028?), with New 2 method, where are the ~25% of events that got labeled differently? Are they all crammed up in the top right corner and I just can't see them? (OK, upon closer inspection, I do see some small color differences in that top corner relative to the old method). And we're saying that we don't really care about that? Does the peak selection algorithm automatically throw those peaks out anyways? Would be kinda interesting to see if the downstream standard curve parameters change at all comparing the old method to the New 2 method.

But otherwise, I think I'm probably OK with this. I think it's probably a lot more useful to be compliant with current versions of sklearn than to be able to perfectly reproduce the old sklearn.mixture.GMM behavior.

castillohair commented 7 years ago

where are the ~25% of events that got labeled differently? Are they all crammed up in the top right corner and I just can't see them?

Yes, events here are from 3 beads subpopulations. Because it is impossible to tell which subpopulation each belongs to, they are assigned randomly based on probabilities (line 205 of code in pull request). Because this is random, you would expect a high degree of mismatch between different runs even with the same algorithm.

Does the peak selection algorithm automatically throw those peaks out anyways?

Yes, it should.

I think it's probably a lot more useful to be compliant with current versions of sklearn than to be able to perfectly reproduce the old sklearn.mixture.GMM behavior.

Agree.

JS3xton commented 7 years ago

you would expect a high degree of mismatch between different runs even with the same algorithm.

Hmmm. I'm admittedly not a fan of this "feature", but I acknowledge that the previous implementation did it this way too, so concerns about that are orthogonal to this issue.

castillohair commented 6 years ago

Solution merged in #266.

taborlab / FlowCal

Deprecation of GMM functions in scikit-learn #230