oracle / tribuo

Tribuo - A Java machine learning library
https://tribuo.org
Apache License 2.0
1.27k stars 175 forks

Gaussian Mixture Model capability #359

Open cbrautigam2 opened 7 months ago

cbrautigam2 commented 7 months ago

Hi,

I need to port some Matlab code to Java, and I'm looking at what is out there in Java land that can do Gaussian Mixture Models. Specifically, the code that I have to port makes heavy use of Matlab's gmdistribution (https://www.mathworks.com/help/stats/gmdistribution.html) and fitgmdist (https://www.mathworks.com/help/stats/fitgmdist.html). I see that Tribuo alludes to Gaussian mixtures in the KMeans tutorial: https://tribuo.org/learn/4.3/tutorials/clustering-tribuo-v4.html. So maybe this would suffice? I'm definitely not a mathematician, but I'm trying to see if Tribuo can do GMMs like these Matlab functions. It appears that Matlab supports two covariance types, 'full' and 'diagonal'.

Can you please elaborate on Tribuo's capabilities with regard to GMMs?

Craigacp commented 7 months ago

Tribuo doesn't have an implementation for fitting GMMs. We have a data generator that can sample from them to generate example data, but it can't fit that generator to a dataset. The data generator is roughly analogous to the gmdistribution function, but it's pretty limited in the number of Gaussians it supports. Building a more flexible version which has the functionality of gmdistribution isn't too hard on top of what we provide (e.g. MultivariateNormalDistribution).
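For readers unfamiliar with what a hand-built mixture in the style of gmdistribution involves, here is a minimal sketch in plain Java. It is not Tribuo code: the class name `DiagonalGaussianMixture` and its methods are hypothetical, and it assumes diagonal covariances so that no matrix factorization is needed. Sampling picks a component according to the mixing weights, then draws each dimension from that component's Gaussian.

```java
import java.util.Random;

/** Hypothetical hand-built Gaussian mixture with diagonal covariances (not a Tribuo class). */
final class DiagonalGaussianMixture {
    private final double[] weights;     // mixing weights, one per component, summing to 1
    private final double[][] means;     // [component][dimension]
    private final double[][] variances; // [component][dimension], diagonal of each covariance
    private final Random rng;

    DiagonalGaussianMixture(double[] weights, double[][] means, double[][] variances, long seed) {
        if (weights.length != means.length || weights.length != variances.length) {
            throw new IllegalArgumentException("Need one weight, mean and variance vector per component");
        }
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        if (Math.abs(total - 1.0) > 1e-9) {
            throw new IllegalArgumentException("Mixing weights must sum to 1, found " + total);
        }
        this.weights = weights.clone();
        this.means = means;
        this.variances = variances;
        this.rng = new Random(seed);
    }

    /** Picks a component index with probability equal to its mixing weight. */
    int sampleComponent() {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int k = 0; k < weights.length; k++) {
            cumulative += weights[k];
            if (u < cumulative) {
                return k;
            }
        }
        return weights.length - 1; // guards against floating point rounding
    }

    /** Draws one point: choose a component, then sample each dimension independently. */
    double[] sample() {
        int k = sampleComponent();
        double[] x = new double[means[k].length];
        for (int d = 0; d < x.length; d++) {
            x[d] = means[k][d] + Math.sqrt(variances[k][d]) * rng.nextGaussian();
        }
        return x;
    }
}
```

A full-covariance version would replace the per-dimension draw with a multivariate normal sample, which is where a class such as MultivariateNormalDistribution would come in.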

Implementing a basic EM algorithm to fit a GMM like fitgmdist wouldn't be too hard, as we have the Cholesky factorization which is used in the M step, but making something scalable requires more effort (as our matrix algebra library isn't parallel yet).
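To make the EM idea concrete, the following is a minimal sketch of EM for a Gaussian mixture in plain Java. It is not the Tribuo implementation discussed in this thread. The class `DiagonalGmmEm`, its `fit` signature, the unit-variance initialization, and the variance floor are all assumptions chosen for illustration. It uses diagonal covariances, so the Cholesky factorization mentioned above is not needed; a full-covariance version would need it to evaluate each component's density.

```java
import java.util.Arrays;

/** Hypothetical EM fit of a diagonal-covariance Gaussian mixture (illustration only). */
final class DiagonalGmmEm {
    private static final double VARIANCE_FLOOR = 1e-6; // assumed regularizer to avoid zero variance

    final double[] weights;
    final double[][] means;
    final double[][] variances;

    private DiagonalGmmEm(double[] weights, double[][] means, double[][] variances) {
        this.weights = weights;
        this.means = means;
        this.variances = variances;
    }

    static DiagonalGmmEm fit(double[][] data, double[][] initialMeans, int iterations) {
        int n = data.length, dim = data[0].length, k = initialMeans.length;
        double[] weights = new double[k];
        double[][] means = new double[k][];
        double[][] variances = new double[k][dim];
        for (int j = 0; j < k; j++) {
            weights[j] = 1.0 / k;
            means[j] = initialMeans[j].clone();
            Arrays.fill(variances[j], 1.0);
        }
        double[][] resp = new double[n][k]; // responsibilities: P(component j | point i)
        for (int it = 0; it < iterations; it++) {
            // E step: compute responsibilities in log space, then normalize per point.
            for (int i = 0; i < n; i++) {
                double max = Double.NEGATIVE_INFINITY;
                for (int j = 0; j < k; j++) {
                    resp[i][j] = Math.log(weights[j]) + logDensity(data[i], means[j], variances[j]);
                    max = Math.max(max, resp[i][j]);
                }
                double sum = 0.0;
                for (int j = 0; j < k; j++) {
                    resp[i][j] = Math.exp(resp[i][j] - max);
                    sum += resp[i][j];
                }
                for (int j = 0; j < k; j++) {
                    resp[i][j] /= sum;
                }
            }
            // M step: re-estimate weights, means and variances from the responsibilities.
            for (int j = 0; j < k; j++) {
                double total = 0.0;
                for (int i = 0; i < n; i++) {
                    total += resp[i][j];
                }
                total = Math.max(total, 1e-12); // guards against an empty component
                weights[j] = total / n;
                for (int d = 0; d < dim; d++) {
                    double mean = 0.0;
                    for (int i = 0; i < n; i++) {
                        mean += resp[i][j] * data[i][d];
                    }
                    mean /= total;
                    double var = 0.0;
                    for (int i = 0; i < n; i++) {
                        double diff = data[i][d] - mean;
                        var += resp[i][j] * diff * diff;
                    }
                    means[j][d] = mean;
                    variances[j][d] = var / total + VARIANCE_FLOOR;
                }
            }
        }
        return new DiagonalGmmEm(weights, means, variances);
    }

    /** Log density of x under a Gaussian with the given mean and diagonal variances. */
    static double logDensity(double[] x, double[] mean, double[] var) {
        double logProb = 0.0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - mean[d];
            logProb -= 0.5 * (Math.log(2.0 * Math.PI * var[d]) + diff * diff / var[d]);
        }
        return logProb;
    }
}
```

This sketch runs a fixed number of iterations and takes the starting means from the caller; a practical implementation would also check convergence of the log-likelihood and choose the initial means more carefully.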

Craigacp commented 5 months ago

I've written a GMM implementation which is currently being debugged. Do you need the gmdistribution function as applied to only a distribution fit on data, or do you also want to be able to sample from a mixture distribution that you've created by hand?

cbrautigam2 commented 5 months ago

I would say both, such that you can save off the distributions for later use and can reinflate them to be used again for performing predictions.

-Craig


Craigacp commented 5 months ago

Ok. You'll be able to save the model and reuse it for future predictions, but extracting a distribution object like MultivariateNormalDistribution back out of it will be a little complicated. The dimensions of the samples are based on Tribuo's feature dimensions, which are named rather than indexed, and getting the index is a little more work. I've thought about it a bit more today, and I think I will add a MixtureDistribution class and try to add a distributions interface, but the sampling method will likely be exposed on both MixtureDistribution and GaussianMixtureModel.
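The named-versus-indexed point can be illustrated with a small sketch in plain Java. It does not use Tribuo's feature map classes: `NamedFeatureIndexer` is a hypothetical helper, and assigning indices by sorting the feature names is an assumption made only for this example. It shows the extra step needed to turn values keyed by feature name into the dense vector that a distribution object expects.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper mapping named features to dense vector positions (not Tribuo API). */
final class NamedFeatureIndexer {
    private NamedFeatureIndexer() {}

    /** Assigns each distinct feature name an index, in sorted name order (assumed rule). */
    static Map<String, Integer> buildIndex(Collection<String> featureNames) {
        List<String> sorted = new ArrayList<>(new TreeSet<>(featureNames));
        Map<String, Integer> index = new LinkedHashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            index.put(sorted.get(i), i);
        }
        return index;
    }

    /** Places named values into a dense array; features absent from the input stay at zero. */
    static double[] toDense(Map<String, Double> namedValues, Map<String, Integer> index) {
        double[] dense = new double[index.size()];
        for (Map.Entry<String, Double> entry : namedValues.entrySet()) {
            Integer position = index.get(entry.getKey());
            if (position == null) {
                throw new IllegalArgumentException("Unknown feature: " + entry.getKey());
            }
            dense[position] = entry.getValue();
        }
        return dense;
    }
}
```

Whatever rule assigns the indices, the same index has to be used when reading a component's mean and covariance back out of the model, so that each dimension lines up with the right feature name.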