scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Dirichlet Multinomial Mixture Model #24972

Open Jck-R opened 1 year ago

Jck-R commented 1 year ago

Describe the workflow you want to enable

Is there any plan to implement the Dirichlet Multinomial Mixture Model, fitted with the EM algorithm?

The Dirichlet Multinomial Mixture Model is a popular clustering model in NLP, information retrieval, and bioinformatics.

Describe your proposed solution

I have implemented a usable Dirichlet Multinomial Mixture Model on top of the BaseMixture class in scikit-learn. If there is interest, I can refine the documentation and details and open a pull request against scikit-learn.

https://github.com/Jck-R/pyDIMM
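
For readers unfamiliar with the model, here is a minimal sketch of the quantity an EM fit would evaluate in its E-step: the Dirichlet-multinomial log-probability of a count vector, and the resulting per-component responsibilities. It is not taken from pyDIMM; the function names, the two example components, and their parameters are assumptions chosen for illustration, and it uses only the standard library.

```python
import math


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-multinomial."""
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: log n! - sum_j log x_j!
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Gamma-function ratio obtained by integrating out the multinomial parameters.
    log_ratio = math.lgamma(a) - math.lgamma(n + a)
    log_ratio += sum(
        math.lgamma(x + aj) - math.lgamma(aj) for x, aj in zip(counts, alpha)
    )
    return log_coef + log_ratio


def responsibilities(counts, weights, alphas):
    """E-step for one sample: posterior probability of each mixture component."""
    log_joint = [
        math.log(w) + dirichlet_multinomial_logpmf(counts, alpha)
        for w, alpha in zip(weights, alphas)
    ]
    # Log-sum-exp normalisation for numerical stability.
    m = max(log_joint)
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint))
    return [math.exp(v - log_norm) for v in log_joint]


# Two hypothetical components over three categories (illustrative values only).
weights = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
resp = responsibilities([8, 1, 1], weights, alphas)
```

The M-step, which updates the concentration parameters, has no closed form and is usually handled with a numerical optimiser or fixed-point iteration; it is omitted here.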

Describe alternatives you've considered, if relevant

No response

Additional context

No response

Micky774 commented 1 year ago

Hey there @Jck-R, thanks for opening an issue for this. Regarding inclusion of the model, I believe there's enough credibility to the underlying model to warrant consideration; however, there are a few things we would like to see to make inclusion easier or more likely:

  1. Do you have meaningful use cases in which this model is state of the art, or at least competitive? I'm familiar with some of its use in bioinformatics; in NLP especially, however, it seems to have fallen a bit out of favor (from what I've seen).
  2. Could you provide testing associated with your implementation? Rigorous tests on the underlying mechanisms are especially useful in reviewing and understanding the algorithm.
  3. Could you provide example uses of this algorithm on datasets, with comparisons against existing scikit-learn estimators that demonstrate that the proposed model is, in some sense, superior or preferable?
  4. Could you provide benchmarks regarding memory footprint and computation time to achieve competitive results? Of course the implementation will be optimized over time (e.g. during review for inclusion), but it is an important sanity-check to ensure that it is not wildly expensive.

Please let me know if you have any questions or concerns. Once again, thank you for proposing this and including your sample implementation :)
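
As a rough illustration of the benchmark requested in item 4, the sketch below times a single call and records its peak traced memory using only the standard library. The helper name `benchmark` and the `dummy_fit` workload are hypothetical stand-ins; in practice the callable would be an estimator's `fit` method applied to a real dataset.

```python
import time
import tracemalloc


def benchmark(fit_fn, *args, **kwargs):
    """Return wall-clock seconds and peak traced memory (bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        fit_fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) since tracing started.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return {"seconds": elapsed, "peak_bytes": peak}


def dummy_fit(n):
    # Hypothetical stand-in workload; replace with e.g. model.fit(X).
    return [i * i for i in range(n)]


result = benchmark(dummy_fit, 100_000)
print(result)
```

Tracing allocations slows execution, so timing and memory are better measured in separate runs when the numbers matter, and repeated several times to smooth out noise.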

przemyslslaw commented 11 months ago

I think this would be very useful. The ideal implementation would also include a non-parametric version based on a Dirichlet process, as BayesianGaussianMixture does, and would expose the same interface.
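
For context, a Dirichlet-process prior over mixture weights is commonly approximated with a truncated stick-breaking construction, which is how the scikit-learn documentation describes the Dirichlet-process option of BayesianGaussianMixture. The sketch below only draws weights from such a prior; the function name, truncation level, and concentration value are assumptions for illustration, and it is not a variational inference routine.

```python
import random


def stick_breaking_weights(concentration, n_components, seed=0):
    """Draw mixture weights from a truncated stick-breaking prior."""
    rng = random.Random(seed)
    remaining = 1.0  # length of the stick not yet assigned
    weights = []
    for _ in range(n_components - 1):
        # Break off a Beta(1, concentration) fraction of what remains.
        frac = rng.betavariate(1.0, concentration)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    # Truncation: the last component takes whatever is left.
    weights.append(remaining)
    return weights


weights = stick_breaking_weights(concentration=1.0, n_components=10)
```

A small concentration tends to put most of the mass on the first few components, which is what lets such a model leave unneeded components effectively empty.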