kgori / sigfit

Flexible Bayesian inference of mutational signatures
GNU General Public License v3.0

Questions related to the model design and inference method in sigfit-NMF #52

Closed WuyangFF95 closed 2 years ago

WuyangFF95 commented 4 years ago

Dear @kgori ,

I am a PhD student in Bioinformatics, and I have benchmarked your software package sigfit against other packages using an in-house generated dataset. sigfit ranked third out of 11 packages, and I need to explain in the discussion section of my paper why it performed well.

Therefore, I have four questions related to your software package:

  1. Regarding the statistical model: does sigfit-NMF use frequentist NMF (similar to Ludmil Alexandrov's SigProExtractor, a.k.a. the WTSI framework, which has no prior), or Bayesian NMF with a Dirichlet prior?
  2. NMF has many implementations. Which implementation did you use in the sigfit-NMF design, and what kind of divergence (Bregman, Frobenius, etc.) do you minimize?
  3. Does the inference algorithm of sigfit-NMF involve bootstrapping? Does it involve clustering?
  4. How does your package select the best number of signatures (K)? Using BIC?

I would very much appreciate it if you could answer them. Thanks!

kgori commented 4 years ago

Dear @WuyangFF95 ,

Thanks for your questions. I can give you some quick answers here, but for more detail you should look at our preprint.

  1. We use a Bayesian NMF with a Dirichlet prior. The default prior is a flat, uninformative prior.

  2. Our implementation is probabilistic. Using MCMC, we sample regions of parameter space with a frequency proportional to the posterior probability of the parameter values (the parameters here being the entries of the exposures and signatures matrices). We provide a few different probability models for calculating this posterior probability; the default is the multinomial distribution. For more detail you should look at the preprint.
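To make the multinomial model concrete, here is a toy sketch (made-up dimensions and simulated counts, not sigfit's actual code): given a signatures matrix and an exposures matrix, each sample's mutation counts are treated as a multinomial draw over the reconstructed category probabilities, and the resulting log-likelihood is the quantity whose posterior MCMC would explore.

```python
import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 2 signatures, 6 mutation categories, 3 samples
K, C, N = 2, 6, 3

# Signatures: each row is a probability distribution over mutation categories
signatures = rng.dirichlet(np.ones(C), size=K)   # shape (K, C)
# Exposures: each row is a sample's mixture weights over signatures
exposures = rng.dirichlet(np.ones(K), size=N)    # shape (N, K)

# Reconstructed per-sample category probabilities
probs = exposures @ signatures                   # shape (N, C)

# Simulated mutation counts for each sample (100 mutations each)
counts = np.stack([rng.multinomial(100, p) for p in probs])

# Multinomial log-likelihood of the counts given the parameters
loglik = sum(multinomial.logpmf(counts[i], 100, probs[i]) for i in range(N))
print(np.isfinite(loglik) and loglik < 0)  # True
```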

  3. We don't use either bootstrapping or clustering. We use MCMC to draw lots of samples of plausible parameter values, which we then average over.
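The "average over samples" step can be sketched generically (simulated draws stand in for real MCMC output here): given many posterior draws of an exposures matrix, the point estimate is simply their element-wise mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are 500 MCMC draws of a 3-sample x 2-signature
# exposures matrix (simulated Dirichlet draws for illustration)
draws = rng.dirichlet(np.ones(2) * 5, size=(500, 3))   # shape (500, 3, 2)

# Posterior mean: average over the draws axis
posterior_mean = draws.mean(axis=0)                    # shape (3, 2)

# Each sample's averaged exposures still sum to 1
print(np.allclose(posterior_mean.sum(axis=1), 1.0))    # True
```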

  4. We use a heuristic to estimate K. Basically, increasing K increases the model's goodness of fit to the data, and we stop increasing K when the gain in goodness of fit becomes small (for some definition of small). There are more principled Bayesian techniques for model selection, like bridge sampling or LOOIC, but we don't currently use them.
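A minimal sketch of this kind of elbow heuristic, with made-up goodness-of-fit values and an arbitrary threshold standing in for "small":

```python
import numpy as np

# Hypothetical goodness-of-fit values for K = 1..6 (e.g. reconstruction
# accuracy); real values would come from refitting the model at each K
fit = np.array([0.60, 0.80, 0.92, 0.95, 0.955, 0.957])

# Improvement gained by each increment of K
gains = np.diff(fit)   # gains[i] = improvement going from K=i+1 to K=i+2

# Stop at the first K whose next increment buys less than the threshold
# (0.02 here, an arbitrary definition of "small")
threshold = 0.02
small = np.flatnonzero(gains < threshold)
best_k = int(small[0]) + 1 if small.size else len(fit)
print(best_k)  # 4: going from K=4 to K=5 improves fit by only 0.005
```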

Hope this helps. I'd be interested to know which packages make numbers 1 and 2 on your list!

Kevin

WuyangFF95 commented 4 years ago

Thank you @kgori!

Two packages which have better performance than yours are:

Nicola D. Roberts's hdp package: https://github.com/nicolaroberts/hdp This package uses a hierarchical version of Latent Dirichlet Allocation (LDA), involving multiple layers of Dirichlet processes. Each layer can represent a tumor type (e.g. lung cancer, breast cancer), an individual tumor, or the mutations from a specific tumor. It has several advantages:

  1. The HDP model can ensure that tumors of different types have different exposures to different sets of mutational signatures. It can therefore extract clean signatures from data containing multiple tumor types, or from data where some signatures are strongly positively correlated. NMF-based packages, by contrast, will sometimes extract blends of signatures from such datasets.
  2. It uses MCMC to infer the number of mutational signatures (K) active in the tumors. You don't need to provide a range of K in advance; you only need an initial guess. Given a sufficient number of iterations, the sampler updates the initial K.guess and approaches the ground-truth K. I found that this package outperformed all NMF-based packages in all of my test cases. Its caveats are a long running time (it may take a day) and large memory consumption.
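To illustrate how a Dirichlet process lets the number of components emerge from the data rather than being fixed in advance, here is a stick-breaking sketch (a generic textbook construction, not hdp's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(alpha, tol=1e-4):
    """Truncated stick-breaking construction of Dirichlet-process weights.

    The number of components with non-negligible weight is not fixed in
    advance; it emerges from the concentration parameter alpha.
    """
    weights = []
    remaining = 1.0
    while remaining > tol:
        b = rng.beta(1.0, alpha)          # break off a fraction of the stick
        weights.append(remaining * b)
        remaining *= 1.0 - b              # what is left to break further
    return np.array(weights)

w = stick_breaking(alpha=2.0)
print(abs(w.sum() - 1.0) < 1e-3)  # True: weights sum to ~1 (truncated at tol)
```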

Ludmil B. Alexandrov's SigProExtractor: https://pypi.org/project/sigproextractor/ This is the Python implementation of Ludmil Alexandrov's WTSI framework. Because MATLAB is not free, the Alexandrov Lab is mainly working on the Python version. Citation: https://www.nature.com/articles/s41586-020-1943-3