WuyangFF95 closed this issue 2 years ago
Dear @WuyangFF95 ,
Thanks for your questions. I can give you some quick answers here, but for more detail you should look at our preprint.
We use a Bayesian NMF with a Dirichlet prior. The default prior is a flat, uninformative prior.
Our implementation is probabilistic. Using MCMC, we sample regions of parameter space with a frequency proportional to the posterior probability of the parameter values. ("Parameters" here means the values in the exposures and signatures matrices.) We provide a few different probability models for calculating this posterior probability; the default is the multinomial distribution. The preprint has more detail.
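To make the idea of a posterior over the exposures and signatures matrices concrete, here is a minimal sketch of an unnormalised log posterior under a multinomial likelihood with symmetric Dirichlet priors. It is an illustration only, not sigfit's actual code: the function names (`log_dirichlet`, `log_posterior`), the choice to put a Dirichlet prior on both matrices, and the single shared `alpha` are all assumptions made for the example. An MCMC sampler would evaluate a function like this at each proposed set of parameter values.

```python
import math

def log_dirichlet(x, alpha=1.0):
    """Log density of a symmetric Dirichlet(alpha) at probability vector x.

    Assumes every entry of x is strictly positive and the entries sum to 1.
    With alpha = 1 this is the flat prior: the density is constant in x.
    """
    k = len(x)
    log_norm = math.lgamma(k * alpha) - k * math.lgamma(alpha)
    return log_norm + sum((alpha - 1.0) * math.log(xi) for xi in x)

def log_posterior(counts, exposures, signatures, alpha=1.0):
    """Unnormalised log posterior for a toy signature model (hypothetical helper).

    counts:     one row of mutation counts per sample (samples x categories)
    exposures:  one row of signature weights per sample (samples x K)
    signatures: one probability vector per signature (K x categories)
    """
    log_lik = 0.0
    for sample_counts, sample_exposures in zip(counts, exposures):
        for c, n in enumerate(sample_counts):
            # Expected probability of category c: exposure-weighted mix of signatures.
            p = sum(e * sig[c] for e, sig in zip(sample_exposures, signatures))
            # Multinomial log-likelihood, dropping the constant multinomial coefficient.
            log_lik += n * math.log(p)
    log_prior = sum(log_dirichlet(row, alpha) for row in signatures)
    log_prior += sum(log_dirichlet(row, alpha) for row in exposures)
    return log_lik + log_prior

# Toy data: two signatures over three mutation categories, one sample.
signatures = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
counts = [[14, 4, 2]]
good = log_posterior(counts, [[0.9, 0.1]], signatures)
bad = log_posterior(counts, [[0.1, 0.9]], signatures)
```

In this toy case the sample is dominated by the first category, so the exposure vector that favours the first signature scores a higher log posterior than the one that favours the second.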
We don't use either bootstrapping or clustering. We use MCMC to draw lots of samples of plausible parameter values, which we then average over.
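As a rough sketch of what "average over" means here, the snippet below takes a list of sampled matrices and returns their element-wise mean. The helper name `posterior_mean` and the list-of-lists layout are assumptions for illustration; sigfit's own summary code may differ.

```python
def posterior_mean(draws):
    """Element-wise mean over MCMC draws (hypothetical helper).

    draws: a non-empty list of equally shaped matrices, each a list of rows.
    """
    n = len(draws)
    rows, cols = len(draws[0]), len(draws[0][0])
    return [
        [sum(d[r][c] for d in draws) / n for c in range(cols)]
        for r in range(rows)
    ]

# Two made-up draws of a 1 x 2 exposures matrix.
draws = [[[0.2, 0.8]], [[0.4, 0.6]]]
mean_exposures = posterior_mean(draws)
```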
We use a heuristic to estimate K. Basically, increasing K increases the model's goodness of fit to the data, and we stop increasing K when the gain in goodness of fit becomes small (for some definition of small). There are more principled Bayesian techniques for model selection, like bridge sampling or LOOIC, but we don't currently use them.
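One simple way to express a stopping rule of this kind is sketched below. It is not the exact rule sigfit uses: the function name `choose_k`, the `min_gain` threshold and the example scores are all made up for illustration, and "small" is defined here as an absolute improvement below that threshold.

```python
def choose_k(goodness_of_fit, min_gain=0.01):
    """Pick K by stopping once the improvement in fit drops below min_gain.

    goodness_of_fit: dict mapping each candidate K to a score (higher is better).
    min_gain: hypothetical threshold defining a "small" improvement.
    """
    ks = sorted(goodness_of_fit)
    best = ks[0]
    for prev, cur in zip(ks, ks[1:]):
        gain = goodness_of_fit[cur] - goodness_of_fit[prev]
        if gain < min_gain:
            break  # the extra signature no longer buys a meaningful improvement
        best = cur
    return best

# Made-up goodness-of-fit scores for K = 1..5.
scores = {1: 0.80, 2: 0.90, 3: 0.96, 4: 0.965, 5: 0.966}
chosen = choose_k(scores)
```

With these made-up scores the gains are 0.10, 0.06 and then 0.005, so the rule stops at K = 3.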
Hope this helps. I'd be interested to know which packages make numbers 1 and 2 on your list!
Kevin
Thank you @kgori!
The two packages that performed better than yours are:
Nicola D. Roberts's hdp package: https://github.com/nicolaroberts/hdp This package uses a hierarchical version of Latent Dirichlet Allocation (LDA) that involves multiple layers of Dirichlet processes. Each layer can represent a tumor type (e.g. lung cancer, breast cancer), an individual tumor, or a mutation from a specific tumor. It has several advantages:
Ludmil B. Alexandrov's SigProExtractor: https://pypi.org/project/sigproextractor/ This is the Python implementation of Ludmil Alexandrov's WTSI framework. Because MATLAB is not free, the Alexandrov Lab is mainly working on the Python version. Citation: https://www.nature.com/articles/s41586-020-1943-3
Dear @kgori ,
I am a PhD student in bioinformatics, and I benchmarked your software package sigfit against other software packages using a dataset I generated in-house. I found that your package was the third best out of 11, and I need to explain why it performed well in the discussion section of my paper.
Therefore, I have four questions related to your software package:
I would be very grateful if you could answer them. Thanks!