Can MultiModalMuSig automatically determine number of signatures in a count matrix?

shahcompbio / MultiModalMuSig.jl

A Julia package for extracting mutation signatures using topic models

Other

18 stars 4 forks source link

Can MultiModalMuSig automatically determine number of signatures in a count matrix? #2

Closed WuyangFF95 closed 5 years ago

WuyangFF95 commented 5 years ago

Dear Shah Compbio Lab Members,

I want to run MultiModalMuSig with a matrix file recording mutation counts of certain samples, but I don't know the number of signatures active in these samples.

I have two questions:

Can MultiModalMuSig automatically determine number of signatures (K) in a count matrix? I mean, do you have a function which can choose the best K for me?
If not, I have to rerun MultiModalMuSig for multiple Ks. Do you have a criterion or suggestion to choose the best K among a range of Ks?

Thanks!

funnell commented 5 years ago

Hi @WuyangFF95 ,

Unfortunately no, choosing the best K is difficult and a bit of a dark art (not just for topic models).
One approach you can use is to plot the average per-word log-likelihood values across a range of K and use the elbow method. The fit! functions return the log-likelihood. There are other options for choosing K that you could look into as well, and you could even look at a collection of metrics.

WuyangFF95 commented 5 years ago

Hi @funnell , Can you provide an easy-to-use function to plot the average per-word (per-mutation) log-likelihood? And can you explain a little bit about "the elbow method"? Thanks!

funnell commented 5 years ago

Hi @WuyangFF95,

Say you have your log-likelihood values formatted like the attached file. mmctm_snv_sv_ll.tsv.zip

You could use the following code to plot them using VegaLite

using CSV
using DataFrames
using VegaLite
ll = CSV.read("mmctm_snv_sv_ll.tsv", delim='\t');
ll |> @vlplot(:line, x=:K, y={:Log_likelihood, scale={zero=false}}, row=:Type,
    resolve={scale={y=:independent}})

Make sure you have installed the CSV, DataFrames, and VegaLite packages.

The idea is to choose a number of signatures where adding more signatures doesn't change the log-likelihood very much (or whatever other metric you are using). Unfortunately the best value is still often not very clear! There are lots of other ways to try and find the optimal K, but this is a starting point.

WuyangFF95 commented 5 years ago

Hi @funnell

Now the question is, to calculate the log-likelihood, I need to know the probability function used in your package. But since I don't know the function, I cannot calculate it.

Can you write the probability function in close form for CTM, MMCTM, and LDA? And also can you tell me the default values of those hyperparameters? You can choose not to disclose these functions if you want to keep them as a secret.

Also, can you add a Julia script into your package to calculate the log-likelihood automatically? Because I'm new in Julia (I am working with R and Python), and I think that most of the users (who are biologists probably with poor programming or statistics background) are not good at using scripts to calculate log-likelihood manually.

WuyangFF95 commented 5 years ago

Another question is, if you don't add any regularization term (a.k.a., penalty term), using maximum likelihood selection will always select highest K (signature number).

I guess you can add a script which selects best number of K using a regularized log-likelihood function. And also you can tell me the regularized log-likelihood function you used.

funnell commented 5 years ago

Hi @WuyangFF95, there is a function, calculate_loglikelihoods(model::MMCTM). So you can try and run MultiModalMuSig.calculate_loglikelihoods(mymodel) to get the log likelihoods. It will output an array with a log likelihood value for each modality. I've written a bit about how the likelihood is calculated in the paper: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006799#sec010 except that function above will compute the likelihood for the counts used for training that are contained in the model object. Does that make sense?

funnell commented 5 years ago

There is another function calculate_loglikelihoods(X::Vector{Vector{Matrix{Int}}}, model::MMCTM) where you can input different counts and calculate the log likelihood given these counts and a fitted model.

WuyangFF95 commented 5 years ago

Sorry @funnell , I was away for a holiday, now I'm back in the office and I'll tell you whether it can work in this week.

WuyangFF95 commented 5 years ago

Hi @funnell , is calculate_loglikelihoods() in MultiModalMuSig package? I cannot find it inside of the package.

WuyangFF95 commented 5 years ago

No-calculate-log-likelihood

As you can see in this screenshot, I failed to find a function named "calculate_loglikelihoods()" inside of the package.

funnell commented 5 years ago

Hi @WuyangFF95, sorry for the confusion! Try this: MultiModalMuSig.calculate_loglikelihoods(model) (i.e. use . not ::)

WuyangFF95 commented 5 years ago

It works! Thank you @funnell! log-likelihood