Closed WuyangFF95 closed 5 years ago
Hi @WuyangFF95 ,
Unfortunately no, choosing the best K is difficult and a bit of a dark art (not just for topic models).
One approach you can use is to plot the average per-word log-likelihood values across a range of K and use the elbow method. The fit! functions return the log-likelihood. There are other options for choosing K that you could look into as well, and you could even look at a collection of metrics.
Hi @funnell , Can you provide an easy-to-use function to plot the average per-word (per-mutation) log-likelihood? And can you explain a little bit about "the elbow method"? Thanks!
Hi @WuyangFF95,
Say you have your log-likelihood values formatted like the attached file. mmctm_snv_sv_ll.tsv.zip
You could use the following code to plot them using VegaLite
using CSV
using DataFrames
using VegaLite
ll = CSV.read("mmctm_snv_sv_ll.tsv", delim='\t');
ll |> @vlplot(:line, x=:K, y={:Log_likelihood, scale={zero=false}}, row=:Type,
resolve={scale={y=:independent}})
Make sure you have installed the CSV, DataFrames, and VegaLite packages.
The idea is to choose a number of signatures where adding more signatures doesn't change the log-likelihood very much (or whatever other metric you are using). Unfortunately the best value is still often not very clear! There are lots of other ways to try and find the optimal K, but this is a starting point.
Hi @funnell
Now the question is, to calculate the log-likelihood, I need to know the probability function used in your package. But since I don't know the function, I cannot calculate it.
Can you write the probability function in close form for CTM, MMCTM, and LDA? And also can you tell me the default values of those hyperparameters? You can choose not to disclose these functions if you want to keep them as a secret.
Also, can you add a Julia script into your package to calculate the log-likelihood automatically? Because I'm new in Julia (I am working with R and Python), and I think that most of the users (who are biologists probably with poor programming or statistics background) are not good at using scripts to calculate log-likelihood manually.
Another question is, if you don't add any regularization term (a.k.a., penalty term), using maximum likelihood selection will always select highest K (signature number).
I guess you can add a script which selects best number of K using a regularized log-likelihood function. And also you can tell me the regularized log-likelihood function you used.
Hi @WuyangFF95, there is a function, calculate_loglikelihoods(model::MMCTM)
. So you can try and run MultiModalMuSig.calculate_loglikelihoods(mymodel) to get the log likelihoods. It will output an array with a log likelihood value for each modality. I've written a bit about how the likelihood is calculated in the paper: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006799#sec010
except that function above will compute the likelihood for the counts used for training that are contained in the model object. Does that make sense?
There is another function calculate_loglikelihoods(X::Vector{Vector{Matrix{Int}}}, model::MMCTM)
where you can input different counts and calculate the log likelihood given these counts and a fitted model.
Sorry @funnell , I was away for a holiday, now I'm back in the office and I'll tell you whether it can work in this week.
Hi @funnell , is calculate_loglikelihoods() in MultiModalMuSig package? I cannot find it inside of the package.
As you can see in this screenshot, I failed to find a function named "calculate_loglikelihoods()" inside of the package.
Hi @WuyangFF95, sorry for the confusion!
Try this: MultiModalMuSig.calculate_loglikelihoods(model)
(i.e. use .
not ::
)
It works! Thank you @funnell!
Dear Shah Compbio Lab Members,
I want to run MultiModalMuSig with a matrix file recording mutation counts of certain samples, but I don't know the number of signatures active in these samples.
I have two questions:
Thanks!