mskcc / facets

Algorithm to implement Fraction and Copy number Estimate from Tumor/normal Sequencing.
135 stars 65 forks source link

Log likelihood reported may be negative #130

Open dvenet opened 5 years ago

dvenet commented 5 years ago

The log likelihood reported can sometimes be largely negative. My data are quite sparse TGS data, so I use a low min.het and cval. Facets complains about data quality, however the outputs are quite often OK and I would like to use the loglik to sort out the best one.

For instance, I run facets twice on the same data (rcmat), and I get logliks with opposite signs. Practically it is often a similar magnitude, although logliks between 0 and 50 are also found sometimes. Depending on the sample I can get everything from 0 to ~80% of negative logliks.

xx = preProcSample(rcmat, ndepthmax=1e5, cval=10) oo=procSample(xx,cval=50, min.nhet=3) fit=emcncf(oo, min.nhet=3, maxiter=10); fit$loglik [1] 161.6677 fit$emflags [1] " Noisy sample, Calls can be unreliable."

xx = preProcSample(rcmat, ndepthmax=1e5, cval=10) oo=procSample(xx,cval=50, min.nhet=3) fit=emcncf(oo, min.nhet=3, maxiter=10); fit$loglik [1] -160.4713 # The 2nd seems about OK except for the sign ;-) fit$emflags [1] " Noisy sample, Calls can be unreliable. Low purity. Calls can be unreliable."

I assume the loglik has an implied minus sign, otherwise the problem is with the first run.

shenmskcc commented 5 years ago

It is hard to diagnose the problem why you get a very large log-likelihood in the first run without seeing the data. The flags suggest a noisy sample/low purity. These cases are always challenging. If you can show the plots in each run that may provide additional information.

Ronglai

On Mon, May 6, 2019 at 2:47 PM dvenet notifications@github.com wrote:

The log likelihood reported can sometimes be largely negative. My data are quite sparse TGS data, so I use a low min.het and cval. Facets complains about data quality, however the outputs are quite often OK and I would like to use the loglik to sort out the best one.

For instance, I run facets twice on the same data (rcmat), and I get logliks with opposite signs. Practically it is often a similar magnitude, although logliks between 0 and 50 are also found sometimes. Depending on the sample I can get everything from 0 to ~80% of negative logliks.

xx = preProcSample(rcmat, ndepthmax=1e5, cval=10) oo=procSample(xx,cval=50, min.nhet=3) fit=emcncf(oo, min.nhet=3, maxiter=10); fit$loglik [1] 161.6677 fit$emflags [1] " Noisy sample, Calls can be unreliable."

xx = preProcSample(rcmat, ndepthmax=1e5, cval=10) oo=procSample(xx,cval=50, min.nhet=3) fit=emcncf(oo, min.nhet=3, maxiter=10); fit$loglik [1] -160.4713 # The 2nd seems about OK except for the sign ;-) fit$emflags [1] " Noisy sample, Calls can be unreliable. Low purity. Calls can be unreliable."

I assume the loglik has an implied minus sign, otherwise the problem is with the first run.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mskcc/facets/issues/130, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7CKVDNOWISG7YZ5LDJ4A3PUB4LBANCNFSM4HLCOSHQ .

dvenet commented 5 years ago

Hi,

Here are figs from 2 runs from a different example. First run has loglik=-2190, second has loglik = 303 (opposite signs).

Note that the copy numbers obtained (using a run with a loglik of 282, which I took as the best) have a correlation of 85% with those obtained on the same sample using ASCAT on an SNP chip, with similar corrections for purity and ploidy, which is quite good IMHO for ionTorrent targeted sequencing (and much better than the output from ionTorrent software, although I am not the one who ran said software so maybe it can do better with other parameters).

f1.pdf f2.pdf f3.pdf f4.pdf

Thanks for looking this up,

David

veseshan commented 5 years ago

What strikes me is the sparsity of het snps in the data. So the fits are not very robust. Is this a custom assay? Can you give us summary statistics on number of loci and proportion of het snps? Thanks.

dvenet commented 5 years ago

It's quite sparse indeed, being TGS (400 genes or so). Although I was sort of planning to try FACETS on TGS on 21 genes after, so this is still the easy case.

I've sampled every 50 bases in addition to the SNPs, and I end up with 48054 points including 598 heterogeneous SNPs. I've had to smooth a bit the data to make it OK, so there's a bit of extra auto-correlation in that.

shenmskcc commented 5 years ago

The second run is indeed the better call. The first one systematically overcalls LOH with reduced cellular fractions across the board.

The sparsity of het SNPs and the complexity of copy number alterations in this sample created challenges for likelihood maximization with local maxima that the EM can be trapped into. Using the loglik to choose the solution is a good idea.

On Thu, May 9, 2019 at 11:22 AM dvenet notifications@github.com wrote:

It's quite sparse indeed, being TGS (400 genes or so). Although I was sort of planning to try FACETS on TGS on 21 genes after, so this is still the easy case.

I've sampled every 50 bases in addition to the SNPs, and I end up with 48054 points including 598 heterogeneous SNPs. I've had to smooth a bit the data to make it OK, so there's a bit of extra auto-correlation in that.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mskcc/facets/issues/130#issuecomment-490949518, or mute the thread https://github.com/notifications/unsubscribe-auth/AC7CKVEBRQSPPX5YZ2QNQ3LPUQ6TNANCNFSM4HLCOSHQ .

dvenet commented 5 years ago

OK, thanks for confirming that using the loglik is good idea. Higher loglik is better if I understand correctly, however loglik higher than 0 are impossible normally - I suppose there is some normalization factor that is not calculated.

I've modified the code so that it displays the loglik when tracing, and its behavior is strange at times. For cases that end with a normal loglik (say 300 here), it goes e.g. from 265, to 285, then 300 so OK.

However, for cases with negative loglik it sometimes starts OK and then jumps down - e.g.

fit = emcncf(oo, min.nhet=tty[1], maxiter=10, trace=TRUE); iter: 1 dif: 0.2225225 purity: 0.4914665 ploidy: 2 loglik: 269.128 iter: 2 dif: 0.02278803 purity: 0.3203154 ploidy: 2 loglik: 296.7563 iter: 3 dif: 0.0306998 purity: 0.2254841 ploidy: 2 loglik: -2014.432 iter: 4 dif: 0 purity: 0.2254841 ploidy: 2 loglik: -1679.532