Hi, the ranking in this plot is after the fitting of the SNVs. Can you post the B-allele frequency plot as well? Likely that 0.5 does not fit the SNP allelic fractions very well.
Hi @lima1
Thanks for the reply. Here are the BAF plots for the 11% and 55% purity solutions, respectively:
For the record, the true purity is 10% as this is simulated data.
Just so I understand what you are saying: the first overview plot (with the 11% purity solution ranked first) does NOT include the fitting of the SNVs, while the second overview plot (with the 55% purity solution ranked first, produced after running `.rankResults`) does include the fitting of SNVs?

Does that mean the `.rankResults` function was actually NOT run in `runAbsoluteCN`, since the first overview plot is based on the direct output of `runAbsoluteCN`?
Hmm, the copy number profile is clean, but the SNPs look weird. Can you share the log file as well? Does your VCF contain both germline and somatic variants?
- `X_NESS-SCNA-1-P10_SU_T1-R1.purecn.vcf`: the VCF file that went into `runAbsoluteCN`. This contains both germline and somatic variants.
- `X_NESS-SCNA-1-P10_SU_T1-R1.purecn.rds`: the direct output of `runAbsoluteCN`.
- `X_NESS-SCNA-1-P10_SU_T1-R1.purecn.pdf`: the BAF plots of the different candidate solutions.
- `X_NESS-SCNA-1-P10_SU_T1-R1.purecn.log`: the log file.

From the RDS file, you can run the following to reproduce the overview plot I showed above:
library("PureCN")
ret <- readRDS("X_NESS-SCNA-1-P10_SU_T1-R1.purecn.rds")
plotAbs(ret, type = "overview")
> packageVersion("PureCN")
[1] '1.20.0'
Thanks, got it. Will have a look later today.
Thanks. Sorry, just to clarify: the RDS output is actually the output from `PureCN::bootstrapResults`, which is run after `runAbsoluteCN`. As far as I can tell, this hasn't changed anything, but just in case it matters.
Is this simulated data? If so, then the germline allelic fractions aren't properly simulated. Try a binomial distribution (or beta bin if you want to add some noise) and don't forget to simulate both alleles. The VCF only contains the major allele, that's why none of the SNPs are < 0.5.
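For example, something along these lines (a toy sketch of mine, not PureCN code):

```r
# Toy simulation of heterozygous germline SNP allelic fractions.
# Sampling alt read counts directly yields both alleles, so roughly
# half of the fractions fall below 0.5, as PureCN expects.
set.seed(42)
n_snps    <- 1000
depth     <- rpois(n_snps, lambda = 100)   # per-SNP sequencing depth
p_het     <- 0.5                           # expected AF of a diploid het SNP
alt_reads <- rbinom(n_snps, size = depth, prob = p_het)
af        <- alt_reads / depth

# Beta-binomial variant: draw the per-SNP probability from a beta
# distribution to add overdispersion on top of the sampling noise.
rho    <- 0.01
shape  <- (1 - rho) / rho
alt_bb <- rbinom(n_snps, size = depth,
                 prob = rbeta(n_snps, p_het * shape, (1 - p_het) * shape))
af_bb  <- alt_bb / depth
```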
Oops, you did mention it's simulated data. Sorry, I often get a similar but different question and didn't read your post properly.
In that case, I think PureCN recognizes that 0.5 is unlikely to be correct and picks the lower-purity one. My guess is that if you simulate the SNPs more realistically, 0.11 should be the clear winner.
Thanks @lima1.
I'll take a look at the germline allelic fractions.
I guess what I don't understand is when PureCN does that final selection. The reason this is important is that we use the log-likelihood values for some downstream analysis. We've been assuming that they are reflective of the solution ranks, but this example seems to suggest otherwise (i.e. the solution ranks do not correlate 100% with their log-likelihoods). As such, it suggests there is some unaccounted-for ranking/selection step that is throwing our downstream analysis off.
Whenever the top ranking solution has no copy number alterations (different from 2/1), it picks the diploid solution with lowest purity. This was mainly added for cfDNA when there are no somatic events and PureCN had nothing to work with, just to not confuse users who don't read the warnings.
I've never seen this step applied to cases with clear CNAs, but in this case it seems to be because of the simulated SNPs. If you add some stochastic noise, then PureCN should see that it can't be 0.5. (The beta model simply caps the coverage at 300X to get some variance into the fitting. Beta-binomial is more accurate and estimates the noise in a pool of normals, but thus works best with a larger PoN.)
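Schematically, the fallback behaves something like this (a toy illustration with made-up names, not the actual implementation):

```r
# Toy sketch of the fallback described above: if the top-ranked solution
# has no copy number alterations, fall back to the diploid solution with
# the lowest purity. Column names here are hypothetical.
solutions <- data.frame(
    purity       = c(0.55, 0.30, 0.11),
    ploidy       = c(2.0, 2.6, 2.0),
    non_aberrant = c(TRUE, FALSE, FALSE)
)
if (solutions$non_aberrant[1]) {
    diploid <- solutions[abs(solutions$ploidy - 2) < 0.1, ]
    picked  <- diploid[which.min(diploid$purity), ]   # purity 0.11 here
}
```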
Oh I see. I looked into the `runAbsoluteCN` function again and found this line (https://github.com/lima1/PureCN/blob/master/R/runAbsoluteCN.R#L1096):
```r
# Quoted from runAbsoluteCN.R (linked above), at the very end of the function:
results <- .findClosestSolution(results,
    purity = min(test.purity), ploidy = 2, ploidy.div = 1)
```
It seems this `.findClosestSolution` function (https://github.com/lima1/PureCN/blob/97629fdbe9b89b3ab48e1d93cc2d036a63f73672/R/readCurationFile.R#L88) is called at the very end and appears to be doing this final ranking step. It updates only the ranking, not the log-likelihood. Would this explain the final ranking?
Yep. But again, I don't think you would ever get a "NON-ABERRANT" flag when you simulate the SNPs more realistically. Let me know when you do!
Great to solve that mystery.
Regarding the SNP data, it's a bit more complex. We are using simulated sequencing data and we are indeed simulating both alleles. But we do a denoising step and then use mirrored BAF as input. The mirrored BAF explains why our allelic fractions are never < 0.5. How the additional denoising step is playing a role is unclear to me. Regardless, it would suggest that these two steps are negatively impacting PureCN.
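Concretely, the mirroring is just folding the fractions around 0.5, along these lines (a minimal sketch, not our actual pipeline code):

```r
# Mirrored BAF: fold allelic fractions around 0.5 so no value falls below it.
baf          <- c(0.31, 0.48, 0.52, 0.71)
mirrored_baf <- pmax(baf, 1 - baf)
mirrored_baf
#> [1] 0.69 0.52 0.52 0.71
```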
You would not be able to do this denoising step in real data, right? Before the segmentation step at least?
The mirroring probably has little impact, but because you remove noise, and PureCN expects noise, the SNPs fit the 0.55 solution reasonably well. If you kept the noise, you would have more SNPs farther away from 0.5 when you have CNAs (and thus a true value different from 0.5). That would lower the likelihood of the wrong solution.
Combining the two models, coverage and SNPs, is quite frankly also a bit ad hoc and empirically tuned to make it work well across a wide range of very different datasets. Not sure how the simulation affects this combined ranking, but I would assume it lowers the impact of SNPs as well, since they are so clean (the low ranking is mostly due to very poor fitting).
> You would not be able to do this denoising step in real data, right? Before the segmentation step at least?
No, we've implemented a method to do the denoising before segmentation to help clean up the input signal. It's good to know that PureCN expects noise; it would suggest that the denoising is actually making things worse. Is there a way to tell PureCN not to expect noise?
Fair enough, I assume this will be part of your benchmarking anyways.
Whatever you are doing to the tumor VCFs, do the same with your normals. Then merge all the normals, either in a multi-sample VCF or GenomicsDB. The latter, however, requires additional dependencies in PureCN (see the Best Practices vignette). Then use the NormalDB.R script to generate a mapping bias database (again, see the BP vignette). Finally, set `--model betabin` in PureCN.R and provide the generated mapping bias RDS file.
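In R, those last two steps look roughly like this (a sketch; I'm assuming `calculateMappingBiasVcf()` and the `model`/`mapping.bias.file` arguments of `runAbsoluteCN()`, with placeholder file names; check the Best Practices vignette for the exact calls in your version):

```r
library("PureCN")

# Build a mapping bias database from the merged multi-sample normal VCF
# (roughly what the NormalDB.R script does; file names are placeholders):
bias <- calculateMappingBiasVcf("normals_merged.vcf.gz", genome = "hg38")
saveRDS(bias, "mapping_bias.rds")

# Then use the beta-binomial model together with the bias database:
ret <- runAbsoluteCN(
    tumor.coverage.file = "tumor_coverage_loess.txt.gz",  # placeholder
    vcf.file            = "tumor.vcf.gz",                 # placeholder
    model               = "betabin",
    mapping.bias.file   = "mapping_bias.rds",
    genome              = "hg38"
)
```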
Good luck!
I've got a strange scenario where the ranking of the solutions doesn't appear to match up with the reported log-likelihood values. Here is the overview plot when you take the PureCN output directly:
When you extract the log-likelihood values from the output (`ret`) of the `runAbsoluteCN` function, the top-ranked solution (i.e. rank 1) doesn't actually have the highest `total_llik`. In fact, it's the 3rd-ranked solution that has the highest `total_llik`.
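For reference, I pulled the values out with something along these lines (a sketch; `total_llik` is my own label, and I'm assuming each element of `ret$results` carries `log.likelihood` and `purity` fields):

```r
# ret is the readRDS() result from the snippet above.
# Field names are assumed; check str(ret$results[[1]]) in your version.
llik <- sapply(ret$results, function(x) x$log.likelihood)
data.frame(
    rank       = seq_along(llik),
    purity     = sapply(ret$results, function(x) x$purity),
    total_llik = llik
)
```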
Based on the code, there appears to be a `.rankResults` function that should be run inside the `runAbsoluteCN` function. When you run `.rankResults` manually and then plot the overview, you get an overview plot with the solutions ranked correctly according to the log-likelihood values.

Any idea what could explain what happened here? This is based on PureCN 1.20.0.