lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
128 stars 32 forks source link

Ranked Solutions do not match up with log-likelihood values #173

Closed tinyheero closed 3 years ago

tinyheero commented 3 years ago

I've got a strange scenario where the ranking of the solutions don't appear to match up with the reported log-likelihood values. Here is the output overview plot when you take the PureCN output directly:

purecn_overview_1

When you extract the log-likelihood values extracted from the output (ret) of the runAbsoluteCN function:

#' Get PureCN candidate solutions dataframe
#'
#' @param purecn_output
get_purecn_candidate_solutions_df <- function(purecn_output) {

  out_df <-
    purrr::map_dbl(purecn_output$results, purrr::pluck, "purity") %>%
    tibble::enframe(name = "solution_id", value = "purity") %>%
    dplyr::left_join(
      purrr::map_dbl(purecn_output$results, purrr::pluck, "ploidy") %>%
        tibble::enframe(name = "solution_id", value = "ploidy"),
      by = "solution_id"
    ) %>%
    dplyr::left_join(
      purrr::map_dbl(purecn_output$results, purrr::pluck, "log.likelihood") %>%
        tibble::enframe(name = "solution_id", value = "cn_llik"),
      by = "solution_id"
    ) %>%
    dplyr::left_join(
      purrr::map_dbl(
        purecn_output$results, purrr::pluck, "total.log.likelihood"
      ) %>%
        tibble::enframe(name = "solution_id", value = "total_llik"),
      by = "solution_id"
    ) %>%
    dplyr::left_join(
      purrr::map(purecn_output$results, purrr::pluck, "SNV.posterior") %>%
        purrr::map_dbl(purrr::pluck, "llik") %>%
        tibble::enframe(name = "solution_id", value = "snv_llik"),
      by = "solution_id"
    )

  return(out_df)
}

get_purecn_candidate_solutions_df(ret)
# A tibble: 10 x 6
   solution_id purity ploidy cn_llik total_llik  snv_llik
         <int>  <dbl>  <dbl>   <dbl>      <dbl>     <dbl>
 1           1   0.11   2.09  52660.   -284239.  -310521.
 2           2   0.28   2.06  51248.   -284150.  -309749.
 3           3   0.55   2.00  45660.   -280050.  -302880.
 4           4   0.36   4.07  51693.   -312928.  -338738.
 5           5   0.42   6.00  51211.   -340828.  -366398.
 6           6   0.24   5.13  52710.   -387899.  -414205.
 7           7   0.28   3.06  51431.   -442310.  -468001.
 8           8   0.27   1.08  51837.   -544194.  -570089.
 9           9   0.48   3.06  50170.   -695980.  -721041.
10          10   0.42   1.05  46749.  -1108316. -1131678.

Here the top ranked solution (i.e. 1) doesn't actually have the highest total_llik. In fact, it's the 3rd ranked solution that actually has the highest total_llik.

Based on the code, there appears to be a .rankResults function that should be run in the runAbsoluteCN function. When you do this:

ret$results <- PureCN:::.rankResults(ret$results)
get_purecn_candidate_solutions_df(ret)
# A tibble: 10 x 6
   solution_id purity ploidy cn_llik total_llik  snv_llik
         <int>  <dbl>  <dbl>   <dbl>      <dbl>     <dbl>
 1           1   0.55   2.00  45660.   -280050.  -302880.
 2           2   0.28   2.06  51248.   -284150.  -309749.
 3           3   0.11   2.09  52660.   -284239.  -310521.
 4           4   0.36   4.07  51693.   -312928.  -338738.
 5           5   0.42   6.00  51211.   -340828.  -366398.
 6           6   0.24   5.13  52710.   -387899.  -414205.
 7           7   0.28   3.06  51431.   -442310.  -468001.
 8           8   0.27   1.08  51837.   -544194.  -570089.
 9           9   0.48   3.06  50170.   -695980.  -721041.
10          10   0.42   1.05  46749.  -1108316. -1131678.

And then plot the overview:

PureCN:::plotAbs(ret, type = "overview")

purecn_overview_2

You get a overview plot that has the solutions ranked correctly according to the log-likelihood values

Any idea what could explain what happened here? This is based on Pure CN 1.20.0.

lima1 commented 3 years ago

Hi, the ranking in this plot is after the fitting of the SNVs. Can you post the B-allele frequency plot as well? Likely that 0.5 does not fit the SNP allelic fractions very well.

tinyheero commented 3 years ago

Hi @lima1

Thanks for the reply. Here are the BAF plots for the 11% and 55% purity solutions, respectively:

11_baf_plot 55_baf_plot

For the record, the true purity is 10% as this is simulated data.

Just so I understand what you are saying. The first overview plot (with the 11% purity solution being ranked first) does NOT include the fitting of the SNVs?

purecn_overview_1

While the second overview plot (with the 55% purity as ranked first and after the running .rankResults) does include the fitting of SNVs?

purecn_overview_2

Does that mean the .rankResults function was actually NOT run in runAbsoluteCN since the first overview plot is based on the direct output of runAbsoluteCN?

lima1 commented 3 years ago

Hmm, the copy number profile is clean, but the SNPs look weird. Can you share the log file as well? Does your VCF contain both germline and somatic variants?

tinyheero commented 3 years ago

From the RDS file, you can run the following to reproduce the overview plot I showed above:

library("PureCN")
ret <- readRDS("X_NESS-SCNA-1-P10_SU_T1-R1.purecn.rds")
plotAbs(ret, type = "overview")
> packageVersion("PureCN")
[1] '1.20.0'
lima1 commented 3 years ago

Thanks, got it. Will have a look later today.

tinyheero commented 3 years ago

Thanks. Sorry just to clarify, the RDS output is actually the output from PureCN::bootstrapResults that is run after runAbsoluteCN. As far as I can tell, this hasn't changed anything. But just in case it matters.

lima1 commented 3 years ago

Is this simulated data? If so, then the germline allelic fractions aren't properly simulated. Try a binomial distribution (or beta bin if you want to add some noise) and don't forget to simulate both alleles. The VCF only contains the major allele, that's why none of the SNPs are < 0.5.

lima1 commented 3 years ago

Oops, you were mentioning it's simulated data, and sorry, I often get a similar but different question and didn't read your post properly.

In that case, I think PureCN recognizes that 0.5 is unlikely correct and picks the lower purity one. My guess is if you simulate the SNPs more realistically, 0.11 should be the clear winner.

tinyheero commented 3 years ago

Thanks @lima1.

I'll take a look at the germine allelic fractions.

I guess what I don't undertstand is when does PureCN do that final selection? The reason why this is important is because we use the log-likelihood values for some downstream analysis. We've been assuming that they are reflective of the solution ranks. But this example seems to suggest otherwise (i.e. the solution ranks do not correlate 100% with their log-likelihood). As such, it suggests there is some unaccounted for ranking/selection step that is throwing our downstream analysis off.

lima1 commented 3 years ago

Whenever the top ranking solution has no copy number alterations (different from 2/1), it picks the diploid solution with lowest purity. This was mainly added for cfDNA when there are no somatic events and PureCN had nothing to work with, just to not confuse users who don't read the warnings.

I've never seen this step applied to cases with clear CNAs, but in this case it seems to be because of the simulated SNPs. If you add some stochastic noise, then PureCN should see that is can't be 0.5. (The beta model simply tops the coverage at 300X to get some variance into the fitting. Beta-bin is more accurate and establishes the noise in a pool of normals, but thus works best with a larger PoN).

tinyheero commented 3 years ago

Oh I see. I looked into the runAbsoluteCN function again and found this line (https://github.com/lima1/PureCN/blob/master/R/runAbsoluteCN.R#L1096)

results <- .findClosestSolution(results, 
            purity = min(test.purity), ploidy = 2, ploidy.div = 1)

It seems this .findClosestSolution function (https://github.com/lima1/PureCN/blob/97629fdbe9b89b3ab48e1d93cc2d036a63f73672/R/readCurationFile.R#L88) is called at the very end and appears to be doing this final ranking step. It doesn't update the log-likelihood and only the ranking. Would this explain the final ranking?

lima1 commented 3 years ago

Yep. But again, I don't think you would ever get a "NON-ABERRANT" flag when you simulate the SNPs more realistically. Let me know when you do!

tinyheero commented 3 years ago

Great to solve that mystery.

Regarding the SNP data, it's a bit more complex. We are using simulated sequencing data and we are indeed simulating both SNPs. But we do a denoising step and then use mirrored BAF as input. The mirrored BAF explain why our allelic fractions are never < 0.5. How the additional denoising step is playing a role is unclear to me. Regardless, it would suggest that these two steps are negatively impacting PureCN.

lima1 commented 3 years ago

You would not be able to do this denoising step in real data, right? Before the segmentation step at least?

The mirroring probably has little impact, but because you remove noise, and PureCN expects noise, the SNPs fit the 0.55 solution reasonably well. If you would keep the noise, you would have more SNPs more far away from the 0.5 when you have CNAs (thus the real value different from 0.5). That will lower the likelihood of the wrong solution.

Combining the two models, coverage and SNPs, is quite frankly also a bit ad hoc and empirically tuned to make it work well across a wide range of very different datasets. Not sure how the simulation affects this combined ranking, but would assume it lowers the impact of SNPs as well since they are so clean (low ranking is mostly due to very poor fitting).

tinyheero commented 3 years ago

You would not be able to do this denoising step in real data, right? Before the segmentation step at least?

No we've implemented a method to do the denoising before segmentation to help clean up the input signal. It's good to know that PureCN expects noise. It would suggest that the denoising is actually making things worse. Is there a way to tell PureCN to not expect noise?

lima1 commented 3 years ago

Fair enough, I assume this will be part of your benchmarking anyways.

Whatever you are doing to the tumor VCFs, do the same with your normals. Then merge all the normals, either in a multi-sample VCF or GenomicsDB. The latter however requires additional dependencies in PureCN (see Best Practices vignette). Then use the NormalDB.R script to generate a mapping bias database (again, see BP vignette). Finally, set —model betabin in PureCN.R and provide the generated mapping bias RDS file.

Good luck!