lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0

Processing noisy FFPE samples #193

Closed naumenko-sa closed 3 years ago

naumenko-sa commented 3 years ago

Hello Markus!

Thanks for your work on PureCN!

Describe the issue

I'm trying to get copy numbers for a number of FFPE samples; both control samples and tumors are quite noisy. Do you have any suggestions for this case, or is it just hopeless and purity estimation is the only result I can expect to get from this data?

To Reproduce

Below is interval_weights_hg19.png from NormalDB.R:

[figure: interval_weights_hg19]

For tumors, the mean sd of log-ratios ranges from 0.21 to 0.63, with most samples >0.3. Purities range from 0.16 to 0.89, with a median of 0.5.

Expected behavior

Even for the best tumor samples, with mean sd of LR = 0.21 or the highest purity (0.63), there are no C values in sample_amplification_pvalues.csv. The hope is to calculate them with some parameter tweaking.

Log file

```
INFO [2021-08-06 03:05:50] PureCN 1.23.5
INFO [2021-08-06 03:05:55] Mean target coverages: 121X (tumor) 116X (normal).
INFO [2021-08-06 03:06:14] Using 404944 intervals (382242 on-target, 22702 off-target).
INFO [2021-08-06 03:06:14] Ratio of mean on-target vs. off-target read counts: 0.52
INFO [2021-08-06 03:06:14] Mean off-target bin size: 90852
INFO [2021-08-06 03:06:15] AT/GC dropout: 1.07 (tumor), 1.06 (normal), 1.02 (coverage log-ratio).
INFO [2021-08-06 03:06:48] Total size of targeted genomic region: 116.39Mb (132.64Mb with 50bp padding).
INFO [2021-08-06 03:06:49] 18.2% of targets contain variants.
INFO [2021-08-06 03:30:24] Interval weights found, will use weighted PSCBS.
INFO [2021-08-06 03:30:24] Setting undo.SD parameter to 1.500000.
INFO [2021-08-06 03:32:22] Found 1556 segments, exceeding max.segments threshold of 300.
INFO [2021-08-06 03:32:22] Setting undo.SD parameter to 2.250000.
INFO [2021-08-06 03:33:55] Found 820 segments, exceeding max.segments threshold of 300.
INFO [2021-08-06 03:33:56] Setting prune.hclust.h parameter to 0.150000.
INFO [2021-08-06 03:34:00] Found 820 segments with median size of 0.20Mb.
INFO [2021-08-06 03:34:00] Using 69343 variants.
INFO [2021-08-06 03:34:06] Mean standard deviation of log-ratios: 0.23
INFO [2021-08-06 03:34:06] Mean standard deviation of on-target log-ratios only: 0.22
INFO [2021-08-06 04:09:26] Mean standard deviation of off-target log-ratios only: 0.18
INFO [2021-08-06 05:23:08] Tumor/normal noise ratio: 1.250
```
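The escalation visible in this log (undo.SD raised from 1.5 to 2.25 once max.segments is exceeded) amounts to a retry loop. The sketch below is an illustrative reimplementation, not PureCN's actual code; the toy segmentation function simply replays the segment counts from the log:

```python
def segment_with_retries(segment_fn, undo_sd=1.5, max_segments=300,
                         scale=1.5, max_tries=1):
    """Run segment_fn(undo_sd) -> segment count; if the count exceeds
    max_segments, scale undo.SD up and retry (merging small segments)."""
    n = segment_fn(undo_sd)
    tries = 0
    while n > max_segments and tries < max_tries:
        undo_sd *= scale  # e.g. 1.5 -> 2.25, as in the log
        n = segment_fn(undo_sd)
        tries += 1
    return undo_sd, n

# Toy stand-in for the segmentation, replaying the counts from the log.
counts = {1.5: 1556, 2.25: 820}
undo_sd, n = segment_with_retries(lambda sd: counts[round(sd, 2)])
print(undo_sd, n)  # still over max.segments; the log then tunes prune.hclust.h
```

Note that in the log even the second attempt stays above the threshold, after which prune.hclust.h is adjusted instead.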

So far, I've tried:

Is there anything that could be done to get C numbers and maybe some deletions (the results are only amplifications with C=NA)?

Thanks! Sergey

lima1 commented 3 years ago

Hi Sergey,

You want to look at sample_genes.csv.

sample_amplification_pvalues.csv is mostly for cfDNA. We use it strictly as a second pipeline for samples below 10% ctDNA. The algorithm is similar to what companies like Guardant do (or did; they might have changed by now): you simply calculate z-scores of coverage increase compared to a pool of normals. If the increase is significant and an outlier in the genome (at higher purity, even a single gain is significant, but wouldn't be an outlier in the genome), looking only at known amplifications, we believe it's a real amplification.
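The two-filter idea described above can be sketched in a few lines. This is not PureCN's implementation; the MAD-based robust z-score and the cutoff of 3 are illustrative assumptions:

```python
import statistics

def robust_z(x, background):
    """Robust z-score of x against a background distribution (median/MAD)."""
    med = statistics.median(background)
    mad = statistics.median(abs(v - med) for v in background)
    return 0.6745 * (x - med) / mad

def looks_amplified(gene_lr, normal_pool_lrs, genome_lrs, cutoff=3.0):
    """Flag a gene only if its coverage log-ratio is both significant
    vs. the pool of normals AND an outlier within this genome."""
    return (robust_z(gene_lr, normal_pool_lrs) > cutoff
            and robust_z(gene_lr, genome_lrs) > cutoff)

# Toy numbers (hypothetical): normals scatter around 0, genome mostly flat.
normals = [0.02, -0.01, 0.0, 0.03, -0.02]
genome = [0.0, 0.05, -0.03, 0.01, 0.9, 0.02, -0.01]
print(looks_amplified(0.9, normals, genome))   # clear outlier both ways
print(looks_amplified(0.02, normals, genome))  # within normal scatter
```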

This sample looks pretty good. I assume that's a WES + promoter etc. assay? You should increase max.segments in this case, because you have a huge number of data points (that's almost 40x our assay).

I don't think off-target reads add a lot for you, but as of 1.23.8 they are automatically ignored if they aren't helpful.

0.5 is bad, but happens with FFPE.

Markus

naumenko-sa commented 3 years ago

Thanks, Markus, there are C values in the sample_genes.csv!

This is just Agilent SureSelect6 exome, so it has more segments than a panel.

Does it still make sense to force larger segments here with --undosd 1.5? The difference in the end result is, for example, 4 deletions instead of ~120 gene-level deletions, which is easier to review; maybe those calls are also more trustworthy?

lima1 commented 3 years ago

Oh, so you applied padding to the Agilent baits BED file before feeding it to IntervalFile.R? That's fine, I guess (other tools do the same). It's just that the padding you specify in PureCN.R is applied on top of that. Edit: did it improve the noise compared to no padding?

Whatever works! With cfDNA, what I did in the past was create a blacklist of deletions/amplifications in normal samples or samples with little shedding. Tissue is a bit more difficult, but you can still list all recurrent calls and blacklist if those make no sense. There is an open issue for that.
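The blacklist approach amounts to counting how often a call recurs across normals or low-shedding samples. A minimal sketch (hypothetical helper; the gene names and recurrence cutoff are made up):

```python
from collections import Counter

def build_blacklist(calls_per_sample, min_recurrence=3):
    """calls_per_sample: {sample: set of genes called, e.g. deletions seen
    in normals or low-shedding samples}. Genes recurring across many such
    samples are likely artifacts and can be blacklisted."""
    counts = Counter(g for genes in calls_per_sample.values() for g in genes)
    return {g for g, n in counts.items() if n >= min_recurrence}

# Toy input: GENE_A is called in every "normal", so it gets blacklisted.
calls = {
    "N_001": {"GENE_A", "GENE_B"},
    "N_002": {"GENE_A"},
    "N_003": {"GENE_A", "GENE_C"},
}
print(build_blacklist(calls, min_recurrence=3))  # {'GENE_A'}
```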

naumenko-sa commented 3 years ago

Yes, I padded by 100bp in both directions, and PureCN adds 50bp on top of that.

I don't see much difference in interval_weights between non-padded and padded intervals (these are from NormalDB.R):

[figure: interval_weights_hg19, non-padded]

[figure: interval_weights_hg19, padded]

In terms of MAPD (http://tools.thermofisher.com/content/sfs/brochures/mapd_snp6_whitepaper.pdf; low noise corresponds to MAPD ~0.05), there is no improvement either.

| sample | not-padded | padded |
|--------|------------|--------|
| T_001  | 0.41       | 0.5    |
| T_002  | 0.53       | 0.57   |
| T_003  | 0.61       | 0.6    |
| T_004  | 0.52       | 0.48   |
| T_005  | 0.59       | 0.56   |
| T_006  | 0.77       | 0.84   |
| T_007  | 0.61       | 0.62   |
| T_008  | 0.51       | 0.48   |
| T_009  | 0.88       | 1.02   |
| T_010  | 0.7        | 0.64   |
| T_011  | 0.68       | 0.64   |
| T_012  | 0.45       | 0.53   |
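For reference, the MAPD from the whitepaper linked above is just the median absolute difference between adjacent log2 ratios. A minimal sketch, assuming the ratios are already ordered along the genome:

```python
import statistics

def mapd(log2_ratios):
    """Median Absolute Pairwise Difference of adjacent log2 ratios:
    robust to real CNV breakpoints, sensitive to point-to-point noise."""
    diffs = [abs(b - a) for a, b in zip(log2_ratios, log2_ratios[1:])]
    return statistics.median(diffs)

# Toy example: a flat profile with small jitter gives a low MAPD (~0.1 here).
lrs = [0.0, 0.1, -0.1, 0.05, 0.0, -0.05]
print(round(mapd(lrs), 2))
```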

Mean sd of log-ratios of T samples:

| sample | not-padded | padded |
|--------|------------|--------|
| T_001  | 0.26       | 0.28   |
| T_002  | 0.29       | 0.30   |
| T_003  | 0.30       | 0.32   |
| T_004  | 0.29       | 0.31   |
| T_005  | 0.31       | 0.33   |
| T_006  | 0.38       | 0.39   |
| T_007  | 0.33       | 0.32   |
| T_008  | 0.26       | 0.28   |
| T_009  | 0.39       | 0.39   |
| T_010  | 0.32       | 0.33   |
| T_011  | 0.31       | 0.32   |
| T_012  | 0.28       | 0.29   |

I should also mention that it was not only the padding that changed between the two runs, but also:

Thanks for the advice with the stop-list, I see the issue: https://github.com/lima1/PureCN/issues/130

lima1 commented 3 years ago

Awesome, that confirms my intuition; it was something I had wanted to benchmark for a while, since many tools use something like 200bp padding by default. At least for high-coverage data, I don't think the increase in read counts decreases variance anymore, while the padding might make the data more vulnerable to other biases that introduce additional variance.

The SDs don't look too bad for FFPE.

Let me know if you see any patterns in your artifacts and I can work on improving the flagging (probably after the fall release; I'm trying to make the next one mainly a bug-fix release). With our small panels, that just wasn't a big deal so far, but it recently came up here as well since a new assay had a completely different artifact profile.

lima1 commented 3 years ago

Related issue is #179