lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0

any plan for amplicon sequencing data? #121

Closed dauss75 closed 4 years ago

dauss75 commented 4 years ago

Hello, I have been exploring and testing PureCN on our amplicon-based targeted WES data from cancer patients. I was quite impressed by the clear documentation and implementation and hoped PureCN would work for our data; however, I've lost hope in the end. :(

I am working on predicting somatic/germline mutations from tumor-only samples. We also have matched tumor/normal (T/N) samples, which we treat as the gold standard for benchmarking the performance of programs. We have enough normal samples to create a panel of normals, but when looking at the posterior somatic probability from PureCN, the somatic mutations identified from T/N were all over the place. I understand that PureCN is intended for hybrid capture data, but I was wondering whether there are any plans to extend PureCN to amplicon data.

Best, Segun
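For readers who want to quantify "all over the place", here is a minimal sketch of how tumor-only posterior somatic probabilities could be scored against the T/N gold standard. The variant keys, the 0.5 threshold, and the function name are assumptions for illustration only; nothing here is part of PureCN.

```python
from typing import Dict, Set, Tuple

# A variant is identified by (chrom, pos, ref, alt); this key format is an assumption.
Variant = Tuple[str, int, str, str]


def benchmark_somatic_calls(
    gold_somatic: Set[Variant],
    posterior_somatic: Dict[Variant, float],
    threshold: float = 0.5,  # hypothetical cutoff, not a PureCN default
) -> Dict[str, float]:
    """Score tumor-only somatic calls against matched T/N somatic calls."""
    # Call a variant somatic when its posterior probability reaches the cutoff.
    predicted = {v for v, p in posterior_somatic.items() if p >= threshold}
    tp = len(predicted & gold_somatic)   # somatic in both
    fp = len(predicted - gold_somatic)   # called somatic, germline in T/N
    fn = len(gold_somatic - predicted)   # somatic in T/N, missed in tumor only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}
```

Running this once per sample gives a number that can be compared across PoN variants, instead of eyeballing the posterior probabilities.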

lima1 commented 4 years ago

Hi Segun,

The measured allelic fractions in hybrid capture are a pretty good approximation of the true fractions, which makes the T/N classification possible. I have never worked with amplicon data, but my understanding is that these can be very off in amplicon data without UMIs. Can you share a plot of SNP allelic fractions of a tumor sample with lots of CNAs, similar to https://scfbm.biomedcentral.com/articles/10.1186/s13029-016-0060-z/figures/4?

Markus

dauss75 commented 4 years ago

Hi Markus, Thanks for the note.

We generate data with UMIs, so I guess at least one concern is off the table.
Please see the figures below from two samples that we received from FOC.

image

image

I just realized that you have a lot more data points (mutations) in the paper than I generated, which may be due to the different target panel size (ours is ~1Mb). Would this be a concern? I think I read (in the paper or vignette) that at least 1000 mutations in a VCF are recommended.

A little more info about our data and the PureCN parameters:

data: high coverage (> 500X)

tried w/ and w/o --keepduplicates since we use UMI -> same result

--model=betabin; not used: --offtarget

Thanks again for looking into this problem!

Segun

lima1 commented 4 years ago

Yes, 200 variants is very low. I assume that's already the maximum you can get, i.e. increasing the padding won't help much. We get > 2000 variants with 3Mb and 75bp padding. Without padding, it's 1000 iirc, which would be still more than you get. Apart from that it does indeed not look too terrible.

How many normal samples do you have? Did you generate a mapping bias file?
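To make the padding remark concrete: padding extends each target interval on both sides, so more flanking SNPs fall inside the territory where variants are collected. The sketch below pads and merges intervals and counts the covered bases. It assumes half-open coordinates on a single chromosome, and the function names are hypothetical, not PureCN code.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on one chromosome (assumption)


def pad_and_merge(intervals: List[Interval], padding: int) -> List[Interval]:
    """Extend every interval by `padding` bp on each side, then merge overlaps."""
    padded = sorted((max(0, s - padding), e + padding) for s, e in intervals)
    merged: List[Interval] = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals: List[Interval]) -> int:
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)
```

For closely spaced amplicons, the padded intervals merge quickly, which is one reason extra padding may add little new territory on a small panel.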

dauss75 commented 4 years ago

Before and after filtering, it's ~750-1400 and ~300-400 variants, respectively. Maybe I get only ~200 variants because I pass --minaf=0.1 to PureCN.R.

To simplify the experiment, I picked four samples. I ran the four matched T/N samples through PureCN, and it picked up all the somatic mutations correctly, with a few false positives (germline). For tumor only (T), I created the normalDB using only the four matched normal samples, but the results are way off from T/N.

There were some differences in purity and ploidy between T/N and T, but I'm not sure this matters, since for s3 and s4, where the estimates agree, the T-only run still didn't predict the somatic mutations correctly.

sample  purity (T/N)  purity (T)  ploidy (T/N)  ploidy (T)
s1      0.76          0.44        5.30          3.02
s2      0.34          0.56        4.71          2.03
s3      0.15          0.15        2.20          2.22
s4      0.55          0.56        2.05          2.05
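As a quick check on the table, the following sketch flags samples whose tumor-only purity or ploidy departs from the T/N estimate. The values are copied from the table; the tolerance thresholds are arbitrary assumptions chosen only for illustration.

```python
# (T/N, T) estimates copied from the table above.
estimates = {
    "s1": {"purity": (0.76, 0.44), "ploidy": (5.30, 3.02)},
    "s2": {"purity": (0.34, 0.56), "ploidy": (4.71, 2.03)},
    "s3": {"purity": (0.15, 0.15), "ploidy": (2.20, 2.22)},
    "s4": {"purity": (0.55, 0.56), "ploidy": (2.05, 2.05)},
}


def discordant_samples(estimates, max_purity_diff=0.1, max_ploidy_diff=0.5):
    """Return samples where T-only and T/N estimates disagree beyond the tolerances.

    Both tolerances are hypothetical and should be tuned to the use case.
    """
    flagged = []
    for sample, values in estimates.items():
        purity_tn, purity_t = values["purity"]
        ploidy_tn, ploidy_t = values["ploidy"]
        if (abs(purity_tn - purity_t) > max_purity_diff
                or abs(ploidy_tn - ploidy_t) > max_ploidy_diff):
            flagged.append(sample)
    return flagged


print(discordant_samples(estimates))  # ['s1', 's2']
```

With these tolerances only s1 and s2 are discordant, so the misclassification in s3 and s4 cannot be explained by a wrong purity/ploidy solution alone.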

We have 28 normal samples from different projects (ours, FOC, etc.). I have used subsets or all of the samples to create various PoNs, but no luck so far.
I didn't generate a mapping bias file. Would this help? In any case, I will give it a try.

lima1 commented 4 years ago

Since SNPs should be > 0.1, you should be fine. Filtered by PureCN? Can you share a log file?

Yes, mapping bias should help a lot. It will find noisy SNPs and ignore them. A future version will also use the mapping bias file more efficiently by optimizing the beta-binomial allelic fraction distributions.
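The idea of finding noisy SNPs from a pool of normals can be illustrated with a simplified sketch. This is not PureCN's mapping bias implementation: it only assumes that heterozygous germline SNPs in normal samples should show allelic fractions near 0.5, and it flags positions that are systematically shifted or highly variable. All names and thresholds are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Set


def flag_noisy_snps(
    normal_afs: Dict[str, List[float]],
    expected: float = 0.5,   # expected AF of a heterozygous SNP in a normal
    max_bias: float = 0.1,   # hypothetical tolerance on the mean AF
    max_sd: float = 0.1,     # hypothetical tolerance on the AF spread
    min_samples: int = 3,    # skip SNPs seen in too few normals
) -> Set[str]:
    """Flag SNPs whose AFs across normal samples look biased or unstable."""
    noisy = set()
    for snp, afs in normal_afs.items():
        if len(afs) < min_samples:
            continue  # not enough evidence to judge this position
        biased = abs(mean(afs) - expected) > max_bias
        unstable = pstdev(afs) > max_sd
        if biased or unstable:
            noisy.add(snp)
    return noisy
```

With only four normals, few SNPs will be heterozygous in enough samples to be judged at all, which is one reason a larger pool of normals helps.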

dauss75 commented 4 years ago

Filtered by PureCN? -> Sorry about the confusion. I meant to say "filtered by our pipeline", taking into account ref/alt, AF, and artifacts.
Can you share a log file? -> I wrapped your program in a Python script for parallelization and didn't bother to save the output from R, but I will make one and send it to you soon. Perhaps via email? Please let me know. Thanks!
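For a wrapper like the one described, capturing the R output per sample takes only a few lines. The sketch below is an assumption about how such a wrapper might look; `run_and_log` is a hypothetical helper, and `cmd` stands for whatever Rscript invocation the wrapper already builds.

```python
import subprocess
from pathlib import Path
from typing import List


def run_and_log(cmd: List[str], log_path: str) -> int:
    """Run `cmd` and write its stdout and stderr to `log_path`.

    Returns the exit code so the caller can detect failed samples.
    """
    path = Path(log_path)
    with path.open("w") as log:
        # Merge stderr into stdout so R messages and warnings end up in one file.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

Using one log file per sample keeps the output of parallel runs from interleaving.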

lima1 commented 4 years ago

Yes, email works. Thanks.