lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0

Multi-sample CNV segmentation and events #60

Closed tedtoal closed 5 years ago

tedtoal commented 5 years ago

I'm working with multiple samples per person and want to identify CNV events shared between samples. I started by trying to identify CNV segment calls that were in common. One problem is that segment boundaries might be close but not identical, so I'd really like to use multi-sample segmentation to align boundaries between samples. That is not easy to do even though PureCN supports external segmentation, because I essentially need to run the first part of PureCN on all the samples, then run the multi-sample segmentation, then run the last part of PureCN. It would be nice if PureCN.R could be broken up into more individual steps to make that possible. I ended up simply moving pairs of segment edges that were closer together than some distance (I used 500 kbp) to the midpoint between the two. Not ideal, but I think not too bad of a compromise.
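The midpoint adjustment described above could look roughly like the following. This is a minimal sketch, not part of PureCN: the function name `snap_boundaries`, the greedy nearest-neighbor pairing, and the restriction to two samples on one chromosome are all assumptions made for illustration.

```python
def snap_boundaries(pos_a, pos_b, max_gap=500_000):
    """Align breakpoints of two samples on a single chromosome.

    Each breakpoint in sample A is paired with the nearest not-yet-paired
    breakpoint in sample B; if the two are closer than max_gap bp, both are
    moved to their midpoint. Unpaired breakpoints are left unchanged.
    """
    a, b = sorted(pos_a), sorted(pos_b)
    used = set()  # indices in b that have already been snapped
    for i, pa in enumerate(a):
        candidates = [(abs(pa - pb), j) for j, pb in enumerate(b) if j not in used]
        if not candidates:
            break
        gap, j = min(candidates)
        if gap < max_gap:
            mid = (pa + b[j]) // 2
            a[i] = b[j] = mid
            used.add(j)
    return a, b
```

With `max_gap=500_000`, breakpoints at 1.0 Mbp and 1.2 Mbp are both moved to 1.1 Mbp, while breakpoints tens of Mbp apart are not touched.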

Also, I decided that CNV segments are not good representations of CNV events. For example, two events, one in the middle of the other, would create 3 segments. I tried finding software that could analyze PureCN segment calls and produce CNV event calls, but I found nothing. Then I had an idea that I believe is reasonably sound, but want to run it past you to see if you see any big problems with it. The idea is that generally one would not expect the edges of two separate CNV events to align, or at least not often. So, each edge of a CNV segment could be taken as one CNV event, which might overcount CNV events by a factor of 2, but I think that is less of a problem than trying to untangle common CNV segments between samples when other CNV events might intervene in some of those samples.
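To make the edge idea concrete, here is a small sketch that turns a list of segment calls into a list of edges. The tuple layout `(chrom, start, end, copy_number)` and the function name `segment_edges` are hypothetical and are not PureCN output formats.

```python
def segment_edges(segments):
    """Return one edge per copy number change between adjacent segments.

    segments: iterable of (chrom, start, end, copy_number) tuples.
    Each edge is (chrom, position, copy_number_left, copy_number_right),
    where position is the start of the right-hand segment.
    """
    ordered = sorted(segments, key=lambda s: (s[0], s[1]))
    edges = []
    for left, right in zip(ordered, ordered[1:]):
        same_chrom = left[0] == right[0]
        if same_chrom and left[3] != right[3]:
            edges.append((right[0], right[1], left[3], right[3]))
    return edges
```

For a gain nested inside a larger gain (copy numbers 2, 3, 4, 3, 2 along one chromosome), this yields four edges for two events, which is the factor-of-2 overcount mentioned above.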

In addition, I wanted to do a statistical test of the markers on each side of each segment edge, in EACH SAMPLE whether or not a segment edge was called at that position. So, I took all PureCN markers on each side of each segment edge, up to either the segment boundary on the other side, or 5 Mbp, whichever was smaller, adjusted their copy ratios for purity and ploidy, then tested the two sets of copy ratios for a difference in means greater than 0.1. I was hoping for a clear separation of each sample into YES edge is present or NO edge is not present, but it seems that there are a lot of cases where the p-value for edge present is not all that small but is small enough to not allow me to say the edge is not present. This might be because of subclonal presence of the CNV at a small frequency in the samples where it seems to be not present. I'm working now on examining results of this. Do you see any big problems with this method? If the method is reasonable, it would be nice if PureCN had an "edge" output, with p-values.
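A rough sketch of this per-edge test follows. Several things here are assumptions rather than what was actually run: the purity and ploidy adjustment uses the common formula relating observed copy ratio to tumor copy number, the test is a Welch-style statistic with a normal approximation for the p-value, and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def adjust_ratio(ratio, purity, ploidy):
    """Tumor copy number implied by an observed copy ratio (assumed formula,
    with a diploid normal contaminant)."""
    normal_part = 2 * (1 - purity)
    return (ratio * (purity * ploidy + normal_part) - normal_part) / purity

def flank_values(markers, edge, prev_boundary, next_boundary, max_span=5_000_000):
    """Split (position, value) markers into left and right flanks of an edge,
    limited by the neighboring segment boundary or max_span, whichever is closer."""
    lo = max(prev_boundary, edge - max_span)
    hi = min(next_boundary, edge + max_span)
    left = [v for pos, v in markers if lo <= pos < edge]
    right = [v for pos, v in markers if edge <= pos < hi]
    return left, right

def edge_pvalue(left, right, margin=0.1):
    """Approximate p-value for H0: |mean difference| <= margin.
    Needs at least two values per side and nonzero variance."""
    se = sqrt(variance(left) / len(left) + variance(right) / len(right))
    z = (abs(mean(left) - mean(right)) - margin) / se
    return 1 - NormalDist().cdf(z)
```

A small p-value supports an edge at that position. A large p-value by itself does not show that the edge is absent, which fits the ambiguous cases described above.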

lima1 commented 5 years ago

Are you referring to the new multi-sample segmentation described in section 10.1.3 of https://bioconductor.org/packages/devel/bioc/vignettes/PureCN/inst/doc/PureCN.pdf?

There is no command line tool yet, but it does not need the output of PureCN.R. Run Coverage.R on all tumor samples, and then call processMultipleSamples in R for each patient. You can then concatenate all of the output segmentations into one file that includes all samples. Then run PureCN.R as you normally would, adding --segfile and --funsegmentation Hclust.
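The concatenation step can be done with any tool. As one possible sketch, assuming each per-patient segmentation has been written out as tab-delimited text with an identical header line, something like the following would merge them. The function name `concat_segmentations` is hypothetical, and the file layout is an assumption, not something this thread specifies.

```python
import csv
import io

def concat_segmentations(tables):
    """Merge tab-delimited segmentation tables that share one header.

    tables: iterable of strings, each the full text of one segmentation file.
    Returns the combined text with a single header line.
    """
    header = None
    rows = []
    for text in tables:
        reader = csv.reader(io.StringIO(text), delimiter="\t")
        file_header = next(reader)
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError("segmentation files have different columns")
        rows.extend(row for row in reader if row)  # skip blank lines
    if header is None:
        raise ValueError("no segmentation tables given")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()
```

The combined text can then be written to a file and passed to PureCN.R through --segfile as described above.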

In general, all breakpoints PureCN reports are highly significant, but technical issues can affect many probes and thus result in low p-values. DNAcopy is doing a pretty good job (meaning it's probably hard to clean up downstream), but in your case, the multi-sample segmentation might indeed provide significant improvements.

tedtoal commented 5 years ago

No I wasn’t, hadn’t seen it. That’s great!

However, it looks to me like that segmentation provides the same breakpoint across all samples, whether or not a particular sample has a copy number breakpoint at that position. If I input that segmentation into PureCN.R, what is it going to do with that?

One reason I wanted to test breakpoints myself even though PureCN calls should be highly significant is that samples in which it is not called might nevertheless have a change in copy ratio at that point, just not reaching the point of significance. Or, they might NOT, and it would be nice to know that a test of the copy ratios on each side of the breakpoint showed no significant change in value. (Three possibilities: sig change, sig. no change, unable to call).
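One way to get the three outcomes is to pair a difference test with an equivalence test (two one-sided tests) on the same margin. The sketch below assumes the difference in mean copy ratio and its standard error are already computed, uses a normal approximation, and uses a hypothetical function name and default thresholds.

```python
from statistics import NormalDist

def call_edge(diff, se, margin=0.1, alpha=0.05):
    """Classify an edge as 'change', 'no change', or 'unable to call'.

    diff: difference in mean copy ratio across the edge (right minus left).
    se: standard error of that difference.
    """
    norm = NormalDist()
    # H0: |diff| <= margin; rejecting it supports a real change.
    p_change = 1 - norm.cdf((abs(diff) - margin) / se)
    # Two one-sided tests for H0: |diff| >= margin; rejecting both supports no change.
    p_lower = 1 - norm.cdf((diff + margin) / se)
    p_upper = norm.cdf((diff - margin) / se)
    p_same = max(p_lower, p_upper)
    if p_change < alpha:
        return "change"
    if p_same < alpha:
        return "no change"
    return "unable to call"
```

Noisy flanks or a difference close to the margin end up in "unable to call", which is where a low-frequency subclonal event would probably land.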

With the multi-sample segmentation in multipcf(), I would still need to test each sample at each edge, since it apparently now puts a segment edge into a sample whether the copy ratio changes or not.

ted

— Ted Toal, Postdoctoral Researcher Carvajal-Carmona Lab Dept. of Biochemistry and Molecular Medicine 4502 GBSF, One Shields Ave Davis, CA 956626 (530) 263-5986 twtoal@ucdavis.edu

lima1 commented 5 years ago

The segmentationHclust function will cluster segments and join consecutive ones when they are in the same cluster. It’s not perfect, but should clean it up a bit.
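The joining step can be illustrated as follows. This is only a sketch of the general idea and not PureCN's implementation: cluster labels are assumed to be given already, and the tuple layout, the function name, and the marker-weighted mean are all assumptions.

```python
def join_clustered_segments(segments):
    """Merge consecutive segments on the same chromosome with the same cluster.

    segments: ordered iterable of
        (chrom, start, end, num_mark, seg_mean, cluster) tuples.
    The merged seg_mean is weighted by the number of markers.
    """
    merged = []
    for chrom, start, end, num_mark, seg_mean, cluster in segments:
        prev = merged[-1] if merged else None
        if prev and prev["chrom"] == chrom and prev["cluster"] == cluster:
            total = prev["num_mark"] + num_mark
            prev["seg_mean"] = (
                prev["seg_mean"] * prev["num_mark"] + seg_mean * num_mark
            ) / total
            prev["num_mark"] = total
            prev["end"] = end
        else:
            merged.append(
                dict(chrom=chrom, start=start, end=end,
                     num_mark=num_mark, seg_mean=seg_mean, cluster=cluster)
            )
    return merged
```

A breakpoint that a multi-sample segmentation forced into a sample where the copy ratio does not change would disappear at this step if both neighbors fall in the same cluster.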

tedtoal commented 5 years ago

Is that function part of PureCN?

I need to update PureCN, but have been holding off. I’m working with 1.10.0.

lima1 commented 5 years ago

You'll need a recent GitHub or Bioconductor devel version for that. There have been no changes to the likelihood model since 1.10, and no major changes in general, so it should be a smooth update.

See https://github.com/lima1/PureCN/blob/master/NEWS