Cobalt issue with read counting in PacBio HiFi reads

proteinosome commented 8 months ago

Dear developers, I've been testing out HMFTools suite on PacBio HiFi sequencing dataset and it looks like straight out of the box Purple is able to estimate purity and ploidy very well for in-silico titrated cell lines (HCC1395 and COLO829), which is great!

However, upon closer examination the CNV segmentation isn't working very well. When I looked at Cobalt output, it looks like the read counts were pretty off. I'm running this on a 60X tumor/normal dataset but the median read count in 1 kbp bins were only 2 and 4 respectively. May I check with you if you know why Cobalt is counting the reads incorrectly? FWIW The depth output from Amber actually looks good.

Thank you.

p-priestley commented 8 months ago

We have never considered longer reads previously. COBALT use 1kb bins, based on the starting base of each read. For long reads, this will not work well as they will be sparse. Better of course would be for COBALT to count the average coverage in each 1kb window. I will look into this change and revert

Amber simply counts the depth at each base so there is no isssue.

proteinosome commented 8 months ago

Thank you @p-priestley that sounds good to me! Yeah if that's how the counting is done I could see why now. I think with that change the Purple suite should work really well. It has worked surprisingly well even with the count being far off. In COLO829 the somatic CNV segmentation actually looked good once Purple does a joint segmentation of Cobalt + Amber output.

hongwingl commented 7 months ago

Hi this count depth change has been committed to the code. We will test more and make releases soon.

proteinosome commented 7 months ago

Thanks @hongwingl. I did a couple of tests and indeed the depths are calculated correctly. However, there are now way too many segments likely due to the relatively high depth fluctuation from 1kbp windows. Setting gamma to 200 I'm getting 4k PCF segments in germline samples.

With CNVKit previously I've had to set the bin size to 10 kbp for long-reads, and I think we're probably running into similar issue here. Not sure if there's any straightforward way to try fixing this, but thanks for working on the depth count issue!

p-priestley commented 7 months ago

4000 segments for a germline sample does not sound ridiculous to me. That is inline with roughly what we would expect from the number of deletions and duplicaitons in the germline in a typical sample

proteinosome commented 7 months ago

Thanks @p-priestley ! I guess I had been used to seeing hundreds of segments at most in germline from other CNV caller where generally only very large events are called as segments. I'll do more investigation into the segments to see if I can make sense of them.

proteinosome commented 7 months ago

Here's a Circos plot of HCC1395(BL) from Purple FYI to show the high RD/BAF fluctuation. Do you think there's probably a need to smooth out the points?

p-priestley commented 7 months ago

The input looks very noisy compared to what I am used to, but I have never looked at PacBio data. I am also unfamiliar this cell line. Are you able to post the pictures for COLO829T also as I know that much better.

Is that with gamma set to 200 already? Perhaps you could set it higher (eg. 1000?)

Lastly, what is the depth here roughly? COBALT should consolidate to larger windows automatically if the average depth of coverage <8 ( https://github.com/hartwigmedical/hmftools/blob/master/cobalt/README.md#depth-window-consoldiation)

proteinosome commented 7 months ago

Long-reads have much lesser number of reads compared to short-reads at equal coverage, and I think that might be the reason why the read depths and BAF are relatively noisier with small bin windows.

Here's COLO829 with gamma 200 and gamma 1000:

COLO829 gamma200 circos

COLO829 gamma1000 circos

Both are 30X tumor, 30X normal.

I noticed as well that Purple is calling 100% purity and 1.5 ploidy by default but the solution is just barely better than 99% purity and 3.0 ploidy (which should be the correct ploidy). I wonder if that is due to the noisy nature or is that also seen in short-reads dataset?

p-priestley commented 7 months ago

Here is our result from high quality short read sequencing

p-priestley commented 7 months ago

Your 1000 result looks ok. Different noise characteristics may lead to different fits. There are various paratmers in PURPLE which can control this if you are really interested.

p-priestley commented 7 months ago

The input.png is MUCH cleaner for short read data:

See here:
https://github.com/hartwigmedical/hmftools/tree/master/purple#circos

proteinosome commented 7 months ago

Thanks @p-priestley for sharing. I think that's what I expected, too, in terms of the noise from short-reads based on some of my past observations. Perhaps a larger bin windows option for long-reads would be useful to smooth out the noise. That might have an effect in masking out focal CNV, but those should likely be captured by long-reads SV caller, anyway.

I'll stick with gamma=1000 for now.

FYI in case you guys are interested to look into HiFi dataset, both the COLO829 and HCC1395 datasets are available on our website: https://www.pacb.com/connect/datasets/

Thank you.

hartwigmedical / hmftools

Cobalt issue with read counting in PacBio HiFi reads #485