mskcc / facets

Algorithm to implement Fraction and Copy number Estimate from Tumor/normal Sequencing.

Purity estimate =NA despite copy numbers present #146

Open GeorgetteTanner opened 5 years ago

GeorgetteTanner commented 5 years ago

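Hi, I've noticed that the majority of my samples (WES, ~60-250x) are being predicted to have ploidy=2.00, purity=NA. I understand that this can happen when there are few copy number changes, but my samples seem to have a reasonable number of these. Is there anything I could try to fix this? I've copied an example below. Thanks. [image: image] https://user-images.githubusercontent.com/33323628/68875744-3ee56380-06fb-11ea-9b2f-76e8857f0a41.png [image: image] https://user-images.githubusercontent.com/33323628/68875773-47d63500-06fb-11ea-9f3b-8b632d934c16.png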

shenmskcc commented 5 years ago

The hyper-segmentation can affect downstream copy number and purity estimation.


veseshan commented 5 years ago

Increase cval to dampen the hyper-segmentation; the rest should then work better.
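To see why a larger critical value dampens hyper-segmentation, here is a minimal Python sketch of a toy recursive binary segmentation. It is not the FACETS algorithm: the statistic, the `noise_var` scaling, and the simulated log-ratio data are all assumptions made for illustration. The only point it shows is that a split is accepted only when its statistic exceeds `cval`, so in this toy raising `cval` can remove breakpoints but never add them.

```python
import random


def best_split(values, lo, hi):
    """Return (statistic, index) of the best mean-shift split of values[lo:hi]."""
    n = hi - lo
    total = sum(values[lo:hi])
    best_stat, best_idx = 0.0, None
    left_sum = 0.0
    for k in range(1, n):
        left_sum += values[lo + k - 1]
        mean_left = left_sum / k
        mean_right = (total - left_sum) / (n - k)
        # Weighted squared difference of the two segment means.
        stat = k * (n - k) / n * (mean_left - mean_right) ** 2
        if stat > best_stat:
            best_stat, best_idx = stat, lo + k
    return best_stat, best_idx


def segment(values, cval, noise_var, lo=0, hi=None):
    """Split recursively while statistic / noise_var exceeds cval; return breakpoints."""
    if hi is None:
        hi = len(values)
    if hi - lo < 2:
        return []
    stat, idx = best_split(values, lo, hi)
    if idx is None or stat / noise_var <= cval:
        return []
    left = segment(values, cval, noise_var, lo, idx)
    right = segment(values, cval, noise_var, idx, hi)
    return left + [idx] + right


# Hypothetical data: three true segments (means 0, 0.5, 0) plus Gaussian noise.
rng = random.Random(0)
noise_sd = 0.3
true_means = [0.0] * 300 + [0.5] * 300 + [0.0] * 300
log_ratios = [m + rng.gauss(0.0, noise_sd) for m in true_means]

n_segments_low = len(segment(log_ratios, cval=4, noise_var=noise_sd ** 2)) + 1
n_segments_high = len(segment(log_ratios, cval=50, noise_var=noise_sd ** 2)) + 1
print("segments with cval=4: ", n_segments_low)
print("segments with cval=50:", n_segments_high)
```

The low setting may split noise within a flat region, while the high one keeps only shifts that are large relative to the noise. What counts as an appropriate value for real data is not settled by this sketch.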

DarioS commented 4 years ago

How can the end user know if they have chosen an appropriate cval for their dataset? How should they judge that?

b-niu commented 4 years ago

I also care about how to set cval. How can we determine which cval is most appropriate? Is it best practice to use different values for each sample?

Indianhedgehog commented 1 year ago

The issue still remains even if you use the cval values suggested in the vignette!! I processed multiple WGS samples, and all of them have a ploidy of 2 and a purity of NA.

veseshan commented 1 year ago

@Indianhedgehog - are your samples still hyper-segmented even with high cval? That may be one contributing factor

Indianhedgehog commented 1 year ago

Hi @veseshan, thank you for your reply. What is the upper limit for hyper-segmented samples? I just finished an analysis with cval 500 and it seems to work. However, the purity and ploidy estimates are nowhere close to previously determined values (from another software tool). Also, I am not interested in determining purity and ploidy. Can I just ignore them and use the segments for further CNV analysis? Thanks again!!

veseshan commented 1 year ago

Can you post a plot (like the one at the top of this thread) of a sample where the estimate from facets is nowhere close to the previously determined value? It will be helpful to know what those values are too. You can ignore ploidy and purity if you are interested only in the relative copy numbers. If segments with allelic balance are called normal diploid (1+1) versus balanced tetraploid (2+2), then all the other copy numbers change too.
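The point about 1+1 versus 2+2 can be made concrete with a small calculation. The sketch below assumes the usual tumor/normal mixture relation logR - dipLogR = log2((2(1 - purity) + purity * t) / 2) for total copy number t. The function name, the purities, and the three segment log-ratios are hypothetical values chosen for illustration, not output from facets.

```python
import math


def total_copy_number(log_ratio, dip_log_r, purity):
    """Invert logR - dipLogR = log2((2*(1 - purity) + purity*t) / 2) for t."""
    scaled = 2.0 * 2.0 ** (log_ratio - dip_log_r)
    return (scaled - 2.0 * (1.0 - purity)) / purity


# Hypothetical segment log-ratios; the first one is the allelically balanced segment.
segment_log_ratios = [0.0, math.log2(0.75), math.log2(1.25)]

# Reading 1: the balanced segment is 1+1, so dipLogR sits at its log-ratio (0).
diploid_fit = [total_copy_number(lr, dip_log_r=0.0, purity=0.5)
               for lr in segment_log_ratios]

# Reading 2: the balanced segment is 2+2, so the 2-copy level lies one log2 unit lower.
tetraploid_fit = [total_copy_number(lr, dip_log_r=-1.0, purity=1.0)
                  for lr in segment_log_ratios]

print([round(t, 3) for t in diploid_fit])     # [2.0, 1.0, 3.0]
print([round(t, 3) for t in tetraploid_fit])  # [4.0, 3.0, 5.0]
```

The same three log-ratios fit integer copy numbers under both readings, but every segment gets a different call and the implied purity changes with it. The relative ordering of the segments is the same either way.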

Indianhedgehog commented 1 year ago

Sure!!

[image: H021-2RH2K9_buffy_coat02_tumor cnv]

[image: H021-J8TBFU_buffy_coat02_metastasis cnv]

veseshan commented 1 year ago

Both samples are hyper-segmented, but tumor content is high in both. The estimates for the bottom panel seem appropriate (clear signal in chromosomes 2, 7p and 10). The estimates for the top panel are wrong because the normal diploid level is estimated to be much higher than it should be, which results in chromosome 13 being called homozygously deleted (a whole chromosome cannot have 0 copies: too many genes would be lost for the cell to be viable). dipLogR for that sample should be close to zero and not at the purple line. You can try increasing cval to, say, 1000 and see if that dampens the hyper-segmentation. For WGS, 500 is too low, especially if coverage depth is low, say 50-80x.

Indianhedgehog commented 1 year ago

Thank you for your input. I reanalyzed the samples with cval 1000, but the dipLogR remains the same. I also checked the average coverage of the sample in the top panel; it is around 86. How should I proceed?
[image: H021-2RH2K9_buffy_coat02_tumor cnv]

veseshan commented 1 year ago

In the procSample call you can specify dipLogR. The automated selection gets tripped up by something in the data (I can't figure it out without the data), so you can specify the value yourself. From the figure, 0 may be a reasonable value. You can see what happens when you do that. By the way, what value of snp.nbhd did you use in preProcSample? The default of 250 is too low for WGS. You can try 1000.

Indianhedgehog commented 1 year ago

I did increase snp.nbhd to 1000, but the results remain the same. With dipLogR set to 0 and snp.nbhd set to 1000, here is the result. [image: H021-2RH2K9_buffy_coat02_tumor cnv]

I had some questions regarding the cval and also wanted some advice before applying it to the hundreds of samples.

1) I saw in some of the other posts people applying cval = [50, 1000]. I could not completely understand what these two values are and how they affect segmentation.

2) With dipLogR = NULL, facets is able to estimate correctly for some of the samples. Initially, I am planning to run the samples with the "NULL" value. After that, I will check the log-ratio plots individually and manually process the samples for which it is not able to estimate precisely. In order to process a sample correctly, should I always make the purple line follow the red dots closely throughout the chromosome? Please correct me if I am wrong!!

Thank you so much,
Rajesh

veseshan commented 1 year ago

The CN estimates of zero for 11q and 13 are still not realistic. Can you show what the naive estimates are (i.e. no emcncf call, just cf-naive prior to that)?

veseshan commented 1 year ago

On closer look, this sample has features that may have tripped up the algorithm. Note that chromosomes 10q and 13 have log-ratios that are close, but their log-odds-ratios are vastly different. So the total copy number cannot be 1 for them. The more logical fit is a total copy number of 3, with 2+1 for 10q and 3+0 for 13. Thus the diploid state is not at 0 but at a level even lower: something consistent with 4 copies at a log-ratio of zero and 3 copies at the log-ratio level of these segments. You can try that as dipLogR and see what the estimates are. If you can share some version of this data (consistent with all data privacy regulations), it will help improve the algorithm.

Thanks, Venkat
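The log-ratio versus log-odds-ratio argument above can be checked with a short calculation. This is a sketch under assumptions: it uses the standard tumor/normal mixture expressions for expected log-ratio (relative to the 2-copy level) and expected log-odds-ratio at heterozygous SNPs, and the purity of 0.8 and the function names are hypothetical, not taken from this sample or from facets.

```python
import math


def log_ratio_offset(total_cn, purity):
    """Expected logR minus dipLogR for a segment with the given total copy number."""
    return math.log2((2.0 * (1.0 - purity) + purity * total_cn) / 2.0)


def log_odds_ratio(major, minor, purity):
    """Expected log-odds-ratio at heterozygous SNPs for allele-specific copies major+minor."""
    return math.log((purity * major + 1.0 - purity) / (purity * minor + 1.0 - purity))


purity = 0.8  # hypothetical value, for illustration only

for major, minor in [(1, 0), (2, 1), (3, 0)]:
    total = major + minor
    print(f"{major}+{minor}: logR offset = {log_ratio_offset(total, purity):+.3f}, "
          f"logOR = {log_odds_ratio(major, minor, purity):.3f}")
```

With a total copy number of 1 the only possible state is 1+0, so two segments at that level would share one log-odds-ratio. With a total of 3, the states 2+1 and 3+0 share a log-ratio but differ widely in log-odds-ratio, which matches the pattern described for 10q and 13.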

Indianhedgehog commented 1 year ago

Hi @veseshan, sorry for the delay. I had to clarify the data-sharing policy. You should have received a Nextcloud link in your email (seshanv at mskcc dot org). Please let me know if you haven't. It's the same sample as in the top panel. (Also, please delete the data once you are done with the analysis.)

Regarding the plots, thank you for pointing out the issues. I totally overlooked them. Please let me know if you have any other suggestions.

Thank you,
Rajesh

veseshan commented 1 year ago

I got the data. Thanks.

veseshan commented 1 year ago

I have identified some issues with the data. A DM is better for discussing data specifics.

Indianhedgehog commented 1 year ago

Sure. My email is rajesh dot pal at dkfz-heidelberg dot de