mskcc / facets

Algorithm to implement Fraction and Copy number Estimate from Tumor/normal Sequencing.
137 stars 65 forks source link

Filtering and interpreting results #62

Open dariober opened 6 years ago

dariober commented 6 years ago

Sorry, I'm bombarding you with questions... I'd like to make sure my understanding is correct.

Typically one wants to filter out segments that are copy-number neutral. For this purpose, is it sufficient to filter out records where tcn.em = 2 or NA and lcn.em = 1 or NA? Is any other filtering recommended? E.g.:

datafile<- system.file("extdata", "stomach.csv.gz", package="facets")
rcmat<- readSnpMatrix(datafile)
xx<- preProcSample(rcmat)
oo<- procSample(xx,cval=150)
fit<- emcncf(oo)
cncf<- fit$cncf

# Uninteresting segments:
cncf[which((cncf$tcn.em == 2 & cncf$lcn.em == 1) | is.na(cncf$tcn.em) | is.na(cncf$lcn.em)),]

Related question, In order to assess the "strength" or enrichment of a segment I guess a useful metric is cnlr.median or cnlr.median.clust (these two appear to be almost perfectly correlated, so I guess it doesn't matter which one is chosen, any comment on this?) However, the raw cnlr.median should be adjusted for the logR of the diploid state (found in fit$dipLogR). Otherwise, with high or low ploidy the raw cnlr.median may be misleading.

For example, in the results plotted below segments with logR ~ 0 are still duplicated, so a useful metric to assess enrichment becomes: fit$cncf$cnlr.median - fit$dipLogR. Do you agree with this?

Thanks a lot again! Dario

facets

veseshan commented 6 years ago

Feel free to bombard away since it could help to improve the package.

You may not want to filter away all the lcn.em=NA segments. There can be clear focal amplifications i.e. large tcn.em but due to the sparsity of het SNPs we can't determine lcn unambiguously.

cnlr.median is the segment specific estimate. For the clust suffix segments that are similar are combined and given a cluster specific estimate. So they ought to be very similar. cnlr.median by itself is not informative since it depends on library size, ploidy etc. As you said the right quantity to use is cnlr.median - dipLogR.

Venkat

dariober commented 6 years ago

...Following this up... What about segments where tcn.em=2 and lcn.em=0?

This could be neutral segments if both tumour and normal are homozygote (hence lcn.em=0), so they could be filtered out. However, some of these could be loss-of-heterozygosity segments where the normal does have lcn.em=1 (instead cases where tcn.em != 2 and lcn.em=0 are definitely interesting).

I think to distinguish which case we have one needs to look at mafR.clust and see whether it is close to zero (neutral segment) or >0 (LOH). Now, I guess deciding when mafR.clust is convincingly > 0 may be tricky as it depends also on the number of markers in the segment (few markers are more likely to generate spurious deviations from 0). Any thoughts...?

Dario

veseshan commented 6 years ago

A segment that has tcn=2 and lcn=0 is always copy neutral LOH because in order to assign lcn the segment should have had enough het snps in the normal. If it didn't lcn assignment would have been NA. You can question whether min.nhet is large enough.

Venkat

dariober commented 6 years ago

Indeed, that makes sense. Let's try to recapitulate. Would you agree with this classification?

tcn.em lcn.em type
0 0 DEL
0 NA DEL
1 0 HEMIZYGOTE
1 NA HEMIZYGOTE
2 0 LOH
2 1 NEUTR
2 NA NEUTR/Unknown
2+ 0 DUP-LOH
2+ 1+ DUP
2+ NA DUP-[LOH?]
veseshan commented 6 years ago

Yes except that tcn=1 automatically implies lcn=0. So (1,NA) shouldn't occur (if it does it's a bug).

dariober commented 6 years ago

Hi - Thanks! I have a few cases where tcn=0 and lcn=NA, 9 out of 8486 (cumulated on several samples). These are the records, if useful:

chrom     start       end seg num.mark nhet segclust cnlr.median.clust   mafR.clust cf.em tcn.em lcn.em    sample_id 
 chrX     29133   2780079 171     3999 1035       20      -0.188145471  0.105767213     1      1     NA    276
 chrX   2780409  31879100 172    26156   31       15      -0.320095316           NA     1      1     NA    276
 chrX  31961182  58438972 174    22373   20       15      -0.320095316           NA     1      1     NA    276
 chrX  58439460  62366300 175      139    0       19      -0.244809231           NA     1      1     NA    276
 chrX  67532300  68324978 177      754    0       15      -0.320095316           NA     1      1     NA    276
 chrX  73961253  75429969 179     1357    0       15      -0.320095316           NA     1      1     NA    276
 chrX 155704979 156030162 181      603  131        8      -0.385180248 -0.001746385     1      1     NA    276
 chrX     21457   2781330  34     4363 1489        9       0.009380662 -0.001925520     1      1     NA    303
 chrX 155703951 156030162  36      622   79       10       0.009380662  0.003004941     1      1     NA    303
veseshan commented 6 years ago

Thanks. Will fix the bug.

brewert2 commented 4 years ago

I have a quick question regarding the ploidy/purity output. What does it mean when it says "75%" such as in " .purity.75%" and/or .ploidy.75%">

Thank you!