mskcc / facets

Algorithm to implement Fraction and Copy number Estimate from Tumor/normal Sequencing.

Setting purity=1 #174

Closed by RoniHaas 2 years ago

RoniHaas commented 2 years ago

Hello,

I was wondering if there is an option to define the purity as 1. If yes, could you please share how?

Thank you very much!

veseshan commented 2 years ago

If tumor purity is 1, then any segment with LOH (loss of heterozygosity) will have one of its alleles missing entirely. That blows up the odds ratio and can cause computational issues. I have never encountered a sample with a purity of 1, but if you have one, I would like to see whether the algorithm works or crashes. Thanks.
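
To see why a purity of exactly 1 is a problem, it helps to write out the expected log-odds-ratio for a segment. The following is a sketch under the usual mixture model for allele-specific copy number, not something quoted from this thread: the symbols are assumptions for illustration, with tumor fraction (purity) φ, major and minor tumor copy numbers m and p, and a normal component that contributes one copy of each allele.

```latex
\[
\log \mathrm{OR}(\phi) \;=\; \log \frac{m\,\phi + (1 - \phi)}{p\,\phi + (1 - \phi)}
\]
% LOH segment: the minor allele is lost in the tumor, so p = 0
\[
\log \mathrm{OR}(\phi)\Big|_{p = 0} \;=\; \log \frac{m\,\phi + (1 - \phi)}{1 - \phi}
\;\longrightarrow\; \infty \quad \text{as } \phi \to 1
\]
```

For a hemizygous loss (m = 1, p = 0) the numerator is 1, so the expected value is log(1 / (1 - φ)): about 2.30 at φ = 0.9 and about 4.61 at φ = 0.99, growing without bound as the normal contribution vanishes.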

RoniHaas commented 2 years ago

Thank you for your answer!

I am using prostate tumor cell lines, and it is hard to believe that these samples aren’t close to 100% purity.

These are WGS samples. My normal samples are unmatched random blood samples, with about the same coverage.

When I ran FACETS with the code below, the resulting purity estimates were too low for cell lines: 0.3 to 0.5.

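library(facets)

# Read the tumor/normal SNP read counts from the pileup file
rcmat <- readSnpMatrix(sample.pileup)
# Relabel chromosome 23 as X (column 1 is Chromosome)
rcmat[which(rcmat$Chromosome == '23'), 1] <- 'X'
# Pre-process, treating the normal as unmatched
xx <- preProcSample(
    rcmat,
    snp.nbhd = 2500,
    unmatched = TRUE,
    het.thresh = 0.1
)
# Segment the genome
oo <- procSample(
    xx,
    cval = 1500
)
# Estimate purity, ploidy, and copy numbers
fit <- emcncf(oo)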

I am happy to try the solution of purity=1 and let you know if the algorithm works or crashes.

If you could suggest how to change the code to set purity=1 that would be very helpful. Thanks!

veseshan commented 2 years ago

Can you provide an example figure? Cell lines are notoriously fragmented and noisy.

RoniHaas commented 2 years ago

Sure, thank you for the help!

[Figure: FACETS output; ploidy 2.8, purity 0.36]

Just one note about that: I made a small modification to your code. With --unmatched, tumor read coverage of at least 50 is required.

That created lots of "gaps" in the logOR panels where no SNPs are shown. Since the threshold was hardcoded, I manually changed my local copy to require coverage >=25. That solved the problem (example below).

[Figure: before my change, with the original code]

[Figure: using my code with coverage >=25]
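
The effect of the coverage cutoff described in this comment can be sketched in a few lines of R. This is a toy illustration only, not the FACETS source: `call_het()`, its arguments, and the SNP values are all hypothetical. It assumes that, in unmatched mode, a SNP is treated as heterozygous when its tumor allele fraction is far enough from 0 and 1 and its tumor depth meets a minimum.

```r
# Toy illustration only: NOT the FACETS source code.
# call_het() and its argument names are hypothetical.
call_het <- function(depth, vaf, het.thresh = 0.1, min.depth = 50) {
  # Heterozygous if the allele fraction is away from 0 and 1
  # and the tumor depth meets the cutoff.
  pmin(vaf, 1 - vaf) > het.thresh & depth >= min.depth
}

# Made-up SNPs at roughly 25x to 60x tumor coverage
snps <- data.frame(
  depth = c(28, 33, 41, 47, 52, 58, 26, 39),
  vaf   = c(0.45, 0.52, 0.05, 0.61, 0.48, 0.97, 0.35, 0.55)
)

het50 <- call_het(snps$depth, snps$vaf, min.depth = 50)  # stricter cutoff
het25 <- call_het(snps$depth, snps$vaf, min.depth = 25)  # relaxed cutoff

# Number of SNPs that would contribute to the logOR panel in each case
cat("depth >= 50:", sum(het50), "\n")
cat("depth >= 25:", sum(het25), "\n")
```

On moderate-coverage WGS data most SNPs fall below a depth of 50, so the stricter cutoff leaves few heterozygous calls and the logOR track shows gaps. Relaxing it recovers more SNPs, at the cost of noisier allele fractions at low depth.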

veseshan commented 2 years ago

Cell lines are always problematic. For instance, you can see that both chr1p and chr3p have balanced regions (log-odds-ratio close to 0) and imbalanced regions (log-odds-ratio around +/-2) interspersed throughout. This doesn't make any sense, and the algorithm settles somewhere in the middle. I am sorry, but I don't have a good solution for weird scenarios like this.

RoniHaas commented 2 years ago

Thank you, I appreciate your help