uclahs-cds / project-method-AlgorithmEvaluation-BNCH-000082-SRCRNDSeed

GNU General Public License v2.0

Battenberg Analysis #97

Closed (philsteinberg closed 1 year ago)

philsteinberg commented 1 year ago

Parse num subclones:

- Create relative seed comparison (Lydia plot) #99
  - Mutect2
  - Strelka2
  - SomaticSniper
- Create box plot num of subclones called mean, sd #101
  - Mutect2
  - Strelka2
  - SomaticSniper
- Create plot comparison for num of CCF
- Do statistical test on variation in num of subclones called

Other files in pipeline output:

- Create plot comparison for num of SNVs
- Create plot comparison for PGA

philsteinberg commented 1 year ago

BOXPLOTS

2023-04-03_proj-seed_SomaticSniper-Battenberg-DPClust-sr_box
2023-04-03_proj-seed_SomaticSniper-Battenberg-PyClone-VI-mr_box
2023-04-03_proj-seed_SomaticSniper-Battenberg-PyClone-VI-sr_box
2023-04-03_proj-seed_Strelka2-Battenberg-DPClust-sr_box
2023-04-03_proj-seed_Strelka2-Battenberg-PyClone-VI-mr_box
2023-04-03_proj-seed_Strelka2-Battenberg-PyClone-VI-sr_box
2023-04-03_proj-seed_Mutect2-Battenberg-DPClust-sr_box
2023-04-03_proj-seed_Mutect2-Battenberg-PyClone-VI-mr_box
2023-04-03_proj-seed_Mutect2-Battenberg-PyClone-VI-sr_box

philsteinberg commented 1 year ago

@lydiayliu Above are some of the parsed pipeline outputs and boxplots.

For barplots of number of subclones called, would it make most sense to just compare a pipeline's sr and mr outputs per sample?

And should I still recreate your relative seed variability plots, with the modification of counting the number of times a seed calls more or fewer subclones than the rest of the seeds (rather than just the first seed)?

We had talked about quantifying how often a seed calls more or fewer subclones (mean/sd from the boxplot) than the majority (for both sr and mr, so (14-1) + (7-1) = 19 samples) and then creating a permutation test. Any guidance on what type of test would be best for this?

I was also trying to parse the PhyloWGS output with your script and my pipeline outputs but am having some issues with replicating the results. Could you please give me the full directory path for your test files so I can troubleshoot?

lydiayliu commented 1 year ago

This is great!! We are seeing variability (which is awesome), and it would be interesting to see if we can associate that variability with tumour traits (such as number of SNVs, PGA, etc.).

> For barplots of number of subclones called, would it make most sense to just compare a pipeline's sr and mr outputs per sample?

I think the sr reconstructions and mr reconstructions would be different sections in the paper. So in the mr section we can definitely comment on whether we see more or less variability in mr than in sr. I wouldn't repeat the data from the sr section in the mr plots though. Also, for the mr plots, can you remove the 7 patients that don't have data?

> And should I still recreate your relative seed variability plots, with the modification of counting the number of times a seed calls more or fewer subclones than the rest of the seeds (rather than just the first seed)?

I liked that plot because it showed the individual seeds, so we can immediately see if some seeds promoted more subclones (which would be a very interesting and shocking discovery). I think you can change the "baseline" comparison to the "mode" number of subclones observed across the 10 seeds?
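A minimal sketch of that mode-baseline comparison, in Python for illustration only (the project's own parsing scripts are in R). The function names, the nested-dictionary input layout, and the tie-breaking rule (smallest mode when several counts are equally common) are assumptions, not anything taken from the pipeline.

```python
from statistics import multimode


def deviation_from_mode(counts_by_seed):
    """Subtract the modal subclone count from each seed's count for one sample.

    counts_by_seed: dict mapping seed label -> number of subclones called.
    Assumption: when several counts tie for most common, use the smallest.
    """
    baseline = min(multimode(counts_by_seed.values()))
    return {seed: n - baseline for seed, n in counts_by_seed.items()}


def tally_direction(samples):
    """Count, per seed, how often it calls more, fewer, or the same number
    of subclones as the per-sample mode.

    samples: dict mapping sample ID -> {seed label -> subclone count}.
    """
    tally = {}
    for counts_by_seed in samples.values():
        for seed, dev in deviation_from_mode(counts_by_seed).items():
            entry = tally.setdefault(seed, {"more": 0, "fewer": 0, "same": 0})
            if dev > 0:
                entry["more"] += 1
            elif dev < 0:
                entry["fewer"] += 1
            else:
                entry["same"] += 1
    return tally
```

The per-seed tallies could then feed the relative seed plot directly, with one bar or row per seed.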

> We had talked about quantifying how often a seed calls more or fewer subclones (mean/sd from the boxplot) than the majority (for both sr and mr, so (14-1) + (7-1) = 19 samples) and then creating a permutation test. Any guidance on what type of test would be best for this?

I'm not sure I get what you mean by this part: "(for both sr and mr so (14-1) + (7-1) = 19 samples)". What I envisioned was that, for each sample, you permute the (number of subclones called - median number of subclones) values across seeds. For each round of permutation (after permuting within every sample), you calculate an "average deviation from median" across all samples for each seed. After 10,000 permutations, you get a null distribution of "average deviation from median" per seed. With this null distribution you can get a p-value for whether your observed average deviation from median is significant.
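The permutation scheme described above can be sketched as follows. This is Python for illustration only, under stated assumptions: the function names and input layout are hypothetical, every sample is assumed to have a count for every seed, and the p-value is taken as two-sided with a +1 correction so it is never exactly zero. None of this is the project's actual implementation.

```python
import random
from statistics import median


def deviations(samples, seeds):
    """Per sample, return (count - sample median) for each seed, in `seeds` order."""
    rows = []
    for counts_by_seed in samples.values():
        med = median(counts_by_seed[s] for s in seeds)
        rows.append([counts_by_seed[s] - med for s in seeds])
    return rows


def seed_permutation_test(samples, seeds, n_perm=10_000, rng=None):
    """Permutation test of each seed's average deviation from the median.

    samples: dict mapping sample ID -> {seed label -> subclone count}.
    Returns (observed, p_values), both dicts keyed by seed label.
    """
    rng = rng or random.Random()
    dev = deviations(samples, seeds)
    n_samples = len(dev)
    n_seeds = len(seeds)

    # Observed statistic: mean deviation across samples, per seed.
    observed = [sum(row[j] for row in dev) / n_samples for j in range(n_seeds)]

    exceed = [0] * n_seeds
    for _ in range(n_perm):
        # Shuffle the seed labels independently within each sample.
        shuffled = [rng.sample(row, n_seeds) for row in dev]
        for j in range(n_seeds):
            null_stat = sum(row[j] for row in shuffled) / n_samples
            if abs(null_stat) >= abs(observed[j]):
                exceed[j] += 1

    # Two-sided p-value with +1 correction (an assumption of this sketch).
    p_values = [(e + 1) / (n_perm + 1) for e in exceed]
    return dict(zip(seeds, observed)), dict(zip(seeds, p_values))
```

Because this yields one p-value per seed (10 in total here), some multiple-testing adjustment would probably be wanted before calling any single seed significant.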

> I was also trying to parse the PhyloWGS output with your script and my pipeline outputs but am having some issues with replicating the results. Could you please give me the full directory path for your test files?

Hmm, try the following, but copy them to a different directory first:

/hot/project/disease/ProstateTumor/PRAD-000005-293PT/PhyloWGS/SomaticSniper-TITAN/CPCG0100/CPCG0100.mutass.zip
/hot/project/disease/ProstateTumor/PRAD-000005-293PT/PhyloWGS/SomaticSniper-TITAN/CPCG0100/CPCG0100.summ.json
/hot/project/disease/ProstateTumor/PRAD-000005-293PT/PhyloWGS/SomaticSniper-TITAN/CPCG0100/CPCG0100.muts.json

Happy to help if you can post specific errors as well! It is going to take a while to digest that script, sorry :P

philsteinberg commented 1 year ago

Get the n_clones table from the consensus trees of the SomaticSniper-Battenberg-PhyloWGS-sr run:

module load R
Rscript ./parse_num_subclones_PhyloWGS.R \
-i /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-somaticsniper-battenberg-phylowgs/output/consensus_tree \
-o /hot/project/method/AlgorithmEvaluation/BNCH-000082-SRCRNDSeed/pipeline-call-src/run-somaticsniper-battenberg-phylowgs/output \
-p somaticsniper_battenberg_phylowgs_sr