mskcc / facets

Algorithm to implement Fraction and Copy number Estimate from Tumor/normal Sequencing.

How to resolve variability in results due to different random seeds? #201

Open ehhfy opened 2 months ago

ehhfy commented 2 months ago

Thanks for creating this tool! The workflow is clear and easy to follow.

I've noticed some concerning variability when I set different random seeds. I understand from #53 that preProcSample randomly selects SNPs from the input matrix, which affects downstream results. I ran 10 tests on the same SNP matrix file with seeds set from 1 to 10 and collected summary statistics after running the whole workflow. The results are below:

   n_snps n_hets   pct_hets n_segs    dipLogR    purity   ploidy
1  422984  33283 0.07868619    386 -0.8568592 0.8300489 3.954321
2  422849  33294 0.07873733    350 -0.9517581 0.8853103 4.110511
3  422796  33265 0.07867861    368 -0.9274981 0.8898704 4.027205
4  422947  33281 0.07868835    356 -0.6327828 0.9080758 3.212570
5  422948  33294 0.07871890    335 -0.9326662 0.8872234 4.048639
6  422914  33272 0.07867321    363 -0.9533264 0.8823996 4.122240
7  422891  33281 0.07869877    357 -0.9079667 0.9072686 3.931950
8  422910  33294 0.07872597    358 -0.6944477 0.9000032 3.373917
9  422837  33274 0.07869226    352 -0.7299356 0.9003956 3.462835
10 422731  33292 0.07875457    351 -0.9501981 0.8891565 4.096679

The minimal variation in the first 4 columns makes sense to me as a result of randomizing the SNP selection. However, the ranges of values in dipLogR, purity, and ploidy are somewhat concerning (similar to the results noted in #59). The range of ploidy values is the hardest to interpret; I'm not sure how confident to be in any given result when they vary so much.

My instinct is to use these random tests to infer a distribution of values for purity and ploidy, and report the mean or median as the most likely result. Any other suggestions for rationalizing these differences?
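To make the aggregation idea concrete, here's a rough Python sketch using the values from the table above. One caveat I'm aware of: a summary statistic like this is only meaningful if all seeds converge to the same underlying fit, which may not hold.

```python
from statistics import median

# Per-seed estimates copied from the summary table above (seeds 1-10)
purity = [0.8300489, 0.8853103, 0.8898704, 0.9080758, 0.8872234,
          0.8823996, 0.9072686, 0.9000032, 0.9003956, 0.8891565]
ploidy = [3.954321, 4.110511, 4.027205, 3.212570, 4.048639,
          4.122240, 3.931950, 3.373917, 3.462835, 4.096679]

# The median is more robust than the mean if a few seeds land on a different fit
print(f"median purity: {median(purity):.4f}")  # ~0.8895
print(f"median ploidy: {median(ploidy):.4f}")  # ~3.9908
```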

On the surface, these ranges may not seem that significant, but I'm also noticing pretty big differences in the CN/CF plots with different random seeds (e.g., large chromosomal segments being called at very disparate CNs across different iterations). I can share these plots if that would be helpful. The resulting spider plots are pretty similar overall; does it make sense to use those to select the "best" fit and consider that iteration the most likely "true" result for downstream analysis? Just want to make sure I'm making accurate inferences in light of this apparent variability.

Any insights would be greatly appreciated! Thanks.

veseshan commented 1 month ago

The issue may be due to the sample rather than the procedure. It would help if you could show the plot for one of the runs.

ehhfy commented 1 month ago

Thanks for the response. I've attached GIFs looping through some example plots below (all from the same tumor/normal pair, run with 10 different seeds for SNP sampling). The average ploidy values cluster around either 1.8 or 3.6, which suggests to me that the program is struggling to pick an appropriate value for dipLogR. The differences between these two modes are most prominent in chrs 6, 7, and 15 (regions called as LOH with CN 1:0 vs. non-LOH with discrepant amplifications). However, there are also substantial differences between runs that infer similar average dipLogR and ploidy values; for example, chr 7p appears to be called with CN 1:1 in some iterations and wholly deleted in others. The spider plots seem reasonably well-fit for all runs, so it's not immediately clear how to select the "best" result. Any thoughts would be appreciated!

[Attached GIFs: CBX163-1 P1_cn, CBX163-1 P1_spider]

For reference, here are the statistics for the 10 runs of this sample (different from the table in my initial post above):

   n_snps n_hets   pct_hets n_segs    dipLogR    purity   ploidy
1  308300  20208 0.06554655    134  0.2084015 0.9623717 1.720473
2  308304  20208 0.06554570    144  0.1369202 0.9712002 1.813547
3  308305  20208 0.06554548    141  0.2114481 0.9630298 1.716873
4  308301  20208 0.06554633    138  0.2099016 0.9647189 1.719289
5  308304  20208 0.06554570    144 -0.7680490 0.8688760 3.618103
6  308305  20208 0.06554548    136  0.1365152 0.9703538 1.813911
7  308304  20208 0.06554570    138 -0.7525340 0.8640517 3.584973
8  308305  20208 0.06554548    145 -0.7499209 0.8550554 3.594518
9  308306  20208 0.06554527    141  0.2101310 0.9650685 1.719105
10 308306  20208 0.06554527    141  0.2101587 0.9636955 1.718671
veseshan commented 1 month ago

Runs 5, 7 and 8 seem different from the rest and don't seem right. You should look at what the flags are for each fit. You are also not specifying the dipLogR used in the fit when you generate the spider plot; without it, it is hard to assess the quality of the fit.

ehhfy commented 1 month ago

Hi again,

I agree; runs 5/7/8 are definitely the outliers. I should mention that the GIFs loop through the 10 iterations in order of increasing ploidy; the actual order of runs is (3, 10, 9, 4, 1, 2, 6, 7, 8, 5). All of the emflags are NULL (unlike some of the other samples I've tested, for which "Noisy sample, Calls can be unreliable." is a common flag).
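For what it's worth, those three runs can be flagged programmatically with a simple deviation-from-median check on the ploidy column. This is just an ad hoc sketch (not part of the facets package), with the values copied from the table above:

```python
from statistics import median

# Ploidy estimates from the 10 seeded runs above (runs 1-10, in order)
ploidy = [1.720473, 1.813547, 1.716873, 1.719289, 3.618103,
          1.813911, 3.584973, 3.594518, 1.719105, 1.718671]

med = median(ploidy)  # the majority of runs cluster near ~1.7-1.8
outliers = [i + 1 for i, p in enumerate(ploidy) if abs(p - med) > 1.0]
print(outliers)  # the runs that jumped to the ~3.6 solution
```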

You're correct about the spider plots: I accidentally generated them with dipLogR = 0 for all runs, which isn't useful. Below are the corrected spider plots, with the dipLogR value set appropriately for each run:

[Attached GIFs: CBX163-1 P1_cn, CBX163-1 P1_spider]

Now it's much clearer to see the difference between the first 7 runs (largely similar CNs inferred) and the last 3 runs (the outliers). Do you have any suggestions for general principles to use in evaluating the spider plots/selecting the most reasonable CN profile?