diploSHIC preds - Githubissues

andrewkern commented 1 year ago

this PR adds a diploshic sweep detection pipeline to the sweep workflow.

there are a few steps here:

1. in a diploshic.snake workflow we train a diploshic classifier under a 'vanilla' parameterization.
2. diploshic feature vectors are calculated from the VCF output of the sweep simulations.
3. diploshic prediction is performed using the trained model on the feature vector files.

at the end of the day three new files are created within the individual simulation subdirs, e.g.

results/simulated_data/sweeps/sweep/OutOfAfrica_3G09/CEU/NA/NA/0.03/1/1501552846/
├── sim_chr1_121990000_126990000.diploshic.ancFile
├── sim_chr1_121990000_126990000.diploshic.fv
├── sim_chr1_121990000_126990000.diploshic.preds

the .preds file has the windowed predictions and their class probabilities. the lines of those files look like this:

chrom   classifiedWinStart  classifiedWinEnd    bigWinRange predClass   prob(neutral)   prob(likedSoft) prob(linkedHard)    prob(soft)  prob(hard)
1   450001  550000  1-1100000   neutral 0.974178    0.025754    0.000050    0.000012    0.000006
1   550001  650000  100001-1200000  neutral 0.976588    0.023380    0.000025    0.000004    0.000003
1   650001  750000  200001-1300000  neutral 0.942536    0.057163    0.000201    0.000010    0.000090
1   750001  850000  300001-1400000  neutral 0.976987    0.022642    0.000096    0.000037    0.000238

so the coordinates output here are those of the simulated chunk (NOT THE CHROMOSOME COORDS). for instance, the first big window spans bases 1-1100000 and is made up of 11 subwindows; the central subwindow, the one being classified, runs from 450001 to 550000.
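For anyone wanting to read these files downstream, here is a minimal parsing sketch in Python. It is not part of the PR: the `Pred` record, the `parse_preds` function, and the class labels in `CLASSES` are hypothetical names chosen for illustration. The probability columns are read by position rather than by header name, and the fields are assumed to be whitespace-separated as in the excerpt above.

```python
from typing import NamedTuple

# Labels for the five probability columns, in the order they appear in the
# header. These names are our own (assumption), not taken from diploSHIC.
CLASSES = ("neutral", "linkedSoft", "linkedHard", "soft", "hard")


class Pred(NamedTuple):
    chrom: str
    win_start: int   # classifiedWinStart (chunk coordinates)
    win_end: int     # classifiedWinEnd (chunk coordinates)
    big_start: int   # first base of bigWinRange
    big_end: int     # last base of bigWinRange
    pred_class: str
    probs: dict      # class label -> probability


def parse_preds(lines):
    """Parse the lines of a .preds file (header included) into Pred records."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "chrom":
            continue  # skip blank lines and the header row
        big_start, big_end = (int(x) for x in fields[3].split("-"))
        probs = dict(zip(CLASSES, map(float, fields[5:10])))
        records.append(
            Pred(fields[0], int(fields[1]), int(fields[2]),
                 big_start, big_end, fields[4], probs)
        )
    return records
```

Calling `parse_preds(open(path))` on a `.preds` file would give one record per classified window, with all coordinates still relative to the simulated chunk.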

@mufernando @nspope if you have a chance, could you check this out? Can this output be aggregated easily enough into what you already have?

andrewkern commented 1 year ago

had a bit of conversation in stdpopsim coding hour with @mufernando and @dschride about how best to summarize diploshic results. Because diploshic does a sliding window prediction, each simulated chunk is associated with ~40 predictions (given the window sizes currently used). One proposal is to summarize the FPR and TPR at each of those 40 locations separately.
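As a rough illustration of that proposal, the Python sketch below first reproduces the count of predictions per chunk and then tabulates, for each window position, the fraction of replicates called as a sweep. The function names are hypothetical, and treating `hard` and `soft` as the "positive" calls is an assumption. Whether the resulting rate is read as an FPR or a TPR depends on whether the replicates passed in are neutral or sweep simulations.

```python
def n_predictions(chunk_len, subwin_len, n_subwins):
    """Number of sliding-window predictions that fit in one simulated chunk."""
    total_subwins = chunk_len // subwin_len
    # each prediction needs a full block of n_subwins adjacent subwindows
    return max(total_subwins - n_subwins + 1, 0)


def positive_rate_by_position(replicates, positive=("hard", "soft")):
    """Fraction of replicates called positive at each window position.

    replicates: list of equal-length lists of predClass strings, one list
    per simulated chunk, ordered by window position.
    """
    n_positions = len(replicates[0])
    rates = []
    for i in range(n_positions):
        hits = sum(rep[i] in positive for rep in replicates)
        rates.append(hits / len(replicates))
    return rates
```

With a 5 Mb chunk, 100 kb subwindows, and 11 subwindows per big window, `n_predictions(5_000_000, 100_000, 11)` gives 40, which matches the ~40 predictions per chunk mentioned above.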

andrewkern commented 1 year ago

okay @mufernando i think this is ready. want to take another look?

nspope commented 1 year ago

Looks good! The only thing I'm unclear on is how the output should get aggregated with the other stats. CLR and diversity (and presumably any other "easy to compute" summary stats we decide to add) are put into the same file in dump_results: https://github.com/popsim-consortium/analysis2/blob/5452aade34c867604c83bcbc4298eb0b4fe2c560/workflows/sweep_simulate.snake#L426-L433

should we also parse/postprocess diploshic's output here, append metadata, and write out in the same format as for CLR, etc?

mufernando commented 1 year ago

This ran fine on talapas and everything looks good. I think we can merge this and think about aggregating with the other summaries in a different PR. what do you think @nspope and @andrewkern ?

nspope commented 1 year ago

sure let's follow up with another PR that does the aggregation (& maybe produces plots)

mufernando commented 1 year ago

@nspope and I think we should just get the probability of a sweep within the central window and use that as the summary. And we will need to change what we do with CLR currently (getting the max within the 5Mb) and instead get the CLR for where the sweep happens (the center of the 5Mb window).
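A minimal Python sketch of that summary, under stated assumptions: "probability of a sweep" is taken here to be prob(soft) + prob(hard), the rows are plain dicts with hypothetical keys (`start`, `end`, `prob_soft`, `prob_hard`), and the CLR scan is represented as parallel lists of positions and values. None of these names come from the existing workflow.

```python
def central_row(rows, center):
    """Return the prediction row whose classified window contains `center`."""
    for row in rows:
        if row["start"] <= center <= row["end"]:
            return row
    return None  # no classified window covers the center


def sweep_probability(row):
    """Assumed definition: soft and hard sweep probabilities summed."""
    return row["prob_soft"] + row["prob_hard"]


def value_nearest_center(positions, values, center):
    """Value of a scan statistic (e.g. CLR) at the position closest to center."""
    i = min(range(len(positions)), key=lambda j: abs(positions[j] - center))
    return values[i]
```

For a 5 Mb chunk the center would be at about 2,500,000 in chunk coordinates, so `central_row(rows, 2_500_000)` picks the window covering the sweep site, and `value_nearest_center` replaces taking the maximum CLR over the whole chunk.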

andrewkern commented 1 year ago

Sounds good to me.

popsim-consortium / analysis2

diploSHIC preds #105