Closed andrewkern closed 1 year ago
had a bit of conversation in stdpopsim coding hour with @mufernando and @dschride about how best to summarize diploshic results. Because diploshic does a sliding window prediction, each simulated chunk is associated with ~40 predictions (given the window sizes currently used). One proposal is to summarize the FPR and TPR at each of those 40 locations separately.
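A minimal sketch of the per-location summary proposed above, assuming each simulated replicate is reduced to a truth flag (sweep simulation or not) plus its list of predicted classes, one per window position. The `SWEEP_CLASSES` label set and the `rates_by_position` name are hypothetical, not taken from the workflow.

```python
from collections import defaultdict

# ASSUMPTION: which predicted class labels count as a "sweep" call.
SWEEP_CLASSES = {"hard", "soft"}


def rates_by_position(replicates):
    """Compute (TPR, FPR) separately at each window position.

    replicates: iterable of (is_sweep_sim, [predicted class per position]).
    Returns {position: (tpr, fpr)}; a rate is None when there are no
    simulations of the relevant type at that position.
    """
    # counts[pos][truth] = [number called as sweep, number of replicates]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for is_sweep_sim, preds in replicates:
        for pos, pred in enumerate(preds):
            tally = counts[pos][bool(is_sweep_sim)]
            tally[0] += pred in SWEEP_CLASSES
            tally[1] += 1

    rates = {}
    for pos in sorted(counts):
        tp, n_sweep = counts[pos][True]
        fp, n_neutral = counts[pos][False]
        tpr = tp / n_sweep if n_sweep else None
        fpr = fp / n_neutral if n_neutral else None
        rates[pos] = (tpr, fpr)
    return rates
```

With ~40 predictions per chunk this would give ~40 (TPR, FPR) pairs, which could then be plotted against distance from the sweep site.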
okay @mufernando i think this is ready. want to take another look?
Looks good! The only thing I'm unclear on is how the output should get aggregated with the other stats. CLR and diversity (and presumably any other "easy to compute" summary stats we decide to add) are put into the same file in dump_results: https://github.com/popsim-consortium/analysis2/blob/5452aade34c867604c83bcbc4298eb0b4fe2c560/workflows/sweep_simulate.snake#L426-L433
should we also parse/postprocess diploshic's output here, append metadata, and write out in the same format as for CLR, etc?
This ran fine on talapas and everything looks good. I think we can merge this and think about aggregating with the other summaries in a different PR. what do you think @nspope and @andrewkern ?
sure let's follow up with another PR that does the aggregation (& maybe produces plots)
@nspope and I think we should just get the probability of a sweep within the central window and use that as the summary. And we will need to change what we do with CLR currently (getting the max within the 5Mb) and instead get the CLR for where the sweep happens (the center of the 5Mb window).
Sounds good to me.
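A rough sketch of the central-window summary agreed on here, under stated assumptions: predictions are already parsed into dicts with `start`, `end`, `prob_hard` and `prob_soft` keys (hypothetical names), "probability of a sweep" is taken to mean the sum of the hard and soft class probabilities, and the CLR scan is available as parallel lists of positions and values. None of these names come from the workflow itself.

```python
def central_window(rows, center):
    """Return the row whose 1-based inclusive [start, end] contains center."""
    for row in rows:
        if row["start"] <= center <= row["end"]:
            return row
    raise ValueError(f"no window contains position {center}")


def sweep_probability(row, sweep_keys=("prob_hard", "prob_soft")):
    """ASSUMPTION: P(sweep) is the sum of the hard and soft class probabilities."""
    return sum(row[key] for key in sweep_keys)


def clr_at_center(positions, clr_values, center):
    """Return the CLR value at the scan position closest to center,
    instead of the maximum over the whole region."""
    if not positions or len(positions) != len(clr_values):
        raise ValueError("positions and clr_values must be non-empty and equal length")
    nearest = min(range(len(positions)), key=lambda i: abs(positions[i] - center))
    return clr_values[nearest]
```

Both summaries then refer to the same location (the center of the 5Mb window), which keeps the diploshic and CLR results comparable.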
this PR adds a diploshic sweep detection pipeline to the sweep workflow.
there are a few steps here:
- in the diploshic.snake workflow we train a diploshic classifier under a 'vanilla' parameterization.
- diploshic feature vectors are calculated from the vcf output of the sweep simulations.
- diploshic prediction is performed using the trained model on the feature vector files.

at the end of the day three new files are created within the individual simulation subdirs.

the .preds file has the windowed predictions and their probabilities. the coordinates output there are positions on the simulated chunk (NOT THE CHROMOSOME COORDS). for instance, the first big window was from bases 1-1100000, 11 subwindows are used, and the centermost subwindow, the one being classified, runs from 450001 to 550000.
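Since the predictions are reported in chunk coordinates, aggregating them with other stats may require shifting them onto the chromosome. A minimal sketch, assuming window coordinates are 1-based and inclusive and that a hypothetical `chunk_left` value gives the 0-based chromosome position of the chunk's first base (the workflow may use a different convention):

```python
def chunk_to_chrom(start, end, chunk_left):
    """Shift a 1-based inclusive chunk window onto chromosome coordinates.

    ASSUMPTION: chunk_left is the 0-based chromosome position of the chunk's
    first base, so chunk base 1 maps to chromosome base chunk_left + 1.
    """
    if start < 1 or end < start:
        raise ValueError(f"bad window: {start}-{end}")
    return chunk_left + start, chunk_left + end


# The classified window from the example above, on a chunk assumed to start
# 5,000,000 bases into the chromosome.
print(chunk_to_chrom(450001, 550000, chunk_left=5_000_000))
# prints (5450001, 5550000)
```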
@mufernando @nspope if you have a chance, could you check this out? Can this output be aggregated easily enough into what you already have?