popsim-consortium / analysis2

Analysis for the second consortium paper.
8 stars 14 forks source link

adding summary stat workflow #91

Closed andrewkern closed 1 year ago

andrewkern commented 1 year ago

Adding a separate workflow to compute summary statistics along simulated chromosomes.

The basic flow is tree sequence -> vcf -> diploSHIC, which I use to compute windowed summary statistics from each of the simulated chrms / seeds / populations / etc

The outputs are in a new directory produced by the workflow called results/summaries and the major products are csv files with the windowed stats summarized across seeds for each model/dfe/annotation/population/chrm combination along with plots of a subset of those stats -- currently pi, tajD, and ZnS

The directory structure of the output is as so for a given configuration with two seeds


results/summaries/OutOfAfrica_3G09
├── Gamma_H17
│   └── ensembl_havana_104_exons
│       ├── 1675701734
│       ├── 1845187043
│       ├── sim_chr21.allseeds.CEU.diploshic.stats
│       ├── sim_chr21.allseeds.CEU.pi.png
│       ├── sim_chr21.allseeds.CHB.diploshic.stats
│       ├── sim_chr21.allseeds.CHB.pi.png
│       ├── sim_chr21.allseeds.YRI.diploshic.stats
│       └── sim_chr21.allseeds.YRI.pi.png
└── none
    └── none
        ├── 1675701734
        ├── 1845187043
        ├── sim_chr21.allseeds.CEU.diploshic.stats
        ├── sim_chr21.allseeds.CEU.pi.png
        ├── sim_chr21.allseeds.CHB.diploshic.stats
        ├── sim_chr21.allseeds.CHB.pi.png
        ├── sim_chr21.allseeds.YRI.diploshic.stats
        └── sim_chr21.allseeds.YRI.pi.png

and here is an example plot from that showing two replicates from the two seeds for the CHB sample

sim_chr21 allseeds CHB pi

currently this PR could probably use some polishing, but I won't have time for it in the next few days, so perhaps others can lay eyes on this?

andrewkern commented 1 year ago

tagging you all @nspope @mufernando @stsmall -- hoping to get some eyes on this PR for calculating summary stats using diploshic

stsmall commented 1 year ago

@andrewkern OK. I am running through this now. Think it would be useful to plot the selection (DFE) runs with the neutral (None) runs on a plot? To see whether there is a reduction in diversity

andrewkern commented 1 year ago

yeah agreed. that would be a good addition.

andrewkern commented 1 year ago

i also realized that this is computing "diploid" stats now and we should probably show "haploid" stats instead. i'm going to push a quick commit while you're looking it over

stsmall commented 1 year ago

I played around with plotting. Esp useful would be a boxplot of the exons and intron for the different DFEs. I think it would be an easier check on whether diversity was being reduced. These ended up being some big changes, mainly because I cat all the diploshic files together across seeds and DFEs. It may be better as a branch after the PR is accepted.

stsmall commented 1 year ago

Everything else looked OK. Hard to tell from a pi landscape what was really happening and why I think a box plot would be a good addition.

chriscrsmith commented 1 year ago

Related to what stsmall was saying, here's neutral and selected overlaid (chr 21 CHB I think): pi_plot.pdf

stsmall commented 1 year ago

if you have the code handy ... can you add the exon locations w/ rug

chriscrsmith commented 1 year ago

boxplot, chr 21 CHB bp.pdf

stsmall commented 1 year ago

@chriscrsmith NICE!!! I was expecting a more dramatic diff possibe that since the annot is exons, if it was split out by exons only that it would be diff. having looked at the code, that is not such as easy thing on the fly since you need to be linked into stdpopsim since the data tables dont have that info.

chriscrsmith commented 1 year ago

me too

More of a difference with thetaH and thetaW. No difference in num haplotypes

andrewkern commented 1 year ago

@chriscrsmith is that fig one replicate each? I've got summary stats for chr13 for 10 reps at this point IIRC

chriscrsmith commented 1 year ago

I ran with tiny_config which says reps=2. (I see the two reps :)) My plots showed just one rep, I'll average the two

Looking closely at the 100 chunks—most .stats files are empty. Only a few chunks have output. Does that sound right?

I see. So it breaks up the contig into a number of jobs less than 100, the extra chunks are just empty.

andrewkern commented 1 year ago

that's right. some of those files will be empty. at the end of the day you should get a file structure like:

results/summaries/OutOfAfrica_3G09/Gamma_H17/ensembl_havana_104_exons
├── 1358822686
├── 137773603
├── 1402032511
├── 1675701734
├── 170765738
├── 1845187043
├── 290876774
├── 561383554
├── 789925285
├── 878579711
├── sim_chr13.allseeds.CEU.diploshic.stats
├── sim_chr13.allseeds.CEU.pi.png
├── sim_chr13.allseeds.CHB.diploshic.stats
├── sim_chr13.allseeds.CHB.pi.png
├── sim_chr13.allseeds.YRI.diploshic.stats
└── sim_chr13.allseeds.YRI.pi.png

where the *.allseeds.*.stats files have all the collected stats across seeds and the *.png files have summary stat looks like http://poppy.uoregon.edu/~adkern/stdpopsim/results/summaries/OutOfAfrica_3G09/Gamma_H17/ensembl_havana_104_exons/sim_chr13.allseeds.CEU.pi.png

andrewkern commented 1 year ago

I just had to push a little addition to this as pandas didn't like having lots of seeds

chriscrsmith commented 1 year ago

averaged two reps. looking much better! (running more reps now) pi_plot.pdf bp.pdf

andrewkern commented 1 year ago

It would be great to get this merged this week. where is everyone with this PR?

chriscrsmith commented 1 year ago

I say merge!

nspope commented 1 year ago

Sorry @andrewkern, I'm slammed this week -- haven't had time for a look.

mufernando commented 1 year ago

I gave it a high level look and it looked OK to me. Since this PR is self contained I see no reason not to merge it whenever?

stsmall commented 1 year ago

As is, it works as expected. It is self contained and I didnt get any errors. It is not currently a rule included in all, were you going to add it? It does take a fairly long time to run. tldr; merge away

andrewkern commented 1 year ago

i was not going to add a rule to the main workflow, but we could later. merging this now then and we can keep refining.