harrispopgen / mushi

[mu]tation [s]pectrum [h]istory [i]nference
https://harrispopgen.github.io/mushi/
MIT License

pipeline overhaul #66

Closed wsdewitt closed 3 years ago

wsdewitt commented 3 years ago

Summary

Previously, analysis of simulated data and human data (from the 1000 Genomes Project) was performed with an SCons pipeline and Jupyter notebooks, and published in the mushi docs. This branch replaces that with scalable Nextflow pipelines intended for use on an SGE cluster. There are separate pipelines for simulated data and human data, located in the pipelines directory. Each contains a README, a Nextflow script, and a Jupyter notebook to generate plots from the pipeline results (the notebooks contain committed output, for viewing on GitHub).

Both the simulation and the 1KG pipelines introduce extensive exploration of model selection via hyperparameter sweeps. They both assess demographic inference from the folded SFS (implemented by @apragsdale), as well as from the unfolded SFS.
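For reviewers less familiar with the folded/unfolded distinction, here is a minimal sketch of what folding does to an SFS. It is illustrative only and is not the implementation in this branch: the function name `fold_sfs` and the plain-list representation are assumptions made for the example. The unfolded SFS counts sites by derived allele count 1..n-1; folding sums each count i with its complement n-i, so sites are counted by minor allele count and no ancestral state is needed.

```python
def fold_sfs(sfs):
    """Fold an unfolded SFS into a minor-allele-count SFS.

    Hypothetical helper for illustration, not part of mushi.
    `sfs[k]` is the number of sites with derived allele count k + 1,
    so a sample of n haplotypes gives a list of length n - 1.
    """
    n = len(sfs) + 1          # number of sampled haplotypes
    half = n // 2             # largest possible minor allele count
    folded = []
    for j in range(1, half + 1):
        count = sfs[j - 1]
        # Add the complementary class n - j, unless it is the same class
        # (this happens only when n is even and j == n / 2).
        if n - j != j:
            count += sfs[n - j - 1]
        folded.append(count)
    return folded


if __name__ == "__main__":
    unfolded = [10, 5, 3, 2, 1]   # toy counts for n = 6 haplotypes
    print(fold_sfs(unfolded))
```

Folding preserves the total number of segregating sites, which makes a convenient sanity check when comparing the two kinds of input.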

This branch is named "30x1KG" because it initially attempted to use the preliminary high coverage (30x) 1000 Genomes (1KG) call sets from NY Genome Center. The SFS we computed from these call sets contained artifacts (depleted singletons and an excessive high-frequency "smile"), so it seemed better to run on the new hg38 call set of the low coverage data, and to wait for a future integrated call set for the high coverage data.

Notes

wsdewitt commented 3 years ago

I think it makes sense to merge this now, and open an issue to remind us to rebuild the docs when appropriate.