Previously, analysis of simulated data and human data (from the 1000 Genomes Project) was performed with an SCons pipeline and Jupyter notebooks, and published in the mushi docs. This branch replaces that with scalable Nextflow pipelines intended for use on an SGE cluster. There are separate pipelines for simulated data and human data located in the pipelines directory. Each contains a README, a Nextflow script, and a Jupyter notebook that generates plots from the pipeline results (the notebooks contain committed output for viewing on GitHub).
Both the simulation and the 1KG pipelines introduce extensive exploration of model selection via hyperparameter sweeps.
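For orientation, a minimal Python sketch of the kind of sweep the pipelines run is below; the `fit` callable, the `alpha`/`beta` grid names, and the loss-based ranking are illustrative placeholders, not mushi's actual interface or the pipelines' selection criterion.

```python
# Minimal sketch of a regularization sweep for model selection, assuming a
# hypothetical fit(sfs, alpha, beta) callable returning (history, loss).
import itertools

import numpy as np


def sweep(sfs: np.ndarray, alphas, betas, fit):
    """Fit a demographic model at every (alpha, beta) pair; rank results by loss."""
    results = []
    for alpha, beta in itertools.product(alphas, betas):
        history, loss = fit(sfs, alpha=alpha, beta=beta)
        results.append(dict(alpha=alpha, beta=beta, loss=loss, history=history))
    return sorted(results, key=lambda r: r["loss"])


# Example usage with a dummy fit function standing in for the real inference step:
if __name__ == "__main__":
    dummy_fit = lambda sfs, alpha, beta: (None, alpha + beta)
    sfs = np.ones(99)
    grid = np.logspace(-1, 3, 5)
    best = sweep(sfs, alphas=grid, betas=grid, fit=dummy_fit)[0]
    print(best["alpha"], best["beta"], best["loss"])
```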
Both assess demographic inference from the folded SFS (implemented by @apragsdale) as well as the unfolded SFS; a small sketch of folding follows below.
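For reference, folding collapses each derived-allele-count class with its complementary class so that only minor-allele counts remain. A minimal Python sketch (the function name and array conventions are assumptions, not the pipelines' code):

```python
import numpy as np


def fold_sfs(sfs: np.ndarray) -> np.ndarray:
    """Fold an unfolded SFS: sfs[i] counts sites with derived-allele count i + 1
    in a sample of n haplotypes, so len(sfs) == n - 1."""
    n = len(sfs) + 1  # number of sampled haplotypes
    folded = sfs[: n // 2].astype(float).copy()
    for i in range(n // 2):
        j = n - (i + 1) - 1  # index of the complementary class with count n - (i + 1)
        if j != i:  # when n is even, the middle class pairs with itself
            folded[i] += sfs[j]
    return folded


# e.g. for n = 4 haplotypes: fold_sfs([xi_1, xi_2, xi_3]) == [xi_1 + xi_3, xi_2]
print(fold_sfs(np.array([10, 5, 2])))  # -> [12.  5.]
```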
This branch is named "30x1KG" because it initially attempted to use the preliminary high-coverage (30x) 1000 Genomes (1KG) call sets from the New York Genome Center. The SFS we computed from these call sets contained artifacts (depleted singletons, an excessive high-frequency smile), so the better path seemed to be to run on the new hg38 call set of the low-coverage data and wait for a future integrated call set of the high-coverage data.
Notes
The docs have not been rebuilt, so notebook links in the preprint will still work after merge. This is intended to be temporary.
The core mushi code has no breaking changes or upgrades, so a package version bump is not required.
This needn't be merged immediately, since there are a few changes in the works that would make sense to merge into this branch first.