Statistical inference using long IBD segments

New features are actively under construction (Fall 2024).

Integrating multiple testing correction in a user-friendly way
Replacing the applet filter-lines.jar, which is prone to corruption, with a Python script
Documenting simulation results for an upcoming paper

For troubleshooting, email sethtem@umich.edu or open a GitHub issue.

See misc/announcements.md for high-level updates on this repo.

See misc/fixes.md for any major bug fixes.

See misc/usage.md to evaluate if this methodology fits your study.

See misc/cluster-options.md for some suggested cluster options to use in pipelines.

See the closed issues on GitHub for some comments Seth left about the pipeline.

Citation

Please cite if you use this package.

Methods to model selection:

Temple, S.D., Waples, R.K., Browning, S.R. (2024). Modeling recent positive selection using identity-by-descent segments. The American Journal of Human Genetics. https://doi.org/10.1016/j.ajhg.2024.08.023.

Methods to simulate IBD segments and our central limit theorems:

Temple, S.D., Thompson, E.A. (2024). Identity-by-descent in large samples. Preprint at bioRxiv, 2024.06.05.597656. https://www.biorxiv.org/content/10.1101/2024.06.05.597656v1.

Multiple testing correction for selection scan:

Temple, S.D. (2024). Statistical Inference using Identity-by-Descent Segments: Perspectives on Recent Positive Selection. PhD thesis (University of Washington). https://www.proquest.com/docview/3105584569?sourcetype=Dissertations%20&%20Theses.

Methodology

Acronym: incomplete Selective sweep With Extended haplotypes Estimation Procedure

This software presents methods to study recent, strong positive selection.

By recent, we mean within the last 500 generations.
By strong, we mean a selection coefficient s >= 0.015 (1.5%).

The methods relate lengths of IBD segments to a coalescent model under selection.

We assume 1 selected allele at a locus.

Our methods are implemented automatically in a `snakemake` pipeline:

A genome-wide selection scan for anomalously large IBD rates
- With multiple testing correction
Inferring anomalously large IBD clusters
Ranking alleles based on evidence for selection
Computing a measure of cluster agglomeration (Gini impurity index)
Estimating frequency and location of unknown sweeping allele
Estimating a selection coefficient
Estimating a confidence interval
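
The Gini impurity index named in the list above can be illustrated with a minimal sketch. It assumes the standard definition, one minus the sum of squared cluster proportions. The function name `gini_impurity` and the example cluster sizes are hypothetical, and the package's own implementation may differ.

```python
from collections.abc import Sequence


def gini_impurity(cluster_sizes: Sequence[int]) -> float:
    """Standard Gini impurity: 1 - sum_i p_i^2, where p_i is the share of cluster i."""
    total = sum(cluster_sizes)
    if total <= 0:
        raise ValueError("cluster sizes must sum to a positive number")
    return 1.0 - sum((size / total) ** 2 for size in cluster_sizes)


if __name__ == "__main__":
    # Hypothetical cluster sizes: one dominant cluster gives a low impurity.
    print(gini_impurity([90, 5, 5]))
    # Evenly split clusters give a higher impurity.
    print(gini_impurity([25, 25, 25, 25]))
```

Under this definition, a value near 0 means that most haplotypes fall into a single cluster, and larger values mean that they are spread across many clusters.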

The input data are listed below. See misc/usage.md for details.

Whole genome sequences
- Probably more than 500 diploid samples
- Phased VCF data (genotypes such as 0|1)
- No apparent population structure
- No apparent close relatedness
- A genetic map (bp ---> cM)
- If not available, create genetic maps with a uniform rate
- Recombining diploid chromosomes
- Not extended to human X chromosome
Access to cluster computing
- For human-scale data, you should have at least 25 GB of RAM and 6 CPUs on a node.
- More memory and cores for TOPMed or UKBB-like sequence datasets
- Not extended to cloud computing
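
When no genetic map is available, a uniform-rate map could be built along the lines of the sketch below. The rate of 1 cM per Mb, the function name `uniform_map_lines`, and the four-column PLINK-style layout (chromosome, marker ID, cM, bp) are all assumptions for illustration. Check misc/usage.md for the layout the pipeline actually expects.

```python
def uniform_map_lines(chrom, positions_bp, cm_per_mb=1.0):
    """Build tab-separated map lines from base-pair positions at a constant rate.

    Assumed column order (PLINK-style): chromosome, marker ID, cM, bp.
    """
    lines = []
    for bp in positions_bp:
        # Convert base pairs to centiMorgans at the assumed uniform rate.
        cm = bp / 1_000_000 * cm_per_mb
        lines.append(f"{chrom}\t.\t{cm:.6f}\t{bp}")
    return lines


if __name__ == "__main__":
    # Hypothetical positions on chromosome 22.
    for line in uniform_map_lines("22", [1_000_000, 2_500_000]):
        print(line)
```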

The chromosome numbers in genetic maps should match the chromosome numbers in VCFs.

The genetic maps should be tab-separated.
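
The two requirements above can be checked before launching a pipeline. The sketch below is a hypothetical helper, not part of the package. It assumes that the chromosome name is the first column of each map line and that you already have the set of chromosome names used in your VCFs.

```python
def check_genetic_map(map_lines, vcf_chromosomes):
    """Return a list of problems found in genetic-map lines (empty if none)."""
    problems = []
    for number, line in enumerate(map_lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            problems.append(f"line {number}: not tab-separated")
            continue
        # Assumption: the chromosome name is the first column.
        if fields[0] not in vcf_chromosomes:
            problems.append(f"line {number}: chromosome {fields[0]!r} not in VCF")
    return problems


if __name__ == "__main__":
    # "chr22" in the map does not match "22" in the VCF, so it is reported.
    print(check_genetic_map(["chr22\t.\t1.0\t1000000"], {"22"}))
```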

Repository overview

This repository contains a Python package and some Snakemake bioinformatics pipelines.

The package ---> src/
The pipelines ---> workflow/

You should run each Snakemake pipeline from within its own workflow/some-pipeline/ directory.

You should activate the isweep environment (mamba activate isweep) before running analyses.

You should run the analyses using cluster jobs.

We have made README.md files in most subfolders.

Installation

See misc/installing-mamba.md to get a Python package manager.

Clone the repository

git clone https://github.com/sdtemple/isweep.git

Get the Python package

mamba env create -f isweep-environment.yml

mamba activate isweep

python -c 'import site; print(site.getsitepackages())'

Download software.
```
bash get-software.sh software 
```
- Puts the downloads in a folder called software/.
- Requires wget.
- For the simulation study, download SLiM yourself from https://messerlab.org/slim/ and put it in software/.
- You need to cite these software packages.

See workflow/other-methods/ folder for how we run methods we compare to.

Running the procedure:

This is the overall procedure. You will see more details for each step in workflow/some-pipeline/README.md files.

Pre-processing

Phase the data with Beagle or SHAPEIT beforehand. Subset the data to account for global ancestry and close relatedness.

Here is a pipeline we built for these purposes: https://github.com/sdtemple/flare-pipeline
You could use IBDkin to detect close relatedness: https://github.com/YingZhou001/IBDkin
You could use PCA, ADMIXTURE, or FLARE to determine global ancestry.

Main analysis

Make pointers to large (phased) vcf files.
Edit YAML files in the different workflow directories.
Run the selection scan (workflow/scan).
```
nohup snakemake -s Snakefile-scan.smk -c1 --cluster "[options]" --jobs X --configfile *.yaml & 
```
- See the file misc/cluster-options.md for support.
- Recommendation: do a test run with your 2 smallest chromosomes.
- Check the *.log files from ibd-ends. If they recommend an estimated err, change the error rate in the YAML file.
- Then, run with all your chromosomes.
Estimate recent effective sizes: workflow/scan/scripts/run-ibdne.sh.
Make the Manhattan plot: workflow/scan/scripts/manhattan.py.
Check out the roi.tsv file.
- Edit with locus names if you want.
- Edit to change defaults: additive model and 95% confidence intervals.

Run the region of interest analysis (workflow/roi).

nohup snakemake -s Snakefile-roi.smk -c1 --cluster "[options]" --jobs X --configfile *.yaml &

Picture of selection scan workflow

The flow chart below shows the steps ("rules") in the selection scan pipeline.

The diverging paths "mle" and "scan" refer to different detection thresholds (3.0 and 2.0 cM).
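
A detection threshold of this kind amounts to keeping only IBD segments of at least a given genetic length. The sketch below is illustrative only. The function name and the segment lengths are hypothetical, the inclusive comparison is an assumption, and no claim is made about which threshold belongs to which path.

```python
def keep_long_segments(lengths_cm, threshold_cm):
    """Keep segment lengths (in cM) that meet or exceed the threshold."""
    return [length for length in lengths_cm if length >= threshold_cm]


if __name__ == "__main__":
    segments = [1.5, 2.0, 2.7, 3.0, 4.2]  # hypothetical IBD segment lengths in cM
    print(keep_long_segments(segments, 2.0))
    print(keep_long_segments(segments, 3.0))
```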

See dag-roi.png for the steps in the sweep modeling pipeline.

Development things to do

Windowing in cM rather than bp
Replace filter-lines.jar with a python script
- Applet prone to corruption
Integrate multiple testing correction into pipeline
Extension to IBD mapping
