This repository contains scripts for the empirical analyses of the 1000 Genomes dataset associated with Beichman et al. (2017) (https://www.ncbi.nlm.nih.gov/pubmed/28893846).
These individuals are extracted from 1000 Genomes:
YRI: NA18505, NA18517, NA18916, NA18923, NA18877, NA18909, NA18858, NA18865, NA19116, NA19096
CEU: NA06984, NA06985, NA06986, NA06989, NA06994, NA07000, NA07037, NA07051, NA07056, NA07347
CHB: NA18525, NA18526, NA18528, NA18530, NA18531, NA18532, NA18533, NA18534, NA18535, NA18536
To subset the 1000G vcf for each of these populations:
./subset_YRI.sh
./subset_CEU.sh
./subset_CHB.sh
Note that the scripts are currently set up to run on the UCLA Hoffman HPC.
python generate_foldedSFS_fromVCF.py -h
usage: generate_foldedSFS_fromVCF.py [-h] --variant VARIANT --numAllele
NUMALLELE --pass_coordinates
PASS_COORDINATES --outfile OUTFILE
This script generates the count for a folded SFS from VCF
optional arguments:
-h, --help show this help message and exit
--variant VARIANT REQUIRED. Variant file. The format should be CHROM POS
ind1 ind2 etc. Should be tab delimit. Because of VCF
format, it is 1-based
--numAllele NUMALLELE
REQUIRED. Indicate the number of alleles, which is
equal to the number of individuals in your sample
times 2.
--pass_coordinates PASS_COORDINATES
REQUIRED. Input is the file that lists the coordinates
(1-based) that are annotated as P (pass) from the
masks file. This file is generated from the script
obtain_pass_positions.py. The format is a genomic
coordinate per line.
--outfile OUTFILE REQUIRED. Name of the output file
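To illustrate what a folded SFS count means, here is a minimal sketch. It assumes each variant has already been parsed into a list of genotype strings such as 0|1. The function name folded_sfs and the in-memory input are hypothetical and are not taken from generate_foldedSFS_fromVCF.py.

```python
from collections import Counter


def folded_sfs(genotype_rows, num_allele):
    """Count sites by minor allele count (folded SFS).

    genotype_rows: iterable of lists of genotype strings like "0|1"
                   (assumed layout, one list per variant site).
    num_allele:    number of sampled chromosomes (2 * number of individuals).
    Returns a list where index k is the number of sites whose minor allele
    count is k, for k = 0 .. num_allele // 2 (index 0 is always 0).
    """
    counts = Counter()
    for row in genotype_rows:
        # Count copies of the alternate allele across all individuals.
        alt = sum(g.replace("/", "|").split("|").count("1") for g in row)
        # Folding: the minor allele is whichever allele is rarer.
        minor = min(alt, num_allele - alt)
        if minor > 0:  # monomorphic sites do not enter the SFS
            counts[minor] += 1
    return [counts.get(k, 0) for k in range(num_allele // 2 + 1)]
```

With two individuals (numAllele = 4), sites carrying 1 and 3 copies of the alternate allele both fall in the minor-allele-count-1 bin.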
NOTE: when selecting human diversity, one has to choose either CEU, YRI, or CHB. The neutral regions will likely differ depending on which population is chosen. Should we therefore use a consensus set of neutral regions for all three populations?
From the directory 1000G_Summary_Stats/data/10kb_neutral_regions, do:
for i in {1..22}; do python ../../scripts/calc_neutralregion_length.py chr${i}_output_from_nre.txt > chr${i}_output_from_nre_clean.txt; done
From the directory 1000G_Summary_Stats/data/10kb_neutral_regions, do:
for i in {1..22}; do python ../../scripts/generate_Xkb_neutralRegions.py --input chr${i}_output_from_nre_clean.txt --length 10000 > chr${i}_10kb_neutral_region.txt; done
qsub wrapper_subsetVCF.basedOnPositions_YRI_afterRmHom.sh
qsub wrapper_subsetVCF.basedOnPositions_CEU_afterRmHom.sh
qsub wrapper_subsetVCF.basedOnPositions_CHB_afterRmHom.sh
qsub wrapper_subsetVCF.basedOnPositions_YRI_for10kbNeutral.sh
qsub wrapper_subsetVCF.basedOnPositions_CEU_for10kbNeutral.sh
qsub wrapper_subsetVCF.basedOnPositions_CHB_for10kbNeutral.sh
./wrapper_generate.foldedSFS.fromVCF_10kbNeutral_YRI.sh
./wrapper_generate.foldedSFS.fromVCF_10kbNeutral_CEU.sh
./wrapper_generate.foldedSFS.fromVCF_10kbNeutral_CHB.sh
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes
for chrNum in {1..22}; do grep -w "chr${chrNum}" hg19.chrom.sizes > chr${chrNum}.g; done
./makeWindows.sh 100000 100kb
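As a rough picture of what nonoverlapping windows of a fixed size look like, here is a small sketch. The function make_windows is hypothetical, and the 1-based inclusive coordinates are an assumption; makeWindows.sh may use a different tool and a different coordinate convention.

```python
def make_windows(chrom_length, window_size):
    """Return nonoverlapping (start, end) windows covering a chromosome.

    Coordinates are 1-based and inclusive (an assumption for this sketch).
    The last window is truncated at the chromosome end.
    """
    windows = []
    start = 1
    while start <= chrom_length:
        end = min(start + window_size - 1, chrom_length)
        windows.append((start, end))
        start = end + 1
    return windows
```

For example, make_windows(250, 100) gives three windows, the last one shorter than the others.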
This script tabulates the number of callable sites for each nonoverlapping window.
Input 1: a list where each item is a tuple of the form (start, end).
Input 2: a set where each item is a callable position (1-based).
Return: a dictionary where each key is a window of the form (start, end) and each value is the count of callable sites in that window.
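The input and output described above can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes window coordinates are 1-based and inclusive on both ends.

```python
from bisect import bisect_left, bisect_right


def count_callable_per_window(windows, callable_set):
    """Map each (start, end) window to its number of callable positions.

    windows:      list of (start, end) tuples, assumed 1-based inclusive.
    callable_set: set of callable positions (1-based).
    """
    # Sort once so each window is answered with two binary searches.
    positions = sorted(callable_set)
    counts = {}
    for start, end in windows:
        lo = bisect_left(positions, start)   # first position >= start
        hi = bisect_right(positions, end)    # first position > end
        counts[(start, end)] = hi - lo
    return counts
```

Sorting the callable positions up front avoids scanning the whole set for every window.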
This script cleans the VCF after subsetting. Specifically, it (1) removes any site where the genotype of every subsetted individual is 0|0, because vcf-subset does not do this automatically; (2) keeps only biallelic sites, in other words removes sites with genotypes such as 1|2, 2|1, and 2|2; and (3) removes any site that is not callable, meaning it is not annotated with a P in the mask file.
Input 1: a list. Each item in this list is a list where the first item is the genomic position (1-based).
Input 2: a set where each item is a callable position (1-based).
Return: a list in the same format as Input 1, but containing only the variants that pass the three filters.
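A minimal sketch of the three filters is shown here. It assumes that the items after the position in each variant list are phased genotype strings such as 0|1; that layout and the function name clean_variants are assumptions made for illustration.

```python
def clean_variants(variants, callable_set):
    """Keep variants that are polymorphic, biallelic, and callable.

    variants:     list of lists; item 0 is the 1-based position and the
                  remaining items are genotype strings (assumed layout).
    callable_set: set of callable positions (1-based).
    """
    cleaned = []
    for variant in variants:
        position, genotypes = variant[0], variant[1:]
        alleles = {allele for g in genotypes for allele in g.split("|")}
        if alleles == {"0"}:
            continue  # (1) every individual is 0|0
        if not alleles <= {"0", "1"}:
            continue  # (2) an allele other than 0 or 1 is present
        if position not in callable_set:
            continue  # (3) position is not marked P in the mask
        cleaned.append(variant)
    return cleaned
```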
This script computes the allele frequency for each variant.
This script computes pairwise pi.
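For reference, a common way to go from allele counts to allele frequency and pairwise pi is sketched here. The function names, the sample-size correction n/(n-1), and the normalization by the number of callable sites are assumptions for illustration; the repository scripts may differ in detail.

```python
def alt_allele_frequency(genotypes, num_allele):
    """Frequency of the alternate allele among num_allele chromosomes."""
    alt_count = sum(g.split("|").count("1") for g in genotypes)
    return alt_count / num_allele


def pairwise_pi(alt_counts, num_allele, num_callable_sites):
    """Average pairwise differences per callable site.

    alt_counts:         alternate allele count at each variant site.
    num_allele:         number of sampled chromosomes (n).
    num_callable_sites: number of callable sites in the region.
    """
    n = num_allele
    total = 0.0
    for count in alt_counts:
        p = count / n
        # Expected heterozygosity with the n / (n - 1) sample correction.
        total += 2.0 * p * (1.0 - p) * n / (n - 1)
    return total / num_callable_sites
```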
python main.py -h
python main.py --windows /path/to/window/file --callableSet /path/to/callableSet --variants /path/to/variants --numAllele int --outfile /path/to/outfile
./rmHomozygous_from_subsetVCF.sh
./rmSingletons_from_subsetVCF.sh
vcftools --vcf --hap-r2 --ld-window-bp 100000
./vcftools_ld_YRI.sh
./vcftools_ld_CEU.sh
./vcftools_ld_CHB.sh
Remove nan values from the VCFtools LD outputs:
./processVCFLDoutputs.sh
To compute rsquared in bins:
python estimateLDdecay.py -h
usage: estimateLDdecay.py [-h] --input INPUT --format FORMAT --bin BIN --outfile OUTFILE
This script estimates LD decay in bins. Bins can be specified by user
optional arguments:
-h, --help show this help message and exit
--input INPUT REQUIRED. Input file. This is usually output from plink or vcftools.
--format FORMAT REQUIRED. Enter plink or vcftools. Specify which file format
--bin BIN REQUIRED. Specify the number of bins
--outfile OUTFILE REQUIRED. Name of output file.
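The idea of estimating LD decay in bins can be sketched as below. This is not the code in estimateLDdecay.py: the function name, the equal-width bins, and the max_distance parameter are assumptions made for illustration.

```python
def mean_r2_by_bin(pairs, num_bins, max_distance):
    """Average r-squared in equal-width distance bins.

    pairs:        iterable of (distance_bp, r2) tuples, one per SNP pair.
    num_bins:     number of bins between 0 and max_distance.
    max_distance: largest distance considered; pairs outside are skipped.
    Returns a list of mean r2 per bin (None for empty bins).
    """
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    width = max_distance / num_bins
    for distance, r2 in pairs:
        if distance < 0 or distance > max_distance:
            continue
        # A pair exactly at max_distance goes into the last bin.
        index = min(int(distance / width), num_bins - 1)
        sums[index] += r2
        counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

With --ld-window-bp 100000, a natural choice for max_distance in this sketch would be 100000.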
./wrapper_estimateLDdecay_YRI.sh
./wrapper_estimateLDdecay_CEU.sh
./wrapper_estimateLDdecay_CHB.sh
python tabulateMeanLD.py
./vcftools_ld_YRI_geno.sh
qsub vcftools_ld_CEU_geno.sh
qsub vcftools_ld_CHB_geno.sh
/u/home/p/phung428/tanya_data_storage/1000G_Summary_Stats/data/decode_genetic_map
wget https://www.decode.com/additional/female_noncarrier.gmap
wget https://www.decode.com/additional/male_noncarrier.gmap
Because the files are not tab-delimited, I need to convert them to tab-delimited format first:
awk '{print$1"\t"$2"\t"$3"\t"$4}' female_noncarrier.gmap > female_noncarrier.gmap_tab
awk '{print$1"\t"$2"\t"$3"\t"$4}' male_noncarrier.gmap > male_noncarrier.gmap_tab
Partition into separate chromosomes:
for i in {1..22}; do
grep -w chr${i} female_noncarrier.gmap_tab > chr${i}_female_noncarrier.gmap
grep -w chr${i} male_noncarrier.gmap_tab > chr${i}_male_noncarrier.gmap
done;
Compute average genetic map:
./wrapper_average.genetic.map.sh
Script: interpolate_genetic_distance.py. The wrapper script to run it across the 22 chromosomes is wrapper_interpolate_genetic_distance_10kbneutral.sh
qsub wrapper_interpolate_genetic_distance_10kbneutral.sh
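Interpolating a genetic position from a genetic map is usually done linearly between the two flanking map markers. The sketch below shows that approach under stated assumptions: the function name is hypothetical, the map is given as two parallel lists with strictly increasing physical positions, and positions outside the map are clamped to the nearest end. interpolate_genetic_distance.py may handle these cases differently.

```python
from bisect import bisect_right


def interpolate_genetic_position(position, map_positions, map_cm):
    """Linearly interpolate the genetic position (cM) of a physical position.

    map_positions: physical positions of map markers, strictly increasing.
    map_cm:        genetic positions (cM) of the same markers.
    """
    # Clamp positions that fall outside the map (an assumption).
    if position <= map_positions[0]:
        return map_cm[0]
    if position >= map_positions[-1]:
        return map_cm[-1]
    right = bisect_right(map_positions, position)  # first marker > position
    left = right - 1                               # last marker <= position
    x0, x1 = map_positions[left], map_positions[right]
    y0, y1 = map_cm[left], map_cm[right]
    return y0 + (y1 - y0) * (position - x0) / (x1 - x0)
```

The genetic distance between two sites is then the difference between their interpolated positions.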
Script: compute_rec_10kb_neutral_region.py. The wrapper script to run it across the 22 chromosomes is wrapper_compute_rec_10kb_neutral_region.sh
qsub wrapper_compute_rec_10kb_neutral_region.sh
In the directory /u/home/p/phung428/tanya_data_storage/1000G_Summary_Stats/results/LD_geno/VCFtools_out_rm.nan/YRI, make some new directories:
for i in {1..22}; do mkdir chr${i}_split; done;
split -l 10000000 chr1_LD_geno.geno.ld_rm.nan
Script: interpolate_genetic_distance_LD.py. The wrapper script to run it across the 22 chromosomes is wrapper_interpolate.genetic.distance.LD.sh
qsub wrapper_interpolate.genetic.distance.LD.sh
Since I ran each chunk of each chromosome separately, I wrote a few wrapper scripts so that the chunks can be run in parallel. All of the wrapper scripts are stored in the directory wrapper_convert.physical.to.genetic
After that, I ran the Python script estimateLDdecay_genetic.py to bin the results. Specifically, I ran the wrapper script:
qsub wrapper_estimateLDdecay_genetic.sh
Then, tabulate across 22 chromosomes:
python tabulateMeanLD_geno_genetic.py > /u/home/p/phung428/tanya_data_storage/1000G_Summary_Stats/results/LD_geno/bins_genetic/YRI/allChr_LD_genetic_bins