simonhmartin / genomics_general

General tools for genomic analyses.
343 stars 93 forks source link

This is a collection of scripts for a range of genomic data processing and analysis.

Below are notes about some useful tools. Not everything is documented yet, but most scripts have some help information if you type python script.py -h

Contents


Installation and dependencies

There is no system installation required. Just download this entire repository using the green "Code" button at the top of this page, or with the bash command git clone https://github.com/simonhmartin/genomics_general.git. If you prefer to move scripts to the directory where you run them, bear in mind that many of these scripts require the script genomics.py to be present in the same directory (or on your PYTHONPATH). The easiest is just to leave them in the main directory and specify the full path to the script when running it from elsewhere.

The only dependency is numpy.

Most of the scripts now run in python 3. Some are still written for puthon 2, but those will be updated soon.


Processing VCF files

Most of my scripts use a processed .vcf format that I call .geno. This looks something like this:

#CHROM      POS      ind1      ind2      ind3
scaffold1   1        A/A       A/G       G|A
scaffold1   1        N/N       T/T       T|C

Missing data is denoted as N, and phased and unphased genotypes are shown conventionally with | and /.

The script parseVCF.py in the VCF_processing directory, will convert vcf to this format. It has various options for filtering based on read depth, genotype quality or any other flag in the FORMAT column of the vcf.

Example command

python VCF_processing/parseVCF.py -i input.vcf.gz --skipIndels --minQual 30 --gtf flag=DP min=5 max=50 -o output.geno.gz

You can read more about this script in the VCF_processing directory.


Filtering genotype files prior to further analysis

If you vcf file was not already filtered, or you would like to filter further. The script filtergenotypes.py has many options for filtering. Some examples include:

It requires the script genomics.py to be present in the same directory, or in your Python path.

Example command

python filterGenotypes.py --threads 4 -i input.geno.gz -o output.geno.gz --minAlleles 2 --minCalls 10 --thinDist 1000

Notes

python filterGenotypes.py -h will print a full list of command options.

By default, this script assumes that the .geno input file is is encoded as diploid genotypes with a phase operator (/ or |):

scaffold1  1        A/A       G/G       G|A

You can specify a different input formats using the -if, but this is not recommended.

You can also specify various putput formats using -of.

Output format Description Example
phased (default) Alleles separates by a phase operator. This doesn't mean the phase is known, just that it is indicated A/A G/G G\|A
diplo For diploids only. Genotypes are single bases denoting the diploid genotype, using ambiguity codes for heterozygotes A G R
alleles as above but without the phase operator AA GG GA
randomAllele Randomly pick one allele per individual A G A
coded Coded numerically as in the VCF 0/0 1/1 1\|0
bases Separate the alleles for each individual into different columns and also give different headers for each A A G G G A
counts Count of the minor allele in each individual 0 2 1

Diversity and divergence analyses in sliding windows

The script popgenWindows.py computes some standard population genomic statistics in sliding windows: pi, FST and DXY. It requires the script genomics.py to be present in the same directory, or in your Python path.

Example command

python popgenWindows.py -w 50000 -m 5000 -g input.geno.gz -o output.csv.gz -f phased -T 5 -p popA A1,A2,A3,A4 -p popB B1,B2,B3,B4,B6,B6 -p popC -p popD --popsFile pops.txt 

Notes

python popgenWindows.py -h Will print a full list of command arguments.

Wndow Type Description
coordinate Windows will cover a fixed range in the genome, which is defined as the window size. If there is missing data, this can lead to variable numbers of sites used for each window.
sites Each window will have the same number of sites. If there is missing data, this can lead to different absolute sizes for windows in terms of genome coordinates.
predefined This will analyse predefined windows provided using the --windCoords flag.
C1  popC
C2  popC
C3  popC
C4  popC
D1  popD
D2  popD
D3  popD
D4  popD

Allele frequencies per site

The script freq.py computes allele frequencies at each site, either for each base, the minor allele, or the derived allele (if an outgroup is provided). It requires the script genomics.py to be present in the same directory, or in your Python path.

Example command

python freq.py -g input.geno.gz -p pop1 -p pop2 --popsFile populations.txt --target derived --threads 10

Notes

Wndow Type Description
--target minor Export the frequency of the minor (rarer) allele at each site (based on all included individuals). This can be at most 0.5 for the whole dataset, but can exceed 0.5 in certain populations
--target derived Export the frequency of the derived allele. NOTE: this assumes that the last population listed with -p is the outgroup.

Site frequency spectrum

The script sfs.py computes site frequency spectrum (SFS, also called the allele frequency spectrum) from input variants. It requires the script genomics.py to be present in the same directory, or in your Python path.

The input for sfs.py is produced by freq.py (see above). These two scripts can be combined with a pipe.

Example commands

#1D folded (derived allele) sfs
python freq.py -g input.geno.gz -f phased --threads 10 -p pop1 -p pop2 -p outgroup | \
python sfs.py --inputType baseCounts -p pop1 -p pop2 -p outgroup \
--FSpops pop1 pop2 --polarized --subsample 10 10 --pref mydata. --suff .subsample10.sfs

#2D unfolded (minor allele) SFS
python freq.py -g input.geno.gz -f phased --threads 10 -p pop1 -p pop2 | \
python sfs.py --inputType baseCounts -p pop1 -p pop2 \
--FSpops pop1 pop2 --doPairs --subsample 10 10 --pref mydata. --suff .subsample10.sfs

Notes


Distance matrix

The script distMat.py computes a distance matrix among all pairs of individuals. This can be computed either for the entire input file or in windows, as in the popgenWindows script above. This works for samples of any ploidy or mix of ploidies. For ploidy > 1, the pairwise diatance will be the average diatance among all haplotypes in the two individuals.

Example Command

python distMat.py -g input.geno.gz -f phased --windType cat -o output.dist

python distMat.py -h will print a list of command options.

Notes


ABBA-BABA statistics in sliding windows

The script ABBABABAwindows.py performs analyses described in Martin et al. 2015, MBE, compurting the D statistic and f estimators in windows across the genome. Like the script above, it requires genomics.py.

Example command

python ABBABABAwindows.py -g /zoo/disk1/shm45/vcf/set62/set62.chr21.DP5GQ30.AN100MAC1.diplo.gz -f phased -o output.csv -w 100000 -m 100 -s 100000 -P1 A -P2 B -P3 C -O D -T 10 --minData 0.5 --popsFile pops.txt --writeFailedWindows --polarize &

python ABBABABAwindows.py -h Will print a full list of command arguments.

Notes

Output

Column Header Description
scaffold The scaffold the window is on (all windows are on a single scaffold)
start window start position (inclusive)
end window end position (NOTE, this can exceed the length of the scaffold)
sites Number of genotypes sites in the input file in each window
sitesUsed number of sites used to compute statistics (biallelic SNPs)
ABBA Pseudo count of ABBA sites (including polymorphic sites) (See Martin et al. 2015 Equation 2)
BABA Pseudo count of BABA sites (including polymorphic sites) (See Martin et al. 2015 Equation 3)
D D statistic (see Durand et al. 2011 Equation 2)
fd fd admixture estimation (See Martin et al. 2015 Equation 6)
fdM Malinsky's modified statistic, fdM to accomodate admixture between either P1 and P3 or P2 and P3 (See Malinsky et al. 2015 Supplementart Material Page 8)

Trees for sliding windows

Two scripts in the phylo/ directory will make trees in sliding windows: phymlWindows.py and raxmlWindows.py. As the names suggest they use Phyml and RAxML, respectively.

Example command

python phyml_sliding_windows.py -T 10 -g input.phased.geno.gz --prefix output.phyml_bionj.w50 -w 50 --windType sites --model GTR 

python phymlWindows.py -h Will print a full list of command arguments.

Notes


Classify coding sites

Given a genome annotation file, the script codingSiteTypes.py classifies each site within a codon according to its codon position, synonymous/non-synonymopus (if it's a variant) and degeneracy.

Example Command

python codingSiteTypes.py -a annotation.gff3.gz -f gff -r reference.fasta.gz -v variants.vcf.gz --ignoreConflicts | bgzip > output.coding_site_types.tsv.gz

python distMat.py -h will print a list of command options.

Notes