Impute allele frequencies to reduce sparsity of genotype data from polyploids, pooled individuals, and populations.
Build Status | License |
---|---|
chmod +x imputef
./imputef
imputef-x86_64-windows.exe -h
.git clone https://jeffersonfparil:<API_KEY>@github.com/jeffersonfparil/imputef.git main
cd imputef/
conda env create --file res/rustenv.yml
conda activate rustenv
cargo build --release
target/release/imputef -h
# Option 1: alias
echo alias imputef="$(pwd)/target/release/imputef" >> ~/.bashrc
source ~/.bashrc
# Option 2: symlink
sudo ln -s $(pwd)/target/release/imputef /usr/bin/imputef
# Option 3: add to $PATH (Note that you need to place this in your ~/.bashrc or the appropriate shell initialisation file)
export PATH=${PATH}:$(pwd)/target/release
### Check
type -a imputef
cd ~; imputef -h; cd -
imputef -h
imputef -f tests/test.tsv # allele frequency table as input (tab-delimited)
imputef -f tests/test.csv # allele frequency table as input (comma-separated)
imputef -f tests/test.ssv # allele frequency table as input (semi-colon-separated)
imputef -f tests/test.sync # synchronised pileup file as input
imputef -f tests/test.vcf # variant call format as input without missing data
imputef -f tests/test_2.vcf # variant call format as input
imputef -f tests/test_2.vcf --method mean # use mean value imputation
imputef -f tests/test_2.vcf --min-loci-corr=0.75 --max-pool-dist=0.25 # define some minimum loci correlation and maximum genetic distance thresholds
Argument or flag | Description |
---|---|
--fname | Filename of the genotype file to be imputed in uncompressed vcf, sync, or allele frequency table. Details on these genotype formats are available below. |
--method | Imputation method. Use "mean" for mean value imputation or "aldknni" for adaptive LD-kNN imputation. [Default = "aldknni"] |
--min-coverage | Minimum coverage per locus, i.e. if at a locus, a pool falls below this value (does not skip missing data, i.e. missing locus has a depth of zero), then the whole locus is omitted. Set this to zero if the vcf has been filtered and contains missing values, i.e. ./. or .|. . [Default = 0] |
--min-allele-frequency | Minimum allele frequency per locus, i.e. if at a locus, a pool has all its alleles below this value and/or above the additive complement of this value (skipping missing data), then the entire locus is omitted. [Default = 0.0001] |
--max-missingness-rate-per-locus | Maximum fraction of pools missing per locus, i.e. if at a locus, there were more pools missing than the coverage dictated by this threshold, then the locus is omitted. [Default = 1.00] |
--pool-sizes | Vector of pool sizes, i.e. the number of individuals included in each pool. Enter the pool sizes separated by commas , . This can also be set to a single arbitrarily large value for example 100 for individual polyploids or if allele frequency estimates are expected to be accurate. [Default = 100.0] |
--min-depth-below-which-are-missing | Minimum depth at which loci with depth below this threshold are set to missing. Set to one if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to an integer above zero. [Default = 1.00] |
--max-depth-above-which-are-missing | Maximum depth at which loci with depth above this threshold are set to missing. Set to some large arbitrarily large value (e.g. 1000000) if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to an integer above zero. [Default = 1000000.0] |
--frac-top-missing-pools | Fraction of pools with the highest number of missing loci to be omitted. Set to zero if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to a decimal number between zero and one. [Default = 0.0] |
--frac-top-missing-loci | Fraction of loci with the highest number of pools with missing data to be omitted. Set to zero if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to an decimal number between zero and one. [Default = 0.0] |
--min-loci-corr | Minimum correlation (Pearson's correlation) between the locus requiring imputation and other loci deemed to be in linkage with it. Ranges from 0.0 to 1.0, but use -1.0 or any negative number to perform per locus optimisations to find the best value minimising imputation. [Default = 0.9] |
--max-pool-dist | Maximum genetic distance (mean absolute difference in allele frequencies) between the pool or sample requiring imputation and pools or samples deemed to be the closest neighbours. Ranges from 0.0 to 1.0, but use -1.0 or any negative number to perform per locus optimisations to find the best value minimising imputation. [Default = 0.1] |
--min-l-loci | Minimum number of linked loci to be used in estimating genetic distances between the pool or sample requiring imputation and other pools or samples (minimum value of 1). This argument overrides --min-loci-corr , i.e. the minimum number of loci will be met regardless of the minimum loci correlation threshold. [Default = 20] |
--min-k-neighbours | Minimum number of k-nearest neighbours of the pool or sample requiring imputation (minimum value of 1). This argument overrides --max-pool-dist , i.e. the minimum number of k-nearest neighbours will be met regardless of the maximum genetic distance threshold. [Default = 5] |
--restrict-linked-loci-per-chromosome | Restrict the choice of linked loci to within the chromosome the locus requiring imputation belongs to? [default: false] [Default = false; i.e. no flag] |
--n-reps | Number of replications for the estimation of imputation accuracy in terms of mean absolute error (MAE). It is used to define the number of random non-missing samples to use as replicates for the estimation of MAE and optimisation (minimum value of 1). [Default = 10] |
--n-threads | Number of computing threads or processor cores to use in the computations. [Default = 2] |
--fname-out-prefix | Prefix of the output files including the imputed allele frequency table (<fname-out-prefix>-<time>-<random-id>-IMPUTED.tsv ). [Default = ""; which corresponds to the name of the input genotype file] |
Header line and comments should be prepended by '#'.
AD
field (allele depth), and may or may not have genotypes called (e.g. generated via bctools mpileup -a AD,DP ...). If the GT
field is present but the AD
field is absent, then each sample is assumed to be an individual diploid, i.e., neither a polyploid nor a pool.AD
; i.e. the unfiltered allele depth which includes the reads which did not pass the variant caller filters) is used to calculate allele frequencies.GT
field is present but the AD
field is absent, then each sample is assumed to be an individual diploid, i.e., neither a polyploid nor a pool. If both GT
and AD
fields are present, then the AD
field takes priority.tests/test.vcf
for an example.#chr\tpos\tref\t<id_1>\t<id_2>\t<id_3>\t<id_4>\t<id_5>
tests/test.sync
for an example.#chr\tpos\tallele\t<pool_name_1>\t...\t<pool_name_n>
tests/test.tsv
for an example.Imputation of allele frequency from polyploids and pooled samples via (1) mean value imputation, and (2) linkage disequillibrium (LD) k-nearest neighbour (kNN)-based weighted allele frequency prediction. The latter is an extension of the LD-kNNi method of Money et al, 2015, i.e. LinkImpute, which was an extension of the kNN imputation of Troyanskaya et al, 2001. Similar to LD-kNNi, LD is estimated using Pearson's product moment correlation per pair of loci. Mean absolute difference in allele frequencies is used to define the genetic distance between samples, instead of taxicab or Manhattan distance used in LD-kNNi. Four parameters can be set by the user:
The first two parameters (minimum loci correlation and maximum genetic distance thresholds) can be optimised per locus by setting --min-loci-corr=-1.0
and/or --max-pool-dist=-1.0
.
--method="mean"
: mean value imputationThis imputation uses the arithmetic mean of the observed allele frequencies across all samples where the locus was genotyped:
$$ \hat q{r,j} = { {1 \over (n-m)} { \sum{i \ne r}^{n} q_{i,j} } } $$
where:
--method="aldknni"
: adaptive linkage disequilibrium (LD)-based k-nearest neighbour imputation of genotype dataThis is an extension of the LD-kNNi method of Money et al, 2015, i.e. LinkImpute, which was an extension of the kNN imputation of Troyanskaya et al, 2001. Similar to LD-kNNi, linkage disequilibrium (LD) is estimated using Pearson's product moment correlation per pair of loci, which is computed per chromosome by default, but can be computed across the entire genome. We use the mean absolute difference/error (MAE) between allele frequencies among linked loci as an estimate of genetic distance between samples. Fixed values for the minimum correlation threshold to identify loci used in distance estimation, and maximum genetic distance threshold to select the k-nearest neighbours can be defined. Additionally, minimum number of loci to include in distance estimation, and minimum number of nearest neighbours can be set. Moreover, the minimum correlation and maximum genetic distance can be optimised per locus by minimising the MAE between predicted and expected allele frequencies using simulated missing or masked data. The number of masked data is controlled by the number of replications (--n-reps
) and the total number of samples non-missing at the locus requiring imputation.
The imputed allele frequency is computed as:
$$ \hat q{r,j} = { \sum{i \ne r}^{k} q{i,j} (1 - \delta{i,r}) } $$
with:
$$ \delta{i,r} = { {1 \over \sum d{i,r}} d_{i,r} } $$
and
$$ d{i,r} = { {1 \over c} { \sum{j=1}^{c} |q{i,j} - q{r,j}| } } $$
where:
The variables $k$ and $c$ are proportional to the user inputs --max-pool-dist
(default=0.1) and --min-loci-corr
(default=0.9), respectively. The former defines the maximum distance of samples to be considered as one of the k-nearest neighbours, while the latter refers to the minimum correlation with the locus requiring imputation to be included in the estimation of the genetic distance.
LD estimation and imputation per se are multi-threaded, and the imputation output is written into disk as an allele frequency table. The structs, traits, methods, and functions defined in this tool are subsets of poolgen, and will eventually be merged.
imputef -f tests/test_2.vcf --min-loci-corr=-1 --max-pool-dist=-1
### Find the optimal l-loci to use for distance estimation, k-nearest neighbours, minimum loci correlation, and maximum distance
### i.e. the combination of these 4 parameters which minimises the mean absolute error between expected and predicted allele frequencies.
echo 'l,k,corr,dist,mae' > grid_search.csv
for l in 5 10 15
do
for k in 1 2 3 4 5
do
for corr in 0.75 0.95 1.00
do
for dist in 0.0 0.10 0.25
do
echo "@@@@@@@@@@@@@@@@@@@@@@@@"
echo ${l},${k},${corr},${dist}
imputef -f tests/test_2.vcf \
--min-l-loci=${l} \
--min-k-neighbours=${k} \
--min-loci-corr=${corr} \
--max-pool-dist=${dist} \
--n-reps=3 > log.tmp
fname_out=$(tail -n1 log.tmp | cut -d':' -f2 | tr -d ' ')
mae=$(grep "Expected imputation accuracy in terms of mean absolute error:" log.tmp | cut -d':' -f2 | tr -d ' ')
echo ${l},${k},${corr},${dist},${mae} >> grid_search.csv
rm $fname_out log.tmp
done
done
done
done
awk -F',' 'NR == 1 || $5 < min {min = $5; min_line = $0} END {print min_line}' grid_search.csv
Datasets:
Performance metrics:
$$ pi= \begin{cases} 0 \text{ if } \hat g \ne g{true}\ 1 \text{ if } \hat g = g_{true} \end{cases} $$
This is used for genotype classes, i.e., binned allele frequencies: $g = {{1 \over {ploidy}} round(q*ploidy)}$, here $q = P(allele)$. Note that there is alternative way of defining these genotype classes with strict boundaries, i.e., homozygotes have fixed allele frequencies.
u8
instead of f64
in calculating Pearson's correlation between pairs of loci. However, if you encountered this OOM error, please consider using the flag --restrict-linked-loci-per-chromosome
to estimate pairwise LD per chromosome only. This assumes that you have a dense coverage of the genome, i.e., there are enough markers with non-missing data to accurately determine relationships between loci and samples to yield good imputation accuracies.This work was conceived and developed during my employment in Agriculture Victoria. The imputation algorithm in this repository was inspired by the algorithms presented in the 3 papers above and Luke Pembletton's tetraploid imputation algorithm written in R. The core data structures, traits, and methods are largely shared with my open-source (GPLv3) project poolgen.