jeffersonfparil / imputef

Allele frequency imputation through linkage-informed k-nearest neighbour-derived weighted mean allele frequency prediction
GNU General Public License v3.0
0 stars 0 forks source link

imputef

Impute allele frequencies to reduce sparsity of genotype data from polyploids, pooled individuals, and populations.

Build Status License
License: GPL v3

Overview

Installation

Pre-compiled binaries

  1. Download the appropriate executable binary compatible with your system
  1. Configure and execute
chmod +x imputef
./imputef

Compile from source

  1. Clone the repository
git clone https://jeffersonfparil:<API_KEY>@github.com/jeffersonfparil/imputef.git main
  1. Load the Rust development environment via Conda (please see Conda installation instructions if you do not have Conda pre-installed)
cd imputef/
conda env create --file res/rustenv.yml
conda activate rustenv
  1. Compile and optionally create an alias or a symbolic link or add to $PATH
cargo build --release
target/release/imputef -h
# Option 1: alias
echo alias imputef="$(pwd)/target/release/imputef" >> ~/.bashrc
source ~/.bashrc
# Option 2: symlink
sudo ln -s $(pwd)/target/release/imputef /usr/bin/imputef
# Option 3: add to $PATH (Note that you need to place this in your ~/.bashrc or the appropriate shell initialisation file)
export PATH=${PATH}:$(pwd)/target/release
### Check
type -a imputef
cd ~; imputef -h; cd -

Usage

Quickstart

imputef -h
imputef -f tests/test.tsv   # allele frequency table as input (tab-delimited)
imputef -f tests/test.csv   # allele frequency table as input (comma-separated)
imputef -f tests/test.ssv   # allele frequency table as input (semi-colon-separated)
imputef -f tests/test.sync  # synchronised pileup file as input
imputef -f tests/test.vcf   # variant call format as input without missing data
imputef -f tests/test_2.vcf   # variant call format as input
imputef -f tests/test_2.vcf --method mean # use mean value imputation
imputef -f tests/test_2.vcf --min-loci-corr=0.75 --max-pool-dist=0.25 # define some minimum loci correlation and maximum genetic distance thresholds

Arguments

Argument or flag Description
--fname Filename of the genotype file to be imputed in uncompressed vcf, sync, or allele frequency table. Details on these genotype formats are available below.
--method Imputation method. Use "mean" for mean value imputation or "aldknni" for adaptive LD-kNN imputation. [Default = "aldknni"]
--min-coverage Minimum coverage per locus, i.e. if at a locus, a pool falls below this value (does not skip missing data, i.e. missing locus has a depth of zero), then the whole locus is omitted. Set this to zero if the vcf has been filtered and contains missing values, i.e. ./. or .|.. [Default = 0]
--min-allele-frequency Minimum allele frequency per locus, i.e. if at a locus, a pool has all its alleles below this value and/or above the additive complement of this value (skipping missing data), then the entire locus is omitted. [Default = 0.0001]
--max-missingness-rate-per-locus Maximum fraction of pools missing per locus, i.e. if at a locus, there were more pools missing than the coverage dictated by this threshold, then the locus is omitted. [Default = 1.00]
--pool-sizes Vector of pool sizes, i.e. the number of individuals included in each pool. Enter the pool sizes separated by commas ,. This can also be set to a single arbitrarily large value for example 100 for individual polyploids or if allele frequency estimates are expected to be accurate. [Default = 100.0]
--min-depth-below-which-are-missing Minimum depth at which loci with depth below this threshold are set to missing. Set to one if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to an integer above zero. [Default = 1.00]
--max-depth-above-which-are-missing Maximum depth at which loci with depth above this threshold are set to missing. Set to some large arbitrarily large value (e.g. 1000000) if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to an integer above zero. [Default = 1000000.0]
--frac-top-missing-pools Fraction of pools with the highest number of missing loci to be omitted. Set to zero if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to a decimal number between zero and one. [Default = 0.0]
--frac-top-missing-loci Fraction of loci with the highest number of pools with missing data to be omitted. Set to zero if the input vcf has already been filtered and the loci beyond the depth thresholds have been set to missing, otherwise set to an decimal number between zero and one. [Default = 0.0]
--min-loci-corr Minimum correlation (Pearson's correlation) between the locus requiring imputation and other loci deemed to be in linkage with it. Ranges from 0.0 to 1.0, but use -1.0 or any negative number to perform per locus optimisations to find the best value minimising imputation. [Default = 0.9]
--max-pool-dist Maximum genetic distance (mean absolute difference in allele frequencies) between the pool or sample requiring imputation and pools or samples deemed to be the closest neighbours. Ranges from 0.0 to 1.0, but use -1.0 or any negative number to perform per locus optimisations to find the best value minimising imputation. [Default = 0.1]
--min-l-loci Minimum number of linked loci to be used in estimating genetic distances between the pool or sample requiring imputation and other pools or samples (minimum value of 1). This argument overrides --min-loci-corr, i.e. the minimum number of loci will be met regardless of the minimum loci correlation threshold. [Default = 20]
--min-k-neighbours Minimum number of k-nearest neighbours of the pool or sample requiring imputation (minimum value of 1). This argument overrides --max-pool-dist, i.e. the minimum number of k-nearest neighbours will be met regardless of the maximum genetic distance threshold. [Default = 5]
--restrict-linked-loci-per-chromosome Restrict the choice of linked loci to within the chromosome the locus requiring imputation belongs to? [default: false] [Default = false; i.e. no flag]
--n-reps Number of replications for the estimation of imputation accuracy in terms of mean absolute error (MAE). It is used to define the number of random non-missing samples to use as replicates for the estimation of MAE and optimisation (minimum value of 1). [Default = 10]
--n-threads Number of computing threads or processor cores to use in the computations. [Default = 2]
--fname-out-prefix Prefix of the output files including the imputed allele frequency table (<fname-out-prefix>-<time>-<random-id>-IMPUTED.tsv). [Default = ""; which corresponds to the name of the input genotype file]

Genotype data formats

Note

Header line and comments should be prepended by '#'.

Variant call format (vcf)

Synchronised pileup (sync)

Allele frequency table (tsv)

Details

Imputation of allele frequency from polyploids and pooled samples via (1) mean value imputation, and (2) linkage disequillibrium (LD) k-nearest neighbour (kNN)-based weighted allele frequency prediction. The latter is an extension of the LD-kNNi method of Money et al, 2015, i.e. LinkImpute, which was an extension of the kNN imputation of Troyanskaya et al, 2001. Similar to LD-kNNi, LD is estimated using Pearson's product moment correlation per pair of loci. Mean absolute difference in allele frequencies is used to define the genetic distance between samples, instead of taxicab or Manhattan distance used in LD-kNNi. Four parameters can be set by the user:

  1. minimum loci correlation threshold - dictates the minimum LD between the locus requiring imputation and other loci which will be used to estimate genetic distance between samples;
  2. maximum genetic distance threshold - sets the maximum genetic distance between the sample requiring imputation and the samples (i.e. nearest neighbours) to be used in weighted mean imputation of missing allele frequencies;
  3. minimum number of loci linked to the locus requiring imputation - overrides minimum loci correlation threshold if this minimum is not met; and
  4. minimum k-nearest neighbours - overrides maximum genetic distance threshold if this minimum is not met.

The first two parameters (minimum loci correlation and maximum genetic distance thresholds) can be optimised per locus by setting --min-loci-corr=-1.0 and/or --max-pool-dist=-1.0.

--method="mean": mean value imputation

This imputation uses the arithmetic mean of the observed allele frequencies across all samples where the locus was genotyped:

$$ \hat q{r,j} = { {1 \over (n-m)} { \sum{i \ne r}^{n} q_{i,j} } } $$

where:

--method="aldknni": adaptive linkage disequilibrium (LD)-based k-nearest neighbour imputation of genotype data

This is an extension of the LD-kNNi method of Money et al, 2015, i.e. LinkImpute, which was an extension of the kNN imputation of Troyanskaya et al, 2001. Similar to LD-kNNi, linkage disequilibrium (LD) is estimated using Pearson's product moment correlation per pair of loci, which is computed per chromosome by default, but can be computed across the entire genome. We use the mean absolute difference/error (MAE) between allele frequencies among linked loci as an estimate of genetic distance between samples. Fixed values for the minimum correlation threshold to identify loci used in distance estimation, and maximum genetic distance threshold to select the k-nearest neighbours can be defined. Additionally, minimum number of loci to include in distance estimation, and minimum number of nearest neighbours can be set. Moreover, the minimum correlation and maximum genetic distance can be optimised per locus by minimising the MAE between predicted and expected allele frequencies using simulated missing or masked data. The number of masked data is controlled by the number of replications (--n-reps) and the total number of samples non-missing at the locus requiring imputation.

The imputed allele frequency is computed as:

$$ \hat q{r,j} = { \sum{i \ne r}^{k} q{i,j} (1 - \delta{i,r}) } $$

with:

$$ \delta{i,r} = { {1 \over \sum d{i,r}} d_{i,r} } $$

and

$$ d{i,r} = { {1 \over c} { \sum{j=1}^{c} |q{i,j} - q{r,j}| } } $$

where:

The variables $k$ and $c$ are proportional to the user inputs --max-pool-dist (default=0.1) and --min-loci-corr (default=0.9), respectively. The former defines the maximum distance of samples to be considered as one of the k-nearest neighbours, while the latter refers to the minimum correlation with the locus requiring imputation to be included in the estimation of the genetic distance.

LD estimation and imputation per se are multi-threaded, and the imputation output is written into disk as an allele frequency table. The structs, traits, methods, and functions defined in this tool are subsets of poolgen, and will eventually be merged.

Optimisation approaches

  1. Using the built-in grid-search optimisation for minimum loci correlation and maximum genetic distance thresholds per locus:
imputef -f tests/test_2.vcf --min-loci-corr=-1 --max-pool-dist=-1
  1. Grid-search optimisation for all four parameters which assumes a common set of optimal parameters across all loci
### Find the optimal l-loci to use for distance estimation, k-nearest neighbours, minimum loci correlation, and maximum distance
### i.e. the combination of these 4 parameters which minimises the mean absolute error between expected and predicted allele frequencies.
echo 'l,k,corr,dist,mae' > grid_search.csv
for l in 5 10 15
do
    for k in 1 2 3 4 5
    do
        for corr in 0.75 0.95 1.00
        do
            for dist in 0.0 0.10 0.25
            do
                echo "@@@@@@@@@@@@@@@@@@@@@@@@"
                echo ${l},${k},${corr},${dist}
                imputef -f tests/test_2.vcf \
                    --min-l-loci=${l} \
                    --min-k-neighbours=${k} \
                    --min-loci-corr=${corr} \
                    --max-pool-dist=${dist} \
                    --n-reps=3 > log.tmp
                fname_out=$(tail -n1 log.tmp | cut -d':' -f2 | tr -d ' ')
                mae=$(grep "Expected imputation accuracy in terms of mean absolute error:" log.tmp | cut -d':' -f2 | tr -d ' ')
                echo ${l},${k},${corr},${dist},${mae} >> grid_search.csv
                rm $fname_out log.tmp
            done
        done
    done
done
awk -F',' 'NR == 1 || $5 < min {min = $5; min_line = $0} END {print min_line}' grid_search.csv

Performance evaluation

Datasets:

Performance metrics:

$$ pi= \begin{cases} 0 \text{ if } \hat g \ne g{true}\ 1 \text{ if } \hat g = g_{true} \end{cases} $$

This is used for genotype classes, i.e., binned allele frequencies: $g = {{1 \over {ploidy}} round(q*ploidy)}$, here $q = P(allele)$. Note that there is alternative way of defining these genotype classes with strict boundaries, i.e., homozygotes have fixed allele frequencies.




Autotetraploid (Cocksfoot/Orchardgrass) mean absolute error

mae_barplots

Pool (Soybean pools) mean absolute error

mae_barplots

Diploid (Grape) mean absolute error

mae_barplots




Autotetraploid (Cocksfoot/Orchardgrass) concordance of observed and imputed genotype classes

concordance_genotype_classes_barplots

Pool (Soybean pools) concordance of observed and imputed genotype classes

concordance_genotype_classes_barplots

Diploid (Grape) concordance of observed and imputed genotype classes

concordance_genotype_classes_barplots




Troubleshooting

References

Acknowledgements

This work was conceived and developed during my employment in Agriculture Victoria. The imputation algorithm in this repository was inspired by the algorithms presented in the 3 papers above and Luke Pembletton's tetraploid imputation algorithm written in R. The core data structures, traits, and methods are largely shared with my open-source (GPLv3) project poolgen.