Variant effect prediction with a subset of models in Kipoi
MIT License

kipoi-veff2

Python 3.6 Coverage Status License: MIT

This is a tool similar to the Ensembl Variant Effect Predictor (VEP) that works with a subset of Kipoi models. Models in Kipoi can be broadly classified into two groups: variant centered and interval based.

Available models

We currently support the following models and model groups.

Model group     Type
Basset          Variant centered
DeepSEA         Variant centered
DeepBind        Variant centered
MPRA-DragoNN    Variant centered
pwm_HOCOMOCO    Variant centered
Basenji         Variant centered
MMSplice        Interval based
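
As a rough illustration of how the list above might be used, the sketch below maps a model name to its type by looking at the model group, which is the part of the name before the first "/". The MODEL_GROUP_TYPES dictionary and the model_type helper are hypothetical names chosen for this example and are not part of kipoi-veff2.

```python
# Hypothetical lookup table, transcribed from the list of supported model groups.
MODEL_GROUP_TYPES = {
    "Basset": "Variant centered",
    "DeepSEA": "Variant centered",
    "DeepBind": "Variant centered",
    "MPRA-DragoNN": "Variant centered",
    "pwm_HOCOMOCO": "Variant centered",
    "Basenji": "Variant centered",
    "MMSplice": "Interval based",
}


def model_type(model_name):
    """Return the type of a model, given a name such as "DeepSEA/predict"."""
    # The model group is everything before the first "/".
    model_group = model_name.split("/")[0]
    try:
        return MODEL_GROUP_TYPES[model_group]
    except KeyError:
        raise ValueError(
            "Unsupported model group: {!r}".format(model_group)
        ) from None


print(model_type("DeepSEA/predict"))    # Variant centered
print(model_type("MMSplice/mtsplice"))  # Interval based
```

A lookup like this could be used to decide which of the two usage patterns described under Usage applies to a given model.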

Installation

Installation has two stages: first, a conda environment is created with the necessary dependencies; then kipoi-veff2 is installed into that environment.

  1. Install the conda environment appropriate to your operating system

    Currently, the provided conda environments resolve on Ubuntu, macOS and CentOS.

    Ubuntu

    conda env create -f environment.ubuntu.yml

    MacOS

    conda env create -f environment.osx.yml

    General purpose environment

    A leaner environment with a minimal set of dependencies is available in environment.minimal.linux.yml. It has been tested on CentOS Linux with conda 4.7.10. This environment intentionally omits snakemake to keep it minimal, so be sure to install snakemake before using the Snakefile inside examples.

  2. Install kipoi-veff2

    conda activate kipoi-veff2
    python -m pip install .

    Note: For older versions of conda (4.7.10), pinning cyvcf2 to 0.11 seems to work on CentOS Linux.

kipoi-veff2 docker images

Alternatively, two ready-to-use Docker images are available on Docker Hub. Running an image returns a shell with an activated conda environment in which all dependencies and kipoi_veff2 are already installed. The details are as follows.

Pull docker images

docker pull kipoi/kipoi-veff2:py36  # Python 3.6
docker pull kipoi/kipoi-veff2:py37  # Python 3.7

Run docker image

docker run -v $PWD:/tmp kipoi/kipoi-veff2:py37 kipoi_veff2_predict /tmp/input.vcf /tmp/input.fa /tmp/output.tsv -m "DeepSEA/predict" -s "diff" -s "logit"

Tests

Package

pytest -k "not workflow"  tests

Workflow

cd examples && snakemake -j4 && cd ../ && pytest -k "workflow" tests

Usage

Variant centered

kipoi_veff2_predict <input-vcf> <input-fasta> <output-tsv> -m "DeepSEA/predict" 

or

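from kipoi_veff2 import variant_centered

# Example inputs; replace with your own model name and file paths.
model_name = "DeepSEA/predict"
vcf_file = "input.vcf"
fasta_file = "input.fa"
output_file = "output.tsv"

# The model group is the part of the model name before the first "/".
model_group = model_name.split("/")[0]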
model_group_config_dict = (
    variant_centered.VARIANT_CENTERED_MODEL_GROUP_CONFIGS.get(
        model_group, {}
    )
)

model_config = variant_centered.get_model_config(model_name, **model_group_config_dict)

variant_centered.score_variants(
    model_config=model_config,
    vcf_file=vcf_file,
    fasta_file=fasta_file,
    output_file=output_file,
)

You can specify a list of scoring functions defined in kipoi_veff2.scores like so:

kipoi_veff2_predict  <input-vcf> <input-fasta> <output-tsv> -m "DeepSEA/predict" -s "diff" -s "logit"

or

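from kipoi_veff2 import scores, variant_centered

# Example inputs; replace with your own model name and file paths.
model_name = "DeepSEA/predict"
vcf_file = "input.vcf"
fasta_file = "input.fa"
output_file = "output.tsv"

# The model group is the part of the model name before the first "/".
model_group = model_name.split("/")[0]
model_group_config_dict = (
    variant_centered.VARIANT_CENTERED_MODEL_GROUP_CONFIGS.get(
        model_group, {}
    )
)

model_config = variant_centered.get_model_config(model_name, **model_group_config_dict)

# Each scoring function is given as a name together with the callable itself.
variant_centered.score_variants(
    model_config=model_config,
    vcf_file=vcf_file,
    fasta_file=fasta_file,
    output_file=output_file,
    scoring_functions=[
        {"name": "diff", "func": scores.diff},
        {"name": "logit", "func": scores.logit},
    ],
)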

Sequence length

Currently, there are three ways to define the required sequence length of a model in this category.

  1. Through the CLI using the -l flag. This option has the highest priority and will override any default in the source code.

  2. Through variant_centered.VARIANT_CENTERED_MODEL_GROUP_CONFIGS. See the entry for pwm_HOCOMOCO as an example. Currently this feature is only available per model group.

  3. Otherwise, sequence length is inferred from auto_resize_len key of the respective dataloader description.
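
The priority order above can be sketched as a small function. This is a minimal illustration and not the package's own code: resolve_sequence_length, its parameters and the "required_sequence_length" key are hypothetical names, and the numbers are made up to show the precedence.

```python
def resolve_sequence_length(cli_length, group_config, dataloader_defaults):
    """Pick a sequence length using the priority order described above."""
    # 1. A value passed on the command line (-l) wins over everything else.
    if cli_length is not None:
        return cli_length
    # 2. Next, a per-model-group setting (the key name here is hypothetical).
    if "required_sequence_length" in group_config:
        return group_config["required_sequence_length"]
    # 3. Finally, fall back to auto_resize_len from the dataloader description.
    return dataloader_defaults["auto_resize_len"]


# Made-up values, used only to show which source takes precedence.
group_config = {"required_sequence_length": 100}
dataloader_defaults = {"auto_resize_len": 1000}

print(resolve_sequence_length(500, group_config, dataloader_defaults))   # 500
print(resolve_sequence_length(None, group_config, dataloader_defaults))  # 100
print(resolve_sequence_length(None, {}, dataloader_defaults))            # 1000
```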

Scoring function

Currently, scoring functions must be defined in kipoi_veff2.scores. Each model in this category uses "diff" as its default scoring function. The only exception is Basenji, which uses "basenji_effect" by default. There are two ways to indicate which scoring function to use.

  1. Through the CLI using the -s flag. This option has the highest priority and will override any default. Specify only the name of the function, and the tool will infer which function to call.

  2. Through variant_centered.VARIANT_CENTERED_MODEL_GROUP_CONFIGS. See the entry for Basenji as an example. Currently this feature is only available per model group.
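
A minimal sketch of how names given with -s could be turned into the list of {"name": ..., "func": ...} entries that score_variants accepts, with the group default used when no name is given. The scores object below is a stand-in with placeholder functions rather than the real kipoi_veff2.scores module, and resolve_scoring_functions and GROUP_DEFAULTS are hypothetical names for this example.

```python
from types import SimpleNamespace


# Placeholder scoring functions. Their bodies are assumptions for illustration
# and do not reproduce the definitions in kipoi_veff2.scores.
def diff(ref_pred, alt_pred):
    return alt_pred - ref_pred


def basenji_effect(ref_pred, alt_pred):
    return alt_pred - ref_pred  # placeholder body only


# Stand-in for the kipoi_veff2.scores module.
scores = SimpleNamespace(diff=diff, basenji_effect=basenji_effect)

# Per-group defaults as described in the text: everything uses "diff"
# except Basenji.
GROUP_DEFAULTS = {"Basenji": "basenji_effect"}


def resolve_scoring_functions(model_name, cli_names=()):
    """Names from the command line win; otherwise use the group default."""
    model_group = model_name.split("/")[0]
    names = list(cli_names) or [GROUP_DEFAULTS.get(model_group, "diff")]
    resolved = []
    for name in names:
        # Look the function up by name, as the -s flag is described to do.
        func = getattr(scores, name, None)
        if func is None:
            raise ValueError("Unknown scoring function: {!r}".format(name))
        resolved.append({"name": name, "func": func})
    return resolved


print([s["name"] for s in resolve_scoring_functions("DeepSEA/predict")])
# ['diff']
print([s["name"] for s in resolve_scoring_functions("Basenji")])
# ['basenji_effect']
print([s["name"] for s in resolve_scoring_functions("Basenji", ["diff"])])
# ['diff']
```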

Batch size

Interval based

kipoi_veff2_predict <input-vcf> <input-fasta> -g <input-gtf> <output-tsv> -m "MMSplice/mtsplice"

or

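from kipoi_veff2 import interval_based

# Example inputs; replace with your own model name and file paths.
model_name = "MMSplice/mtsplice"
vcf_file = "input.vcf"
fasta_file = "input.fa"
gtf_file = "input.gtf"
output_file = "output.tsv"

model_config = interval_based.INTERVAL_BASED_MODEL_CONFIGS[model_name]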
interval_based.score_variants(
    model_config=model_config,
    vcf_file=vcf_file,
    fasta_file=fasta_file,
    gtf_file=gtf_file,
    output_file=output_file,
)

Optional merge functionality

For model groups that contain a large number of models (for example, DeepBind), it is more convenient to output a single file by merging the scored effect predictions across all the models in the group. For this, we provide a merge CLI as described below.

kipoi_veff2_merge output1.tsv output2.tsv ... output.10.tsv merged.tsv
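
When there are many per-model output files, the merge command can be assembled programmatically. The sketch below only builds the argument list in the form shown above (input files first, merged file last). build_merge_command is a hypothetical helper and the file names are placeholders.

```python
def build_merge_command(input_files, merged_file):
    """Return the kipoi_veff2_merge argument list: inputs first, output last."""
    if not input_files:
        raise ValueError("At least one input file is required")
    return ["kipoi_veff2_merge"] + list(input_files) + [merged_file]


# Placeholder file names for ten per-model outputs.
inputs = ["output{}.tsv".format(i) for i in range(1, 11)]
command = build_merge_command(inputs, "merged.tsv")
print(" ".join(command))

# To actually run it (requires kipoi-veff2 to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.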

Running multiple models and/or vcf/fasta pairs

Preparing the vcf and fasta files

Snakemake workflow

General recommendations