scholl-lab / vcf-filtering

A collection of scripts for filtering annotated variant call format files
MIT License

vcf-filtering

This repository contains scripts and documentation for filtering variant call format (VCF) files to identify rare genetic variants in genes of interest using a streamlined bash script, filter_variants.sh.

Overview

The filter_variants.sh script performs the following steps:

  1. Extract Genes of Interest: Uses snpEff genes2bed to produce a BED file containing the genes of interest.
  2. Sort BED File: Sorts the generated BED file.
  3. Modify BED File: Adds a "chr" prefix to each entry in the BED file if add_chr is set to true.
  4. Extract Variants: Uses bcftools to extract the variants in the BED file from the VCF.
  5. Filter for Rare Variants: Uses SnpSift to filter for rare variants based on the provided filter string.
  6. Extract Fields of Interest: Uses SnpSift again to extract the specified fields of interest.
  7. Modify Header: Removes the "ANN[0]." and "GEN[*]." prefixes from the column names in the header.
  8. Replace GT Values: Uses replace_gt_with_sample.sh to replace the GT values with the sample names.
  9. Save Output: Saves the output to a specified file.
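
Step 3 is the only conditional step in the list. A minimal sketch of how such a switch could be expressed in bash is shown here; the helper name maybe_add_chr and the BED coordinates are made up for illustration and are not taken from filter_variants.sh.

```shell
# Hypothetical helper (not from filter_variants.sh): prefix every BED line
# with "chr" only when the first argument is the string "true".
maybe_add_chr() {
  local add_chr=$1
  if [ "$add_chr" = "true" ]; then
    awk '{ print "chr" $0 }'
  else
    cat   # pass the BED lines through unchanged
  fi
}

# Illustrative one-line BED record with placeholder coordinates.
printf 'X\t100\t200\tOFD1\n' | maybe_add_chr true
printf 'X\t100\t200\tOFD1\n' | maybe_add_chr false
```

Keeping the switch in a filter function lets the rest of the pipeline stay identical whether or not the VCF uses "chr"-prefixed contig names.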

An example of the shell pipeline that the script composes:

snpEff genes2bed GRCh38.mane.1.0.refseq OFD1 |
  sortBed |
  awk '{print "chr"$0}' |
  bcftools view ann.dbnsfp.vcf.gz -R - |
  SnpSift -Xmx8g filter " (( dbNSFP_gnomAD_exomes_AC[0] <= 2 ) | ( na dbNSFP_gnomAD_exomes_AC[0] )) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE')) " |
  SnpSift -Xmx4g extractFields -s "," -e "NA" - \
    CHROM POS REF ALT ID QUAL AC \
    ANN[0].GENE ANN[0].FEATUREID ANN[0].EFFECT ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P \
    dbNSFP_SIFT_pred dbNSFP_Polyphen2_HDIV_pred dbNSFP_MutationTaster_pred dbNSFP_CADD_phred \
    dbNSFP_gnomAD_exomes_AC dbNSFP_gnomAD_genomes_AC dbNSFP_ALFA_Total_AC \
    GEN[*].GT |
  sed -e '1s/ANN\[0\]\.//g; s/GEN\[\*\]\.//g' |
  ./replace_gt_with_sample.sh samples.txt 21 > OFD1_rare_variants.GCKD.tsv
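
In the pipeline above, the argument 21 passed to replace_gt_with_sample.sh matches the position of GEN[*].GT in the extracted field list, where the per-sample genotypes arrive as one comma-separated column. The internals of replace_gt_with_sample.sh are not shown here, so the following is only a sketch of what such a replacement step might look like. The function name replace_gt, the "sample(genotype)" output format, and the ";" separator are all assumptions.

```shell
# Hypothetical stand-in for replace_gt_with_sample.sh; the real script may
# behave differently. Reads a TSV on stdin and rewrites one column.
# Usage: replace_gt <sample_file> <gt_column> < input.tsv
replace_gt() {
  local sample_file=$1 gt_col=$2
  awk -F'\t' -v OFS='\t' -v col="$gt_col" '
    # First input (the sample file): one line of comma-separated names.
    NR == FNR { n = split($0, samples, ","); next }
    # First line of the table on stdin is the header: pass it through.
    FNR == 1 { print; next }
    {
      m = split($col, gts, ",")
      out = ""
      for (i = 1; i <= m && i <= n; i++) {
        # Skip homozygous-reference and missing genotypes.
        if (gts[i] ~ "^(0[/|]0|[.][/|][.]|[.])$") continue
        out = out (out == "" ? "" : ";") samples[i] "(" gts[i] ")"
      }
      $col = out   # left empty when no sample carries the variant
      print
    }
  ' "$sample_file" -
}
```

The sketch assumes the sample file holds a single comma-separated line of sample names, in the same order as the genotype columns of the VCF.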

Usage of filter_variants.sh

./filter_variants.sh [--config config_file] <gene_name> <vcf_file_location> [reference] [add_chr] [filters] [fields_to_extract] [sample_file] [replace_script_location] [output_file]

Parameters:

Configuration File

The script allows users to provide a configuration file containing default values for parameters. The configuration file is sourced if provided, and the values specified in it are used as defaults.

Example of a configuration file:

reference=GRCh38.mane.1.0.refseq
add_chr=true
filters="(( dbNSFP_gnomAD_exomes_AC[0] <= 2 ) | ( na dbNSFP_gnomAD_exomes_AC[0] )) & ((ANN[ANY].IMPACT has 'HIGH') | (ANN[ANY].IMPACT has 'MODERATE'))"
fields_to_extract="CHROM POS REF ALT ID QUAL AC ANN[0].GENE ANN[0].FEATUREID ANN[0].EFFECT ANN[0].IMPACT ANN[0].HGVS_C ANN[0].HGVS_P dbNSFP_SIFT_pred dbNSFP_Polyphen2_HDIV_pred dbNSFP_MutationTaster_pred dbNSFP_CADD_phred dbNSFP_gnomAD_exomes_AC dbNSFP_gnomAD_genomes_AC dbNSFP_ALFA_Total_AC GEN[*].GT"
sample_file=samples.txt
replace_script_location=./replace_gt_with_sample.sh
replace_script_options="--append-genotype"
output_file=variants.tsv
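
Because the configuration file is sourced, it is read as ordinary shell assignments. The sketch below illustrates one way sourced defaults and positional arguments could be combined. The function load_settings and the fallback values are hypothetical, and the precedence shown (positional arguments override the config file) is an assumption rather than the documented behaviour of filter_variants.sh.

```shell
# Sketch only: variable names follow this README; load_settings itself is a
# hypothetical helper, not part of filter_variants.sh.
# Usage: load_settings <config_file_or_empty> <gene_name> <vcf> [reference] [add_chr]
load_settings() {
  local config_file=$1
  shift

  # Illustrative fallbacks, used when neither config nor arguments set a value.
  reference="GRCh38.mane.1.0.refseq"
  add_chr="true"

  # Values from the configuration file replace the fallbacks.
  if [ -n "$config_file" ]; then
    # Use a path containing a slash so that "." does not search PATH.
    . "$config_file"
  fi

  # Positional arguments, where given, take precedence over the config.
  gene_name=$1
  vcf_file_location=$2
  reference=${3:-$reference}
  add_chr=${4:-$add_chr}
}
```

With defaults supplied this way, an invocation might reduce to something like ./filter_variants.sh --config config.cfg OFD1 ann.dbnsfp.vcf.gz, where config.cfg is a placeholder file name.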

Generating the Sample File

To generate the sample file from a multi-sample VCF, you can use the following command:

bcftools view -h /path/to/your_multi_sample.vcf.gz | awk -F'\t' '{ for (i=10; i<=NF; ++i) printf "%s%s", $i, (i==NF ? RS : ",") }' > /path/to/samplefile.txt
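
The awk step relies on the VCF column header line being tab-separated, with sample names starting at field 10. To see what it does without a real VCF, the following sketch feeds it a mock header. The function names and the sample names S1 to S3 are invented for illustration.

```shell
# Mock of `bcftools view -h` output: one meta line plus the #CHROM line.
mock_header() {
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
}

# Join the sample columns (field 10 onward) with commas. Meta lines contain
# no tabs here, so they have a single field and produce no output.
extract_samples() {
  awk -F'\t' '{ for (i = 10; i <= NF; ++i) printf "%s%s", $i, (i == NF ? RS : ",") }'
}

mock_header | extract_samples   # prints S1,S2,S3
```

An equivalent result should also be obtainable with bcftools query -l /path/to/your_multi_sample.vcf.gz | paste -sd, - which lists the sample names one per line and joins them with commas.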

Requirements