This repo stores the calculation scripts used in our project for the Genomics Informatics course.
snp_metrics.ipynb is our current final script. The other two are similar and represent previous versions of it.
The objective of the project was to perform variant calling on a provided BAM file, with the human genome as the reference (FASTA file), using two variant calling tools: GATK 4 HaplotypeCaller and freebayes.
The second task was to compare the output files from the previous step, taking HaplotypeCaller as the truth set and freebayes as the test set, and to extract metrics from that comparison: true positives, false positives, and false negatives, as well as precision, recall, and F-score.
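The comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the code from snp_metrics.ipynb: it assumes each variant is identified by a (chromosome, position, ref, alt) tuple, that the F-score is the F1 score (harmonic mean of precision and recall), and the function names and example variants are hypothetical.

```python
def compare_call_sets(truth, test):
    """Count TP, FP and FN between two sets of variant keys."""
    tp = len(truth & test)  # variants called by both tools
    fp = len(test - truth)  # variants only in the test set
    fn = len(truth - test)  # truth variants missed by the test set
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 (assumed meaning of F-score)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical variant keys: (chromosome, position, ref, alt)
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "A")}
test = {
    ("chr1", 100, "A", "G"),
    ("chr2", 75, "G", "A"),
    ("chr2", 90, "T", "C"),
    ("chr3", 10, "A", "C"),
}

tp, fp, fn = compare_call_sets(truth, test)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn)  # 2 2 1
```

Using sets keeps the comparison simple, but it treats two calls as equal only when position and alleles match exactly, which is stricter than what dedicated comparison tools do.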
Before we start with the project, let’s go through some basics we needed for it.
Variant calling is the process of finding differences between a reference genome and an observed sample.
Variant calling is usually the final phase of DNA analysis.
There are a number of different types of genomic variants:
Single nucleotide variant
Deletion
Insertion
Inversion
Copy number variant
Translocation
Whole genome duplication
Duplication (tandem or interspersed)
Different genomic variants can have different impacts on human cells and on the organism.
A single nucleotide variant (SNV) is a simple alteration of a single nucleotide, but it can still affect the phenotype.
Based on a variant's location, we can predict whether the mutation will have an impact.
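To make the small variant types concrete, here is a simplified sketch of how an SNV, an insertion, and a deletion can be told apart from the REF and ALT alleles of a VCF record. The function name is hypothetical, and the sketch assumes a single ALT allele written as plain bases; it does not handle multi-allelic records or symbolic alleles, and larger structural variants are encoded differently.

```python
def classify_variant(ref, alt):
    """Classify a simple variant by comparing REF and ALT allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"        # one base replaced by another
    if len(ref) < len(alt):
        return "insertion"  # ALT carries extra bases
    if len(ref) > len(alt):
        return "deletion"   # ALT is missing bases
    return "other"          # e.g. equal-length multi-base substitution


for ref, alt in [("A", "G"), ("A", "ATG"), ("ATG", "A"), ("AT", "GC")]:
    print(ref, alt, classify_variant(ref, alt))
```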
GATK 4 is a very large toolkit that can perform many different tasks.
In our solution we use only a small part of it: the HaplotypeCaller tool.
Using this tool requires an installed Java JDK and the corresponding Python and R libraries.
Fortunately, there is a more convenient way to get all of this, the official Docker image, so that was our choice.
We also used the Seven Bridges platform to build an app for our task. Note that a GATK app already existed on the platform, but it is based on a much earlier release and is too outdated for our needs.
Here is the command we ran (on the CGC portal):
gatk --java-options "-Xmx4g" HaplotypeCaller -R reference_file.fasta -I analised_file.bam -O output.vcf.gz
The -R switch specifies the reference file and the -I switch the input BAM file. The -O switch sets the name of the output file, which is a compressed VCF, so we will need to unzip it.
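As an alternative to unzipping by hand, the compressed VCF can be read directly with Python's standard gzip module. The sketch below is an illustration only: the function name is hypothetical, and it assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, ...) with tab-separated fields.

```python
import gzip


def read_variant_keys(path):
    """Yield (chrom, pos, ref, alt) for each record in a gzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip blank lines and header lines, which start with '#'
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            # Columns: 0 = CHROM, 1 = POS, 3 = REF, 4 = ALT
            yield fields[0], int(fields[1]), fields[3], fields[4]


# Example usage with the output file name from the command above:
# truth_keys = set(read_variant_keys("output.vcf.gz"))
```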
freebayes is a much smaller tool than GATK and is specialized for finding small polymorphisms.
Best of all, it comes as a prebuilt 64-bit Linux binary, so we don't need to install anything.
Again, here is the command we ran (on Ubuntu Linux):
./freebayes -f reference_file.fasta -b analised_file.bam > var.vcf
The -f switch specifies the reference FASTA file, followed by the BAM file to analyse. After the greater-than sign (>) we put the name of the output file, var.vcf in this case.
We can see that var.vcf, the file generated by freebayes (the test set), is much larger than the file generated by GATK (the truth set),
so many false positives are expected.
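The reasoning behind this expectation can be written down as a small calculation. If both call sets contain unique variants, the number of true positives cannot exceed the size of the smaller set, so the size ratio alone caps the achievable precision. The function name and the counts below are hypothetical and are not our measured results.

```python
def precision_upper_bound(n_truth, n_test):
    """Best-case precision given only the sizes of the truth and test sets.

    Assumes each set holds unique variants, so TP <= min(n_truth, n_test)
    and precision = TP / n_test.
    """
    if n_test == 0:
        return 0.0
    return min(n_truth, n_test) / n_test


# Hypothetical sizes: a test set four times larger than the truth set
print(precision_upper_bound(1000, 4000))  # 0.25
```

Even if every truth variant were found by the test set, three quarters of the test calls in this example would still count as false positives.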
The algorithm is explained in this Python notebook: snp_metrics.ipynb
Here are our results as a graph:
For additional details, please see our YouTube channel.
The presentation is also available in this repo: powerpoint
freebayes: Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN], 2012.