Genome_Informatics_project

This repo stores the calculation scripts used in our project for the Genomics Informatics course.

snp_metrics.ipynb is currently our final script. The other two are similar and represent earlier versions of it.

Here is some information about the project:

Comparison of GATK HaplotypeCaller and Freebayes Variant Calling tools

Table of contents

Task

Variant-Calling

Before describing the project itself, let's go through some basics it relies on.
Variant calling is the process of finding differences between a reference genome and an observed sample.
Variant calling is usually the final phase of DNA analysis.
There are a number of different types of genomic variants:
Single nucleotide variant
Deletion
Insertion
Inversion
Copy number variant
Translocation
Whole genome duplication
Duplication (tandem or interspersed)

variant_calling_img

Different genomic variants can have different impacts on human cells and on the organism.
SNV (single nucleotide variant): a simple alteration of a single nucleotide, but it can still change the phenotype.
Based on a variant's location, we can predict whether the mutation will have an impact.
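As a small illustration of the simplest variant types listed above, here is a minimal sketch that classifies a variant from the REF and ALT alleles of a VCF record. The function name `classify_variant` is hypothetical and not taken from the project scripts. It only covers simple alleles: larger structural variants (inversions, copy number variants, translocations) are usually written with symbolic alleles such as `<INV>` and are only flagged here, not interpreted.

```python
def classify_variant(ref, alt):
    """Classify a single REF/ALT allele pair from a VCF record.

    Hypothetical helper for illustration; handles one ALT allele at a time.
    """
    if alt.startswith("<") or "[" in alt or "]" in alt:
        # Symbolic alleles (<DEL>, <INV>, ...) and breakend notation
        # describe structural variants that this sketch does not interpret.
        return "symbolic"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    # Same length, more than one base: e.g. a multi-nucleotide substitution.
    return "other"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("A", "AT"), ("AT", "A"), ("A", "<DEL>")]:
        print(ref, alt, classify_variant(ref, alt))
```

Comparing allele lengths is enough to separate SNVs from short insertions and deletions, which is why those three classes are the ones variant callers are most often compared on.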

Tools

Generated VCF files

We can see that var.vcf, the file generated by the Freebayes tool (our test set), is much larger than the file generated by GATK (our truth set),
so we expect it to contain many false positives.
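File size is only a rough indicator, so it can help to compare the number of variant records directly. The sketch below counts the data lines (everything that is not a `#` header) in VCF text. The function names are hypothetical, and `var.vcf` is the only file name taken from this README.

```python
import gzip


def count_vcf_records(lines):
    """Count data records: every non-empty line that is not a '#' header."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_vcf_file(path):
    """Open a plain or gzip-compressed VCF file and count its records."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_vcf_records(handle)


if __name__ == "__main__":
    # Tiny in-memory example; a real call would be count_vcf_file("var.vcf").
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
        "chr1\t200\t.\tC\tT\t50\tPASS\t.\n",
    ]
    print(count_vcf_records(example))
```

A much higher record count in the test set than in the truth set points the same way as the file sizes, but it does not by itself say which calls are wrong; that requires matching the records position by position.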

Manual calculation of metrics

The algorithm is explained in this Python script: snp_metrics.ipynb

Here are our results in graph form:
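As a rough idea of what a manual calculation of this kind can look like, here is a minimal sketch. It is not the code from snp_metrics.ipynb, and it rests on assumptions that may differ from the notebook: SNPs are matched exactly on (chromosome, position, REF, ALT), genotypes and filters are ignored, and the helper names `load_snps` and `snp_metrics` are hypothetical.

```python
def load_snps(lines):
    """Collect SNPs as (chrom, pos, ref, alt) tuples from VCF text lines."""
    snps = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # multi-allelic sites list several ALTs
            if len(ref) == 1 and len(allele) == 1 and allele not in ".*":
                snps.add((chrom, int(pos), ref, allele))
    return snps


def snp_metrics(truth, test):
    """Compare a test SNP set against a truth SNP set."""
    tp = len(truth & test)   # called by both
    fp = len(test - truth)   # called only in the test set
    fn = len(truth - test)   # present only in the truth set
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = load_snps(["chr1\t100\t.\tA\tG\n", "chr1\t200\t.\tC\tT\n"])
    test = load_snps(["chr1\t100\t.\tA\tG\n", "chr1\t300\t.\tG\tA\n"])
    print(snp_metrics(truth, test))
```

Exact matching on position and alleles is the simplest choice; dedicated comparison tools also normalize variant representation before matching, which this sketch does not attempt.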

algorithm_graphs_img

Comparison of metrics calculated by different tools (Graphs)

tools_metrics.png

For additional details, please see our YouTube channel.

The presentation is also available in this repo: powerpoint

Resources used:

Freebayes: Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN], 2012.

HaplotypeCaller