SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.

Running the code with Docker
Running the code locally
Creating your own database
Check the database

Running the code with Docker

This uses the CPU version of SIFT4G.

Build Docker image

git clone https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB.git
cd SIFT4G_Create_Genomic_DB
docker build -t sift4g_db .

Run an interactive container
```
docker run -it --user $(id -u):$(id -g) -v <your_directory>:<your_directory> sift4g_db /bin/bash
```
Make sure to mount directories that contain:
- your protein database
- the directory containing SIFT4G_Create_Genomic_DB
- your gene and genome files

For example, the full path of my SIFT4G_Create_Genomic_DB directory is /home/pauline/SIFT4G_Create_Genomic_DB and my protein database is in /bigdrive/SIFT_databases/uniprot_sprot.fasta so my command is:

docker run -it --user $(id -u):$(id -g) -v /home/pauline:/home/pauline -v /bigdrive:/bigdrive sift4g_db /bin/bash

Test the Docker installation

3a. C. ruddii example

C. ruddii is a small genome and can quickly test if everything is working. The genome and gene files are automatically downloaded from Ensembl.

Set variables in configuration file.
cd <your_dir>/SIFT4G_Create_Genomic_DB/test_files/candidatus_carsonella_ruddii_pv_config.txt

Edit the config file to set \<PARENT_DIR>, \<PROTEIN_DB>
(See config details) . Be sure to use full paths

mkdir <PARENT_DIR>

Create the database inside the docker container:

# Make an interactive session of the docker container
docker run -it --user $(id -u):$(id -g) -v /home/pauline:/home/pauline -v /bigdrive:/bigdrive sift4g_db /bin/bash

# Run the code inside the container
perl make-SIFT-db-all.pl -config <your_dir>/SIFT4G_Create_Genomic_DB/test_files/candidatus_carsonella_ruddii_pv_config.txt --ensembl_download

It takes ~30 minutes for this database to be generated in \<PARENT_DIR>/\<ORG_VERSION>.

Check the database

The database should be in a folder named something like: candidatus_carsonella_ruddii_pv/ASM1036v1.34

Partial Homo sapiens example

This example uses local files to build a database of human chr21 and mitochondrial genes. Do this exercise if you are building a SIFT database with a genome that is on your local computer.

Files are already provided in /SIFT4G_Create_Genomic_DB/test_files/homo_sapiens_small

Genomic DNA (.fa.gz)
Gene annotation (.gtf.gz)
dbSNP annotations (.vcf.gz)

Set the variables in the config file.

cd <SIFT4G_Create_Genomic_DB directory>/test_files/

Set variables in the config file __homo_sapiens-test.txt__: \<PARENT_DIR> and \<PROTEIN_DB>

Note that \<PARENT_DIR> should be set to the full path of /test_files/homo_sapiens_small
SIFT scripts will look for the genome and gene annotation files in that folder (which are provided in this example).

Create the database:

 # Make an interactive session of the docker container
docker run -it --user $(id -u):$(id -g) -v <your_directory>:<your_directory> sift4g_db /bin/bash

# Run the code to make the database inside the container
 perl make-SIFT-db-all.pl -config <SIFT4G_Create_Genomic_DB directory>/test_files/homo_sapiens-test.txt

It takes ~2 hours for human chr21 and mitochondria predictions to be generated in \<PARENT_DIR>/\<ORG_VERSION>.

Check the database

Please go to Creating your own database to read how to make your own database

You can also run the SIFT4G Annotator inside the container

docker run -it --user $(id -u):$(id -g) -v <your_directory>:<your_directory> sift4g_db /bin/bash

java -jar <Path to SIFT4G_Annotator> -c -i <Path to input vcf file> -d <Path to SIFT4G database directory> -r <Path to your results folder> -t

Running the code locally

Follow these instructions if you are NOT using Docker.

Requirements

SIFT 4G Algorithm
Perl
DBI
Bioperl for running DB::Fasta
LWP
Switch.pm (sudo apt-get install libswitch-perl)
Python3 (which is invoked withpython)
Installation

git clone https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB.git scripts_to_build_SIFT_db

Test Installation

C.ruddii example

C. ruddii is a small genome and can quickly test if everything is working. The genome and gene files are automatically downloaded from Ensembl.

Set variables in configuration file.
cd test_files/candidatus_carsonella_ruddii_pv_config.txt

Edit the config file to set \<PARENT_DIR>, \<SIFT4G_PATH>, \<PROTEIN_DB>
(See config details) . Be sure to use full paths

mkdir <PARENT_DIR>

Go back to the scripts_to_build_SIFT_db directory

cd ..
Make the database:

perl make-SIFT-db-all.pl -config test_files/candidatus_carsonella_ruddii_pv_config.txt --ensembl_download

It takes ~30 minutes for this database to be generated in \<PARENT_DIR>/\<ORG_VERSION>.
Check the database

The database should be in a folder named something like: candidatus_carsonella_ruddii_pv/ASM1036v1.34

Partial Homo sapiens example

This example uses local files to build a database of human chr21 and mitochondrial genes. Do this exercise if you are building a SIFT database with a genome that is on your local computer.

Files are already provided in scripts_to_build_SIFT_db/test_files/homo_sapiens_small

Genomic DNA (.fa.gz)
Gene annotation (.gtf.gz)
dbSNP annotations (.vcf.gz)

Set the variables in the config file.

cd scripts_to_build_SIFT_db/test_files/

Set variables in the config file __homo_sapiens-test.txt__: \<SIFT4G_PATH> and \<PROTEIN_DB>

Note that \<PARENT_DIR> is already set to ./test_files/homo_sapiens_small
SIFT scripts will look for the genome and gene annotation files in that folder (which are provided in this example).
Make the database:

cd ..
perl make-SIFT-db-all.pl -config test_files/homo_sapiens-test.txt

It takes ~2 hours for human chr21 and mitochondria predictions to be generated in \<PARENT_DIR>/\<ORG_VERSION>.
Check the database

Creating your own database

perl make-SIFT-db-all.pl -config <config_file> [--ensembl_download]

Directions for making a SIFT database from:

Ensembl (easiest)
local genomic and gene annotation file (.gtf)
local genomic and gene annotation file (.gff)

Creating a SIFT 4G Database from Ensembl

This will download genome and gene files directly from Ensembl with the option __--ensembl_download__

Set parameters in the config file.

Use test_files/saccharomyces_cerevisiae-template.txt as a template.

a. Set weblinks to Ensembl genome and gene annotation files: GENE_DOWNLOAD_SITE, PEP_FILE, CHR_DOWNLOAD_SITE
Optional: DBSNP_ORGANISM_DOWNLOAD_SITE
Config file details

b. Set output paths: PARENT_DIR, ORG, ORG_VERSION

c. Set genetic codes in GENETIC_CODE_TABLE, MITO_GENETIC_CODE_TABLE

If you're working with a vertebrate, it's
```
GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=2
MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial
```
d. Set SIFT4G paths: SIFT4G_PATH, PROTEIN_DB If using Docker, SIFT4G_PATH=SIFT4G_PATH=/sift4g/bin/sift4g
Create the database:

perl make-SIFT-db-all.pl -config <config file> --ensembl_download
Check the database
Annotate a VCF file with your database

Making a SIFT database from local genomic and gene annotation file (.gtf)

Create a config file.

a. Use test_files/homo_sapiens-test.txt as a template

cd test_files/
cp homo_sapiens-test.txt <my_org_config.txt>

b. In \<my_org_config.txt>, set \<PARENT_DIR>, \<ORG>, \<ORG_VERSION>, \<SIFT4G_PATH>, \<PROTEIN_DB>
Check GENETIC_CODE_TABLE and MITO_GENETIC_CODE_TABLE is set correctly.
Optional to set: \<DBSNP_VCF_FILE> If using Docker, SIFT4G_PATH=SIFT4G_PATH=/sift4g/bin/sift4g

See config details.
Put the genomic fasta files and the gene annotation files in their proper place:

a. Make the folders:

mkdir <PARENT_DIR>
mkdir <PARENT_DIR>/gene-annotation-src
mkdir <PARENT_DIR>/chr-src
mkdir <PARENT_DIR>/dbSNP

b. Move your files into the appropriate folders:

Put compressed genomic fasta files (.fa.gz) in \<PARENT_DIR>/chr-src

mv *.fa.gz <PARENT_DIR>/chr-src

Put compressed gene annotation file (gtf.gz) in \<PARENT_DIR>/gene-annotation-src

mv *.gtf.gz <PARENT_DIR>/gene-annotation-src

Optional: Put compressed dbSNP VCF file in \<PARENT_DIR>/dbSNP
mv *.vcf.gz <PARENT_DIR>/dbSNP

Optional: Put compressed protein file (.pep.all.fa.gz) in \<PARENT_DIR>/gene-annotation-src

mv *.pep.all.fa.gz <PARENT_DIR>/gene-annotation-src

This file is used for checking.

Example of the file structure can be found in test_files/homo_sapiens_small
Run command:

perl make-SIFT-db-all.pl -config <config_file>

WAIT...

This can take 1-24 hours, depending on the size of the genome. Ignore warnings. It's done when the computer shows:

All done!
Check the database
Annotate a VCF file with your database

Making a SIFT database from genomic DNA (.fa.gz) and gene annotation file (.gff)

Use this if you have a gff file (like that supplied from Phytozyme)

Download and install gffread

Convert the gene annotation .gff file to a .gtf file (because SIFT processes gtf files).

mv <gff3.gz> <PARENT_DIR>/gene-annotation-src  
gunzip <*.gff3.gz>
gffread <*.gff3> -T -o [FILENAME].gene.gtf
gzip [FILENAME].gene.gtf

Then follow instructions for building a database using a gtf file

Annotate VCF files with the SIFT Database

Download the SIFT 4G Annotator (a Java executable) here
Commandline: java -jar <Path to SIFT4G_Annotator> -c -i <Path to input vcf file> -d <Path to SIFT4G database directory> -r <Path to your results folder> -t

Note: If your database files start with a 'chr', you must rename them without the 'chr' for the annotator to work.

Example: mv chr19.gz -> 19.gz; mv chr19.regions 19.regions; mv chr19_SIFTDB_stats.txt 19_SIFTDB_stats.txt

Complete instructions here

Check the Database

The database is stored in \<PARENT_DIR>/\<ORG_VERSION> which was set in the config file.

cd <PARENT_DIR>/<ORG_VERSION>

A SIFT database is made for each chromosome in the file \<chr>.gz The SIFT database structure is described here

zcat <chr>.gz | more   # does it look all right?
zcat <chr>.gz | grep CDS | more

SIFT numeric scores will be in columns 10-12. If too many rows say "NA", that's a problem

Global stats are available:

more <PARENT_DIR>/<ORG_VERSION>/CHECK_GENES.LOG

The file CHECK_GENES.LOG contains a summary of SIFT predictions by chromosome. There are 4 columns:

Chr
Genes with SIFT Scores
Pos with SIFT scores
Pos with Confident Scores

The last line summarizes predictions for the entire genome:

ALL # (#/#) # (#/#) # (#/#)

Your database is done if the percentages are high for the last 3 different columns. Woohoo!

Configuration File

Parameter	Description
SIFT4G_PATH	Path to the executable sift4g (installed in step 1 of Requirements). If running with Docker, it's /sift4g/bin/sift4g
PROTEIN_DB	Path to the protein database (.fa or .fasta) Recommend UniProt 90
PARENT_DIR	Path where all the output will be generated. User must have write access.
ORG	Organism name, e.g. homo_sapiens, rat, etc. Should be one word and no special characters.
ORG_VERSION	The final prediction database will be in \<PARENT_DIR>/\<ORG_VERSION>
GENE_DOWNLOAD_SITE (Ensembl only)	URL to download gene annotation (gtf) file
PEP_FILE (Ensembl only)	URL to download the protein sequence file. This file is optional and used for quality control.
CHR_DOWNLOAD_SITE (Ensembl only)	URL to download the genome sequence (.fa)
DBSNP_ORGANISM_DOWNLOAD_SITE (Ensembl only)	URL folder to download dbSNP annotations (optional)
DBSNP_VCF_FILE (optional)	A *.vcf.gz file that will be used to annotate variants with dbSNP rs id's
GENETIC_CODE_TABLE	Genetic code to be used to translate the DNA sequence into proteins (integer value based on NCBI)*
GENETIC_CODE_TABLENAME (not used)	String to remind user what GENETIC_CODE_TABLE is being used
MITO_GENETIC_CODE_TABLE	Fasta sequences named Mt, chrM, or Mito will use this genetic code. (This is for mitochondrial sequences). To disable or edit this feature, edit the function chr_is_mito in dna_protein_subs.pl
MITO_GENETIC_CODE_TABLENAME (not used)	String to remind user what MITO_GENETIC_CODE_TABLE is being used
PLASTID_GENETIC_CODE_TABLE	Fasta sequences named Pt, PT, chrPt, chrPT, or chloroplast will use this genetic code. (This is for chlorplasts). To disable or edit this feature, edit the function chr_is_plastid in dna_protein_subs.pl

Monitoring the Database Creation Process

Because the database can take hours/days to complete, here is what to look for:

If the Terminal says ....	Check the following:
`making single records file`	`ls -lt <PARENT_DIR>/singleRecords/*` is being updated
`make the fasta sequences`	`ls fasta/* \| wc -l` number of files should be increasing
*processing database	SIFT 4G Algorithm is running
`populating databases`	`ls SIFT_predictions/*` Inspect the SIFT prediction files
	`ls -lt <PARENT_DIR>/singleRecords_with_scores/*` is being updated

pauline-ng / SIFT4G_Create_Genomic_DB

readme

SIFT4G_Create_Genomic_DB

Contents

Running the code with Docker

Partial Homo sapiens example

Running the code locally

Requirements

Installation

Test Installation

C.ruddii example

Partial Homo sapiens example

Creating your own database

Creating a SIFT 4G Database from Ensembl

Making a SIFT database from local genomic and gene annotation file (.gtf)

Making a SIFT database from genomic DNA (.fa.gz) and gene annotation file (.gff)

Annotate VCF files with the SIFT Database

Check the Database

Check the Database

Configuration File

Monitoring the Database Creation Process