aamirwkhan06 commented 6 years ago

Dear SIFT 4G team

I followed the instructions from "https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB" to construct the database for a new organism. In my case, I am constructing the database for Cicer arietinum. I am getting the following error while running the script "make-SIFT-db-all.pl" using the following command:

Command: perl make-SIFT-db-all.pl -config test_files/cicer_arietinum_config.txt

Log:

perl make-SIFT-db-all.pl -config test_files/cicer_arietinum_config.txt
converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
cat: ./test_files/cicer_arietinum_genome/fasta/*.fasta: No such file or directory
/data/ngs/Programs_latest/SIFT4G_v2.0.0/bin/sift4g -d ./test_files/protein_db/uniref90.fasta -q ./test_files/cicer_arietinum_genome/all_prot.fasta --subst ./test_files/cicer_arietinum_genome/subst --out ./test_files/cicer_arietinum_genome/SIFT_predictions --sub-results
Checking query data and substitutions files

EXITING! No valid queries to process.

I also tried running the same script for the test human dataset provided with the package, but I am observing a different error:

Command: perl make-SIFT-db-all.pl -config test_files/homo_sapiens-test.txt

Log:

converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
/data/ngs/Programs_latest/SIFT4G_v2.0.0/bin/sift4g -d ./test_files/protein_db/uniref90.fasta -q ./test_files/homo_sapiens_small/all_prot.fasta --subst ./test_files/homo_sapiens_small/subst --out ./test_files/homo_sapiens_small/SIFT_predictions --sub-results
Checking query data and substitutions files
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error

My config file for Cicer arietinum looks like this:

GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=2
MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial

PARENT_DIR=./test_files/cicer_arietinum_genome
ORG=cicer_arietinum
ORG_VERSION=v1.0
DBSNP_VCF_FILE=

#Running SIFT 4G
SIFT4G_PATH=/data/ngs/Programs_latest/SIFT4G_v2.0.0/bin/sift4g
PROTEIN_DB=./test_files/protein_db/uniref90.fasta
COMPUTER=mrna

GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
LOGFILE=Log.txt
ZLOGFILE=Log2.txt
FASTA_DIR=fasta
SUBST_DIR=subst
ALIGN_DIR=SIFT_alignments
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
DBSNP_DIR=dbSNP

FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log
ENS_PATTERN=ENS
SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord
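The "cat: ... fasta/*.fasta: No such file or directory" message in the first log suggests that the directory PARENT_DIR/FASTA_DIR held no protein FASTA files when the sift4g step started. A minimal sketch of a pre-flight check follows. It assumes the config is plain KEY=VALUE lines with optional "#" comments; `parse_config` and `count_fasta_files` are hypothetical helper names and are not part of the SIFT4G scripts.

```python
from pathlib import Path


def parse_config(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def count_fasta_files(settings):
    """Count *.fasta files in PARENT_DIR/FASTA_DIR (0 if the directory is missing)."""
    fasta_dir = Path(settings["PARENT_DIR"]) / settings["FASTA_DIR"]
    if not fasta_dir.is_dir():
        return 0
    return len(list(fasta_dir.glob("*.fasta")))


# Values copied from the config above; only the keys needed for the check.
example_config = """
PARENT_DIR=./test_files/cicer_arietinum_genome
ORG=cicer_arietinum
#Running SIFT 4G
FASTA_DIR=fasta
"""

settings = parse_config(example_config)
print("fasta files found:", count_fasta_files(settings))
```

If the count is zero after the "make the fasta sequences" step, the problem lies upstream of sift4g (for example in the gene annotation input), not in the alignment step itself.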

I need help resolving these errors and constructing the database for my organism.

I would really appreciate any guidance on resolving the errors described above.

I would be thankful for all the help.

Best regards Aamir

pauline-ng commented 6 years ago

Hi Aamir,

From the README on creating databases:

"In the gtf file, make sure the 9th column (attribute column) says gene_biotype "protein_coding;" for rows which are labelled as exon, CDS, stop_codon, and start_codon."

For each protein, you'll need to annotate exon, CDS, stop_codon, and start_codon. The scripts make use of these positions to translate genomic sequence into amino acid sequence.
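The README requirement quoted above can be checked before running the pipeline. The following is a minimal sketch, assuming a standard 9-column, tab-separated GTF in which the attribute column uses the `key "value";` form. `summarize_gtf` is a hypothetical helper and the sample rows are made up for illustration.

```python
import re
from collections import Counter

REQUIRED_FEATURES = ("exon", "CDS", "start_codon", "stop_codon")
BIOTYPE = re.compile(r'gene_biotype "([^"]*)"')


def summarize_gtf(lines):
    """For each required feature, return (rows seen, rows tagged protein_coding)."""
    total = Counter()
    coding = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF row
        feature, attributes = cols[2], cols[8]
        if feature not in REQUIRED_FEATURES:
            continue
        total[feature] += 1
        match = BIOTYPE.search(attributes)
        if match and match.group(1) == "protein_coding":
            coding[feature] += 1
    return {f: (total[f], coding[f]) for f in REQUIRED_FEATURES}


# Made-up rows: the exon row carries the attribute, the CDS row does not.
sample = [
    'Ca1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t1"; gene_biotype "protein_coding";\n',
    'Ca1\tsrc\tCDS\t150\t297\t.\t+\t0\tgene_id "g1"; transcript_id "t1";\n',
]

for feature, (rows, tagged) in summarize_gtf(sample).items():
    print(f"{feature}: {rows} rows, {tagged} tagged protein_coding")
```

To run it on a real annotation, pass an open file handle instead of `sample` (use `gzip.open(path, "rt")` for a .gtf.gz file). A feature with zero rows, or with fewer tagged rows than total rows, points to the kind of mismatch described above.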

Thanks, Pauline

pauline-ng commented 6 years ago

Hi Aamir,

I opened up the gtf.gz file you sent. It doesn't follow the format in the instructions:

"In the gtf file, make sure the 9th column (attribute column) says gene_biotype "protein_coding;" for rows which are labelled as exon, CDS, stop_codon, and start_codon."

For each protein, you'll need to annotate exon, CDS, stop_codon, and start_codon. The scripts make use of these positions to translate genomic sequence into amino acid sequence.

Thanks, Pauline

aamirwkhan06 commented 6 years ago

Dear Pauline

I added 'gene_biotype "protein_coding";' to my GTF file for the entries tagged as CDS and exon, but the error is still the same:

done making the fasta sequences
start siftsharp, getting the alignments
cat: ./test_files/cicer_arietinum_genome/fasta/*.fasta: No such file or directory
/home/aamir/sift4g/bin/sift4g -d ./test_files/protein_db/uniref90.fasta -q ./test_files/cicer_arietinum_genome/all_prot.fasta --subst ./test_files/cicer_arietinum_genome/subst --out ./test_files/cicer_arietinum_genome/SIFT_predictions --sub-results
Checking query data and substitutions files

EXITING! No valid queries to process.

The fasta folder is still empty. This time, though, the script writes output to the file "protein_coding_genes.txt".

We do not have the start_codon and stop_codon coordinates for the genes in our GFF/GTF file. Could you please tell me which features (gene, CDS, exon, stop_codon, start_codon, etc.) are mandatory in the GTF file? I noticed many differences between my GTF file and the one for the test human dataset.

Kindly suggest

Thank you

Best regards Aamir

pauline-ng commented 6 years ago

Hi Aamir,

The mandatory features are:

exon, CDS, stop_codon, and start_codon

If you don't have stop_codon and start_codon, the scripts can't tell what's UTR versus CDS.
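To see which transcripts are missing any of these features, the rows can be grouped by transcript. This is a minimal sketch, assuming each feature row carries a `transcript_id "..."` attribute as in standard GTF. `missing_features_by_transcript` is a hypothetical helper and the sample rows are made up.

```python
import re
from collections import defaultdict

REQUIRED = {"exon", "CDS", "start_codon", "stop_codon"}
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def missing_features_by_transcript(lines):
    """Map each transcript_id to the sorted list of required features it lacks."""
    seen = defaultdict(set)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue
        match = TRANSCRIPT_ID.search(cols[8])
        if match:  # rows without a transcript_id (e.g. gene rows) are skipped
            seen[match.group(1)].add(cols[2])
    return {
        tid: sorted(REQUIRED - features)
        for tid, features in seen.items()
        if REQUIRED - features
    }


def row(feature, start, end, transcript):
    """Build one made-up GTF row for the example."""
    attrs = f'gene_id "g1"; transcript_id "{transcript}"; gene_biotype "protein_coding";'
    return f"Ca1\tsrc\t{feature}\t{start}\t{end}\t.\t+\t.\t{attrs}\n"


sample = [
    row("exon", 100, 300, "t1"),
    row("CDS", 150, 297, "t1"),
    row("exon", 500, 900, "t2"),
    row("CDS", 520, 877, "t2"),
    row("start_codon", 520, 522, "t2"),
    row("stop_codon", 878, 880, "t2"),
]

print(missing_features_by_transcript(sample))
```

Transcripts that appear in the result are the ones that do not meet the feature list above.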

Thanks, Pauline

pauline-ng commented 4 years ago

Closing issue.

(User does not have all of the required input values.)

pauline-ng commented 4 years ago

On Dec 5, 2019, the SIFT code was updated so that start/stop codon coordinates are no longer required to build a database.

However, RNAs (such as miRNA, lincRNA) will not be annotated. UTRs are still annotated.

Thanks, Pauline

pauline-ng / SIFT4G_Create_Genomic_DB

Problem creating genomic database for new organism #6
