pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0
22 stars 7 forks source link

How to address poor database statistics #57

Closed andrewSharo closed 2 years ago

andrewSharo commented 2 years ago

Hi Pauline,

Thanks for developing this resource to create SIFT4G databases for any organism. It's quite easy to use. I am annotating a new bird genome, the Greater prairie chicken (Tympanuchus cupido). Unfortunately, my database statistics are quite low. When I use UniRef90 as the protein database, I get the following overall stats:

ALL 23 (4351/18874) 39 (16156279/41401234) 15(2441297/16156279)

I thought that maybe a larger database could help, so I tried using the ncbi non-redundant database, but that actually resulted in even worse statistics:

ALL 19 (3572/18874) 33 (13754937/41401234) 6(890807/13754937)

I started with a gff file, and used gffread to convert to gtf. I did not personally create the gff file, but it was created with Maker based on transcriptome data. I am not sure how to assess the quality of this gff.

Do you have any ideas for how I could improve my overall SIFT database statistics on this organism? When I ran the test cases, I found very high database statistics, so I believe I am running everything correctly. I've pasted my config file below.

Best, Andrew

GENETIC_CODE_TABLE=1 GENETIC_CODE_TABLENAME=Standard MITO_GENETIC_CODE_TABLE=2 MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial

PARENT_DIR=./test_files/Tympanuchus_cupido_full ORG=Typanuchus_cupido ORG_VERSION=satsuma

DBSNP_VCF_FILE=

Running SIFT 4G

SIFT4G_PATH=/redser4/personal/andrew/src/sift4g/bin/sift4g PROTEIN_DB=/redser4/personal/andrew/uniprot/uniref90.fasta

Sub-directories, don't need to change

GENE_DOWNLOAD_DEST=gene-annotation-src CHR_DOWNLOAD_DEST=chr-src LOGFILE=Log.txt ZLOGFILE=Log2.txt FASTA_DIR=fasta SUBST_DIR=subst ALIGN_DIR=SIFT_alignments SIFT_SCORE_DIR=SIFT_predictions SINGLE_REC_BY_CHR_DIR=singleRecords SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores DBSNP_DIR=dbSNP

Doesn't need to change

FASTA_LOG=fasta.log INVALID_LOG=invalid.log PEPTIDE_LOG=peptide.log ENS_PATTERN=ENS SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord

pauline-ng commented 2 years ago

Hi Andrew,

Let's check if the gtf file is properly being converted to protein sequences.

Can you run:

grep "^>" all_prot.fasta | wc -l

What is the # that you get? Is it similar to the number of expected proteins in the genome?

Thanks, Pauline

andrewSharo commented 2 years ago

Hi Pauline,

Thanks for your quick response. After running the command you gave me, I get "18804". This is close to the number of proteins I would expect in a typical bird genome.

Best, Andrew

pauline-ng commented 2 years ago

Andrew,

Hmmm... when you do a

more all_prot.fasta

and do a quick scan-through, these look like legitimate proteins (they don't contain X or in the middle -- where means stop )?

Very strange, this is a chicken and not an obscure organism, I expect your stats to be higher.

Best, Pauline

andrewSharo commented 2 years ago

Hi Pauline,

Yes, it is strange. The greater prairie chicken is a few million years diverged from chicken, but I would still expect there would be many similar sequences between it and chickens.

I had not considered your recommendation before. Several of the proteins do contain two or more Xs in the middle. When I run

cat all_prot.fasta | grep X | wc -l

I get 7630. Scanning through, most of these Xs are in the middle of the protein, not the end. When I run

cat all_prot.fasta | grep "\*" | wc -l

I get 365. Scanning through, all of these * characters are at the beginning of the sequence.

Do you think this could explain the poor stats?

pauline-ng commented 2 years ago

It wouldn't explain the poor stats -- it's not the majority of the protein sequences.

Can you run the sift4g command

sift4g -d $PROTEIN_DB -q all_prot_fasta --subst $SUBST_DIR --out $TMP_DIR --sub-results

and paste the log/messages here

andrewSharo commented 2 years ago

True! I am running the command, and it will take at least 12 hours. I'll update you when it finishes.

andrewSharo commented 2 years ago

It ran faster than I expected. Here are all the output messages:

Checking query data and substitutions files

Searching database for candidate sequences

Using: lambda = 0.243, K = 0.024, H = 0.100 Aligning queries with candidate sequences

Selecting alignments with median threshold: 2.75

Generating SIFT predictions with sequence identity: 100.00%

I could not find any log files.

pauline-ng commented 2 years ago

Can you double check your protein database is complete (size matches what you expect)?

andrewSharo commented 2 years ago

I'm using the uniref90 database. The unzipped fasta file is 63Gb, which is the size I would expect.

andrewSharo commented 2 years ago

Hi Pauline, just bumping my last message in case you missed it.

pauline-ng commented 2 years ago

I'm pretty stumped. The only thing I can think of is run it on chicken as a sanity check.

We've precomputed Gallus gallus here , if your numbers come up similar, then we know it's not the SIFT installation.

andrewSharo commented 2 years ago

Hi Pauline, I have started to create a SIFT4G database for chicken, and so far it is taking much longer than for my organism Tympanuchus cupido. So I suspect it will score chicken fine. I ran BUSCO on my all_prot.fasta file from T. cupido, and the completeness scores were very poor. I suspect the gtf/gff annotation file (generated by a collaborator) is lower quality than I realized. I will try generating a new annotation file, which will take some time. But thank you for your help and I'll re-open if I discover anything useful.

pauline-ng commented 2 years ago

Great, I'm glad it's starting to make sense. Good luck!

sandyplus commented 2 years ago

Dear @pauline-ng and @andrewSharo ,

Finally, I figured out the problem. I think something is wrong with the script: make-single-records-BIOPERL.pl I guess sometimes it made mistakes for producing protein fasta files. I checked some proteins produced by this script. I compared the protein fasta to the one I produced by gffread manually (gffread file.gff3 -g genome.fa -y proteins.fa). I found that some proteins contained a lot of 'X', which were filtered out for analysis. I think this is the reason you got poor database statistics.

However, if I split my genome/gtf into single scaffolds, and run the scripts again, everything is OK. Finally, I improved my statistics from 33% to 97%.

PS is the code I used, hope it helps. Best regards. Sandy

PS:

#!/usr/bin/bash
#author: Yong Shi
#date: 20220601
#a script to run sift4g for a single chromesome
#useage: sh xx.sh HIC_ASM_11

mkdir -p /home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Genomic_DB-master/test_files/Ceyrei
cd $_

#initial variables for new chromesome
scaffold=$1 #HIC_ASM_1"
newfolder="Ceyrei_$scaffold"
newchrom=${scaffold}
newconfig=$newfolder.config.txt  #"Ceyrei_HIC_ASM_1.config.txt"
newlog=$newfolder.220601.log #"Ceyre_HIC_ASM_1.220601.log"

#delete old folders, and make new folders
rm -rf $newfolder
mkdir -p $newfolder;cd $_
mkdir -p chr-src;cd $_

#extract fasta for chromesome
seqkit grep --by-name -p $newchrom /home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Genomic_DB-master/test_files/for_copy/Ceyrei.soft.genome.fa.gz > Ceyrei_${newchrom}.genome.fa
gzip *.fa
cd ..

#subset gtf for chromesome
mkdir -p gene-annotation-src;cd $_
zcat /home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Genomic_DB-master/test_files/for_copy/Ceyrei_genes.gtf.gz |\
    grep -P "$newchrom\t" > Ceyrei_genes.$newchrom.gtf
gzip *.gtf
cd ../../
#make new config
sed "s:SOMEWHERE_TO_PARENT_DIR:/home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Genomic_DB-master/test_files/Ceyrei/$newfolder:g" /home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Geno
mic_DB-master/test_files/for_copy/Ceyrei_config.txt \
    > $newconfig

#check file list
#tree $newfolder

#source env and run sift
source /home/shiyong/software/SIFT4G/scripts_to_build_SIFT_db/env.sh
cd /home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Genomic_DB-master
perl make-SIFT-db-all.pl -config test_files/Ceyrei/$newconfig 2>&1 |tee $newlog