Closed andrewSharo closed 2 years ago
Hi Andrew,
Let's check if the gtf file is properly being converted to protein sequences.
Can you run:
grep "^>" all_prot.fasta | wc -l
What is the # that you get? Is it similar to the number of expected proteins in the genome?
Thanks, Pauline
Hi Pauline,
Thanks for your quick response. After running the command you gave me, I get "18804". This is close to the number of proteins I would expect in a typical bird genome.
Best, Andrew
Andrew,
Hmmm... when you do a
more all_prot.fasta
and do a quick scan-through, these look like legitimate proteins (they don't contain X or in the middle -- where means stop )?
Very strange, this is a chicken and not an obscure organism, I expect your stats to be higher.
Best, Pauline
Hi Pauline,
Yes, it is strange. The greater prairie chicken is a few million years diverged from chicken, but I would still expect there would be many similar sequences between it and chickens.
I had not considered your recommendation before. Several of the proteins do contain two or more Xs in the middle. When I run
cat all_prot.fasta | grep X | wc -l
I get 7630. Scanning through, most of these Xs are in the middle of the protein, not the end. When I run
cat all_prot.fasta | grep "\*" | wc -l
I get 365. Scanning through, all of these * characters are at the beginning of the sequence.
Do you think this could explain the poor stats?
It wouldn't explain the poor stats -- it's not the majority of the protein sequences.
Can you run the sift4g command
sift4g -d $PROTEIN_DB -q all_prot_fasta --subst $SUBST_DIR --out $TMP_DIR --sub-results
and paste the log/messages here
True! I am running the command, and it will take at least 12 hours. I'll update you when it finishes.
It ran faster than I expected. Here are all the output messages:
Checking query data and substitutions files
Searching database for candidate sequences
Using: lambda = 0.243, K = 0.024, H = 0.100 Aligning queries with candidate sequences
Selecting alignments with median threshold: 2.75
Generating SIFT predictions with sequence identity: 100.00%
I could not find any log files.
Can you double check your protein database is complete (size matches what you expect)?
I'm using the uniref90 database. The unzipped fasta file is 63Gb, which is the size I would expect.
Hi Pauline, just bumping my last message in case you missed it.
I'm pretty stumped. The only thing I can think of is run it on chicken as a sanity check.
We've precomputed Gallus gallus here , if your numbers come up similar, then we know it's not the SIFT installation.
Hi Pauline, I have started to create a SIFT4G database for chicken, and so far it is taking much longer than for my organism Tympanuchus cupido. So I suspect it will score chicken fine. I ran BUSCO on my all_prot.fasta file from T. cupido, and the completeness scores were very poor. I suspect the gtf/gff annotation file (generated by a collaborator) is lower quality than I realized. I will try generating a new annotation file, which will take some time. But thank you for your help and I'll re-open if I discover anything useful.
Great, I'm glad it's starting to make sense. Good luck!
Dear @pauline-ng and @andrewSharo ,
Finally, I figured out the problem.
I think something is wrong with the script: make-single-records-BIOPERL.pl
I guess sometimes it made mistakes for producing protein fasta files.
I checked some proteins produced by this script. I compared the protein fasta to the one I produced by gffread manually (gffread file.gff3 -g genome.fa -y proteins.fa
). I found that some proteins contained a lot of 'X', which were filtered out for analysis. I think this is the reason you got poor database statistics.
However, if I split my genome/gtf into single scaffolds, and run the scripts again, everything is OK. Finally, I improved my statistics from 33% to 97%.
PS is the code I used, hope it helps. Best regards. Sandy
PS:
#!/usr/bin/bash
#author: Yong Shi
#date: 20220601
#a script to run sift4g for a single chromesome
#useage: sh xx.sh HIC_ASM_11
mkdir -p /home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Genomic_DB-master/test_files/Ceyrei
cd $_
#initial variables for new chromesome
scaffold=$1 #HIC_ASM_1"
newfolder="Ceyrei_$scaffold"
newchrom=${scaffold}
newconfig=$newfolder.config.txt #"Ceyrei_HIC_ASM_1.config.txt"
newlog=$newfolder.220601.log #"Ceyre_HIC_ASM_1.220601.log"
#delete old folders, and make new folders
rm -rf $newfolder
mkdir -p $newfolder;cd $_
mkdir -p chr-src;cd $_
#extract fasta for chromesome
seqkit grep --by-name -p $newchrom /home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Genomic_DB-master/test_files/for_copy/Ceyrei.soft.genome.fa.gz > Ceyrei_${newchrom}.genome.fa
gzip *.fa
cd ..
#subset gtf for chromesome
mkdir -p gene-annotation-src;cd $_
zcat /home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Genomic_DB-master/test_files/for_copy/Ceyrei_genes.gtf.gz |\
grep -P "$newchrom\t" > Ceyrei_genes.$newchrom.gtf
gzip *.gtf
cd ../../
#make new config
sed "s:SOMEWHERE_TO_PARENT_DIR:/home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Genomic_DB-master/test_files/Ceyrei/$newfolder:g" /home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Geno
mic_DB-master/test_files/for_copy/Ceyrei_config.txt \
> $newconfig
#check file list
#tree $newfolder
#source env and run sift
source /home/shiyong/software/SIFT4G/scripts_to_build_SIFT_db/env.sh
cd /home/shiyong/software/SIFT4G_220529/SIFT4G_Create_Genomic_DB-master
perl make-SIFT-db-all.pl -config test_files/Ceyrei/$newconfig 2>&1 |tee $newlog
Hi Pauline,
Thanks for developing this resource to create SIFT4G databases for any organism. It's quite easy to use. I am annotating a new bird genome, the Greater prairie chicken (Tympanuchus cupido). Unfortunately, my database statistics are quite low. When I use UniRef90 as the protein database, I get the following overall stats:
ALL 23 (4351/18874) 39 (16156279/41401234) 15(2441297/16156279)
I thought that maybe a larger database could help, so I tried using the ncbi non-redundant database, but that actually resulted in even worse statistics:
ALL 19 (3572/18874) 33 (13754937/41401234) 6(890807/13754937)
I started with a gff file, and used gffread to convert to gtf. I did not personally create the gff file, but it was created with Maker based on transcriptome data. I am not sure how to assess the quality of this gff.
Do you have any ideas for how I could improve my overall SIFT database statistics on this organism? When I ran the test cases, I found very high database statistics, so I believe I am running everything correctly. I've pasted my config file below.
Best, Andrew
GENETIC_CODE_TABLE=1 GENETIC_CODE_TABLENAME=Standard MITO_GENETIC_CODE_TABLE=2 MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial
PARENT_DIR=./test_files/Tympanuchus_cupido_full ORG=Typanuchus_cupido ORG_VERSION=satsuma
DBSNP_VCF_FILE=
Running SIFT 4G
SIFT4G_PATH=/redser4/personal/andrew/src/sift4g/bin/sift4g PROTEIN_DB=/redser4/personal/andrew/uniprot/uniref90.fasta
Sub-directories, don't need to change
GENE_DOWNLOAD_DEST=gene-annotation-src CHR_DOWNLOAD_DEST=chr-src LOGFILE=Log.txt ZLOGFILE=Log2.txt FASTA_DIR=fasta SUBST_DIR=subst ALIGN_DIR=SIFT_alignments SIFT_SCORE_DIR=SIFT_predictions SINGLE_REC_BY_CHR_DIR=singleRecords SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores DBSNP_DIR=dbSNP
Doesn't need to change
FASTA_LOG=fasta.log INVALID_LOG=invalid.log PEPTIDE_LOG=peptide.log ENS_PATTERN=ENS SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord