pcantalupo / annotater

Annotate sequences by BLAST using NCBI taxonomy information
MIT License
6 stars 0 forks source link

getting integer values for the Description column in Report.txt #2

Closed pcantalupo closed 6 years ago

pcantalupo commented 6 years ago

When using BLAST+ => 2.5.0, the Description values in the Report are integer values. When I use BLAST+ <= 2.3.0 (didn't try 2.4), the Descriptions are as I expect (i.e. Tobacco mosaic virus, complete genome)

This stems from a change in how blastdbcmd extracts fasta sequences from the blast database. When this fullgi gi|9626125|ref|NC_001367.1| is used with blastdbcmd, it extracts the following

for BLAST >= 2.5.0

blastdbcmd -db viral.1.1.genomic -entry "gi|9626125|ref|NC_001367.1|" | head -n1
>NC_001367.1 Tobacco mosaic virus, complete genome

for BLAST <= 2.3.0

$ blastdbcmd -db viral.1.1.genomic -entry "gi|9626125|ref|NC_001367.1|" | head -n1
>gi|9626125|ref|NC_001367.1| Tobacco mosaic virus, complete genome

Then when Reann.pm reads the fasta file it uses the fasta identifier, which is an ACC.VER for BLAST+ >= 2.5, to update the %acc hash. With ACC.VER, it is creating a new hash key with the description. This new key (the ACC.VER) is never accessed again because later in the code, the fullgi from the Report file (the Accession field) is used to look up what is expected to be the Description. However, the hash value is the number of times the fullgi was found in the Report file (due to this line)

To test, use sewageseqs16.fa with

BLAST+ 2.3.0 (Descriptions are OK)

$ export PATH=/home/paul/local/usr/local/ncbi-blast-2.3.0+/bin/:$PATH
$ rm -rf annotator
$ Reann.pl -file ../sewageseqs16.fa -config sewageseqs.conf -num_threads 8 -evalue 1e-5
sewageseqs.conf
ann.0.fasta
tblastx -show_gis -num_threads 8 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp" -db viral.1.1.genomic -evalue 1e-5 -query ann.0.fasta -out ann.0.0.tblastx
Report
Taxonomy
        Getting fasta seqs for viral.1.1.genomic
        Skipping LocalTaxonomy
temp file is /tmp/3yjYOMnh2E
$ cut -f 1,7,8 annotator/ann.wTax.report.txt
seqID   accession       desc
All.viralseqs.fa.Contig1000     gi|9626125|ref|NC_001367.1|     Tobacco mosaic virus, complete genome
All.viralseqs.fa.Contig1002     gi|764159063|ref|NC_026582.1|   Uncultured phage WW-nAnB, complete genome
All.viralseqs.fa.Contig1006     gi|20177424|ref|NC_003630.1|    Pepper mild mottle virus, complete genome
All.viralseqs.fa.Contig1007     gi|9630643|ref|NC_001918.1|     Aichi virus, complete genome
All.viralseqs.fa.Contig1008     gi|20177424|ref|NC_003630.1|    Pepper mild mottle virus, complete genome
All.viralseqs.fa.Contig1        gi|1043372815|ref|NC_030457.1|  Circovirus-like genome DCCV-8, complete genome
All.viralseqs.fa.Contig10       gi|20260796|ref|NC_003692.1|    Pariacato virus chromosome RNA2, complete genome
All.viralseqs.fa.Contig100      gi|45476493|ref|NC_005817.1|    Sclerophthora macrospora virus A RNA 1, complete sequence
All.viralseqs.fa.Contig1001     gi|1127298492|ref|NC_032551.1|  Beihai picorna-like virus 77 strain BHBei75652 hypothetical protein 1 and hypothetical protein 2 genes, complete cds
All.viralseqs.fa.Contig1003     gi|20177424|ref|NC_003630.1|    Pepper mild mottle virus, complete genome
All.viralseqs.fa.Contig1004     gi|1127298656|ref|NC_032614.1|  Beihai picorna-like virus 74 strain BHHK130969 hypothetical protein 1 and hypothetical protein 2 genes, complete cds
FTWLCJP02IU3XZ  gi|593779948|ref|NC_023737.1|   Mycobacterium phage Pleione, complete genome
FTWLCJP02IU4PY  gi|203454602|ref|NC_011273.1|   Mycobacterium phage Myrna, complete genome
FTWLCJP02IUB6X  gi|197085614|ref|NC_011142.1|   Iodobacteriophage phiPLPE, complete genome
FTWLCJP02IUGFL
FTWLCJP02IUNP9

BLAST+ 2.6.0 (Descriptions are not ok; integer values)

$ export PATH=/home/paul/local/usr/local/ncbi-blast-2.6.0+/bin/:$PATH
$ rm -rf annotator
$ Reann.pl -file ../sewageseqs16.fa -config sewageseqs.conf -num_threads 8 -evalue 1e-5
sewageseqs.conf
ann.0.fasta
tblastx -show_gis -num_threads 8 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs qcovhsp" -db viral.1.1.genomic -evalue 1e-5 -query ann.0.fasta -out ann.0.0.tblastx
Report
Taxonomy
        Getting fasta seqs for viral.1.1.genomic
        Skipping LocalTaxonomy
temp file is /tmp/JkhItzr1AK
$ cut -f 1,7,8 annotator/ann.wTax.report.txt
seqID   accession       desc
All.viralseqs.fa.Contig1000     gi|9626125|ref|NC_001367.1|     1
All.viralseqs.fa.Contig1002     gi|764159063|ref|NC_026582.1|   1
All.viralseqs.fa.Contig1006     gi|20177424|ref|NC_003630.1|    3
All.viralseqs.fa.Contig1007     gi|9630643|ref|NC_001918.1|     1
All.viralseqs.fa.Contig1008     gi|20177424|ref|NC_003630.1|    3
All.viralseqs.fa.Contig1        gi|1043372815|ref|NC_030457.1|  1
All.viralseqs.fa.Contig10       gi|20260796|ref|NC_003692.1|    1
All.viralseqs.fa.Contig100      gi|45476493|ref|NC_005817.1|    1
All.viralseqs.fa.Contig1001     gi|1127298492|ref|NC_032551.1|  1
All.viralseqs.fa.Contig1003     gi|20177424|ref|NC_003630.1|    3
All.viralseqs.fa.Contig1004     gi|1127298656|ref|NC_032614.1|  1
FTWLCJP02IU3XZ  gi|593779948|ref|NC_023737.1|   1
FTWLCJP02IU4PY  gi|203454602|ref|NC_011273.1|   1
FTWLCJP02IUB6X  gi|197085614|ref|NC_011142.1|   1
FTWLCJP02IUGFL
FTWLCJP02IUNP9
pcantalupo commented 6 years ago

Adding the std keyword fixed the integer value problem in the desc field (for BLAST 2.6.0) but then blastentropy.pl was encountering an error. This was fixed by using the latest viral.1.1.genomic database which only contains ACC.VER fasta ids.