nigyta / dfast_core

DDBJ Fast Annotation and Submission Tool
77 stars 14 forks source link

missing reference in pseudogene detection #17

Closed maximilianpress closed 5 years ago

maximilianpress commented 5 years ago

I am running on Amazon Linux (CentOS) and encountered an error with a specific assembly. I checked other assemblies and they complete ok without this issue on the same instance.

It looks like something is missing from one of the reference databases but it's hard for me to figure out what the issue is. I have installed with conda as laid out in the readme. My understanding was that this includes all necessary databases. However, I also attempted to install the databases manually as outlined in the readme and the problem persisted.

Can you see if anything is wrong here in the log below? Thanks.

$ dfast --genome spades_asms/sample1_1864_contigs.fasta --out spades_annot/sample1_1864_2 --cpu 2
2019/09/03 17:57:00 Running on Python 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0].
2019/09/03 17:57:00 Loading a config file from /home/ec2-user/miniconda3/opt/dfast-1.2.3/dfc/default_config.py
2019/09/03 17:57:00 OS type is Linux.
2019/09/03 17:57:00 Adding /home/ec2-user/miniconda3/opt/dfast-1.2.3/bin/Linux to PATH.
2019/09/03 17:57:00 DFAST pipeline started. (version 1.2.3)
2019/09/03 17:57:00 Results will be generated into 'spades_annot/sample1_1864_2'.
2019/09/03 17:57:00 Loading a genomic fasta file from spades_asms/sample1_1864_contigs.fasta
2019/09/03 17:57:00 Genome source information: organism=, strain=
2019/09/03 17:57:01 The query genome is treated as a draft genome with 27457 sequences.
2019/09/03 17:57:01 Sequences are sorted by length (from longer to shorter).
2019/09/03 17:57:01 Sequences shorter than 200 will be eliminated.
2019/09/03 17:57:01 Sequences will be renamed as sequence00001, sequence00002...
2019/09/03 17:57:01 Locus_tag settings: locus_tag_prefix=LOCUS and step=10.
2019/09/03 17:57:01 Locus_tags are assigned separately to each feature type. e.g. CDS: LOCUS_000xx, rRNA: LOCUS_r000xx, tRNA: LOCUS_t000xx, tmRNA: LOCUS_tm000xx.
2019/09/03 17:57:01 Remove_Partial_Feature is enabled.
2019/09/03 17:57:01 Remove_Overlapping_Feature is enabled. Priority: ['assembly_gap', 'CRISPR', ('tmRNA', 'tRNA', 'rRNA'), 'CDS']
2019/09/03 17:57:01 Initializing structural annotation tools... 
2019/09/03 17:57:01 GAPannotator initialized. (Version 1.0)
2019/09/03 17:57:01 Setting GAPannotator options. {'len_cutoff': 5, 'linkage_evidence': 'paired-ends', 'gap_type': 'within scaffold'}
2019/09/03 17:57:01 Checking MetaGeneAnnotator... 
2019/09/03 17:57:02 MetaGeneAnnotator initialized. (Version 2008/08/19)
2019/09/03 17:57:02 Setting MetaGeneAnnotator options. {'cmd_options': '-s'}
2019/09/03 17:57:02 Checking Aragorn version... 
2019/09/03 17:57:02 Aragorn initialized. (Version 1.2.38)
2019/09/03 17:57:02 Setting Aragorn options. {'gcode': '-gcbact', 'cmd_options': '-l', 'transl_table': 11}
2019/09/03 17:57:02 Checking Barrnap version... 
2019/09/03 17:57:02 Barrnap initialized. (Version 0.9)
2019/09/03 17:57:02 Checking CRT version... 
2019/09/03 17:57:02 CRT initialized. (Version 1.2)
2019/09/03 17:57:02 Setting CRT options. {'jar_file': '/home/ec2-user/miniconda3/opt/dfast-1.2.3/bin/common/CRT1.2-CLI.jar', 'java_options': '', 'cmd_options': ''}
2019/09/03 17:57:02 Checking Java version... 
2019/09/03 17:57:04 Java initialized. (Version 1.8.0_152-release)
2019/09/03 17:57:04 Initializing annotation components... 
2019/09/03 17:57:04 DBsearch initialized.
2019/09/03 17:57:04 Setting DBsearch options. {'skipAnnotatedFeatures': True, 'evalue_cutoff': 1e-06, 'qcov_cutoff': 75, 'scov_cutoff': 75, 'pident_cutoff': 0, 'aligner': 'ghostx', 'aligner_options': {}, 'database': '/home/ec2-user/miniconda3/opt/dfast-1.2.3/db/protein/DFAST-default.ref', 'db_name': ''}
2019/09/03 17:57:04 Checking GHOSTX version... 
2019/09/03 17:57:04 GHOSTX initialized. (Version 1.3.6)
2019/09/03 17:57:13 Reference DB loaded. 404804 sequences. [dbname=DFAST-default, version=1.1.4, contributor=yt, modified=20180612]
2019/09/03 17:57:13 PseudoGeneDetection initialized.
2019/09/03 17:57:13 Setting PseudoGeneDetection options. {'cpu': 1, 'skipAnnotatedFeatures': False, 'extension': 300, 'scov_cutoff': 85, 'transl_table': 11}
2019/09/03 17:57:13 Checking lastdb version... 
2019/09/03 17:57:13 lastdb initialized. (Version 959)
2019/09/03 17:57:13 Setting lastdb options. {'cpu': 1, 'skipAnnotatedFeatures': False, 'extension': 300, 'scov_cutoff': 85, 'transl_table': 11}
2019/09/03 17:57:13 Checking lastal version... 
2019/09/03 17:57:13 lastal initialized. (Version 959)
2019/09/03 17:57:13 Setting lastal options. {'cpu': 1, 'skipAnnotatedFeatures': False, 'extension': 300, 'scov_cutoff': 85, 'transl_table': 11}
2019/09/03 17:57:13 Checking Blastp version... 
2019/09/03 17:57:13 Blastp initialized. (Version 2.5.0)
2019/09/03 17:57:13 HMMscan initialized.
2019/09/03 17:57:13 Setting HMMscan options. {'skipAnnotatedFeatures': True, 'evalue_cutoff': 1e-06, 'database': '/home/ec2-user/miniconda3/opt/dfast-1.2.3/db/hmm/TIGRFAMs_15.0_HMM.LIB', 'db_name': 'TIGR'}
2019/09/03 17:57:13 Checking hmmscan version... 
2019/09/03 17:57:13 hmmscan initialized. (Version 3.2.1)
2019/09/03 17:57:13 Setting hmmscan options. {'skipAnnotatedFeatures': True, 'evalue_cutoff': 1e-06, 'database': '/home/ec2-user/miniconda3/opt/dfast-1.2.3/db/hmm/TIGRFAMs_15.0_HMM.LIB', 'db_name': 'TIGR'}
2019/09/03 17:57:13 CDDsearch initialized.
2019/09/03 17:57:13 Setting CDDsearch options. {'skipAnnotatedFeatures': False, 'evalue_cutoff': 1e-06, 'database': '/home/ec2-user/miniconda3/opt/dfast-1.2.3/db/cdd/Cog', 'rpsbproc_data': '/home/ec2-user/miniconda3/opt/dfast-1.2.3/bin/common/rpsbproc_data'}
2019/09/03 17:57:13 Checking RPSblast version... 
2019/09/03 17:57:14 RPSblast initialized. (Version 2.5.0)
2019/09/03 17:57:14 Setting RPSblast options. {'skipAnnotatedFeatures': False, 'evalue_cutoff': 1e-06, 'database': '/home/ec2-user/miniconda3/opt/dfast-1.2.3/db/cdd/Cog', 'rpsbproc_data': '/home/ec2-user/miniconda3/opt/dfast-1.2.3/bin/common/rpsbproc_data'}
2019/09/03 17:57:14 Checking rpsbproc version... 
2019/09/03 17:57:15 rpsbproc initialized. (Version 0.11)
2019/09/03 17:57:15 Setting rpsbproc options. {'skipAnnotatedFeatures': False, 'evalue_cutoff': 1e-06, 'database': '/home/ec2-user/miniconda3/opt/dfast-1.2.3/db/cdd/Cog', 'rpsbproc_data': '/home/ec2-user/miniconda3/opt/dfast-1.2.3/bin/common/rpsbproc_data'}
2019/09/03 17:57:15 Start structural annotation process using 2 CPUs
2019/09/03 17:57:44 0 assembly_gap features were detected by GAPannotator.
2019/09/03 17:57:49 41533 CDS features were detected by MetaGeneAnnotator.
2019/09/03 17:57:49 Skipped an ambiguous tRNA-??? at sequence00063:20335..20409(-1).
2019/09/03 17:57:49 Skipped an ambiguous tRNA-??? at sequence05162:415..487(1).
2019/09/03 17:57:49 Skipped an ambiguous tRNA-??? at sequence22905:53..142(-1).
2019/09/03 17:57:49 281 tRNA features were detected by Aragorn.
2019/09/03 17:57:49 32 rRNA features were detected by Barrnap.
2019/09/03 17:57:49 8 CRISPR (later will be replaced by repeat_region) features were detected by CRT.
2019/09/03 17:58:07 Removed 29399 partial features.
2019/09/03 17:58:08 Removed 13 overlapping features.
2019/09/03 17:58:08 Start executing functional annotation components.
2019/09/03 17:58:08 DBsearch will be performed using 2 CPUs.
2019/09/03 18:01:38 DBsearch done 2/2.
2019/09/03 18:01:40 DBsearch done 1/2.
2019/09/03 18:01:40 Last format DB will be performed using 1 CPUs.
2019/09/03 18:01:40 Last format DB done 1/1.
2019/09/03 18:01:40 Last alignment will be performed using 1 CPUs.
2019/09/03 18:01:42 Last alignment done 1/1.
2019/09/03 18:01:43 197 CDS features were marked as possible pseudo due to internal stop codons.
2019/09/03 18:01:43 232 CDS features were marked as possible pseudo due to frameshift.
2019/09/03 18:01:43 Internal stop codons in the following 1 CDSs are translated to selenosysteine/pyrrolysine, which will be annotated with 'transl_except'. ['MGA_3213']
2019/09/03 18:01:43 Removed 1 CDSs. ['MGA_3212']
Traceback (most recent call last):
  File "/home/ec2-user/miniconda3/bin/dfast", line 272, in <module>
    pipeline.execute()
  File "/home/ec2-user/miniconda3/opt/dfast-1.2.3/dfc/pipeline.py", line 54, in execute
    self.fa.execute()  # functional annotation
  File "/home/ec2-user/miniconda3/opt/dfast-1.2.3/dfc/functionalAnnotation.py", line 72, in execute
    component.run()
  File "/home/ec2-user/miniconda3/opt/dfast-1.2.3/dfc/components/PseudoGeneDetection.py", line 472, in run
    self.execute_blast(targets=proteins_with_transl_except)
  File "/home/ec2-user/miniconda3/opt/dfast-1.2.3/dfc/components/PseudoGeneDetection.py", line 462, in execute_blast
    _parse_result(blast_result_file)
  File "/home/ec2-user/miniconda3/opt/dfast-1.2.3/dfc/components/PseudoGeneDetection.py", line 432, in _parse_result
    protein = self.references[s_id]
KeyError: 'NP_458584'
nigyta commented 5 years ago

@maximilianpress Thank you for using DFAST.

You are correct. All the necessary databases are automatically downloaded if you install DFAST via conda. Judging from the number of sequences loaded from the reference database (404804), the reference database is properly set up.

Another concern is that the numbers of contigs (27457) and predicted protein-coding genes (41533) seem too large for an assembled bacterial genome. Is it an assembly obtained from metagenomic data? I think DFAST can work with it, but an unexpected error may happen.

Still I don't have any idea why this happened, but if you can share the genome data with me (Please send it to ytanizaw@nig.ac.jp), I will look into it.

Yasuhiro

mnapolitano89 commented 5 years ago

Hi Yasuhiro, I've run into the same problem and have been working to narrow it down, here's what I've got so far:

mnapolitano89 commented 5 years ago

I suspect this is something that blast fmtdb is doing to be helpful, but it should be an easish fix. Not sure why it only occurs in conda installs though, maybe has to do with the databases are are being pulled?

mnapolitano89 commented 5 years ago

This obviously isn't ideal, but replacing protein = self.references[s_id] in PseudoGeneDetection.execute_blast._parse_result with:

try:
    protein = self.references[s_id]
except KeyError:
    for key in self.references.keys():
        if s_id in key:
            protein = self.references[key]

temporarily patches out the problem

maximilianpress commented 5 years ago

@mnapolitano89 thanks for digging more into this!

@nigyta unfortunately the data are proprietary and I cannot share. But I would be interested to know if @mnapolitano89 's information is enough for a fix as I hope to use DFAST again. The assembly is unfortunately an isolate assembled with a metagenomic assembler (I am very aware of the potential problems here!), so it is unsurprising if it is acting weird. But it looks like the issue is something else.

nigyta commented 5 years ago

@mnapolitano89 @maximilianpress Thank you very much for the detailed report.

The error might be due to the version difference of BLAST executables. Although I still cannot reproduce the error in my environment, the following procedure may fix the problem.

First, download BLAST executables (ver2.6) from the DFAST repository.

$ wget https://github.com/nigyta/dfast_core/raw/master/bin/Linux/blastdbcmd
$ wget https://github.com/nigyta/dfast_core/raw/master/bin/Linux/blastp
$ wget https://github.com/nigyta/dfast_core/raw/master/bin/Linux/makeblastdb

Then, after adding execute permission as following

$ chmod a+x blastdbcmd blastp makeblastdb

move them to /home/USER/miniconda3/opt/dfast-X.X.X/bin/Linux/

$ mv blastdbcmd blastp makeblastdb /home/USER/miniconda3/opt/dfast-X.X.X/bin/Linux/

(The destination directory varies depending on where anaconda/miniconda is installed.) If you use Mac, please relace "Linux" in the above commands with "Darwin".

When running DFAST, it inserts /home/USER/miniconda3/opt/dfast-X.X.X/bin/Linux/ at the head of PATH environmetal variable, so ver2.6 executables will be called preferentially.

I hope this helps.

Yasuhiro

mnapolitano89 commented 5 years ago

Moving from v2.5 to v2.6 seems to solve it - is it possible to update the conda recipe to require/install v2.6?

nigyta commented 5 years ago

Yes, but the build test fails for some reason, and I'm still working on it. Unfortunately, I'll be out for next week, so it may take time until I fix this issue.

For now, a good option is to update blast by

conda update -c bioconda blast
nigyta commented 5 years ago

@mnapolitano89 @maximilianpress Sorry for the late response. I have updated the bioconda recipe. Now it requires NCBI-Blast v2.6 or later, so I think this issue will be fixed by conda updata -c bioconda dfast or conda install -c bioconda dfast.

maximilianpress commented 5 years ago

thanks!