merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
436 stars 144 forks source link

External gene calls not recognized when using anvi-gen-genomes-storage #587

Closed vnsriniv closed 7 years ago

vnsriniv commented 7 years ago

Hi Anvi'o team,

I am trying to perform some comparative genomics (published genomes/MAGs) using anvio. My pipeline is as follows

  1. Download draft genomes/MAGs from various archives.
  2. Reformat fasta file using anvi-script-reformat-fasta
  3. Generate gene calls and annotations using Prokka.
  4. Parse the gff files using gff_parser.py.
  5. Generate contigs database with external gene calls and import annotations using anvi-import-functions.
  6. Generate hmms and blank profiles for each contigs database.
  7. Create a genomes storage database for each genome.

For steps 3-5 I following the Importing functions from Prokka tutorial.

While using anvi-gen-genomes-storage, it is having trouble recognizing gene calls. Is this normal and to be expected with external gene calls and annotations? Any help in this matter would be very much appreciated! I have attached the output for the command below!

Thanks Varun

anvi-gen-genomes-storage -e external_genomes.txt -o MY-GENOMES.h5

WARNING

Good news! Anvi'o found all these functions that are common to all of your genomes and will use them for downstream analyses and is very proud of you: 'Prokka:Prodigal'.

Internal genomes .............................: 0 have been initialized.
[08 Sep 17 10:04:02 Initializing external genomes] working on CAP_HKU2_Mao2014 /home/vnsriniv/virtual-envs/anvio-dev/local/lib/python3.5/site-packages/numpy/core/fromnumeric.py:2909: RuntimeWarning: Mean of empty slice. out=out, **kwargs) /home/vnsriniv/virtual-envs/anvio-dev/local/lib/python3.5/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount) External genomes .............................: 6 found.

The new genomes storage ......................: MY-GENOMES.h5 (v4, signature: 404c1b9f) Number of genomes ............................: 6 (internal: 0, external: 6) Number of gene calls .........................: 0 Number of partial gene calls .................: 0

meren commented 7 years ago

Hi Varun,

I had a chance to work on this today, and I realized that this in fact is not a bug, but you couldn't see what was going on, because the reporting wasn't as clear as it should have been. So I fixed it with the commits above. So now when you have this situation, anvi'o will explain what's up more clearly. Please update your codebase to master if you can, and try again.

But you can re-run your things without updating to master, and everything would work if you simply use the --gene-caller parameter with anvi-gen-genomes-storage program. The gene caller name you will use with that paramter should match the gene caller name in your external gene calls file.

Please let me know if it solves your issue.

Best,

vnsriniv commented 7 years ago

Hi Meren,

Thanks! I updated it to master and it worked great with the --gene-caller parameter! The reporting in the absence of the --gene-caller parameter is also abundantly clear.

Varun

meren commented 7 years ago

Thank you very much for trying! I'm glad it worked.

jtremblay commented 3 years ago

Dear Anvio developpers, I encountered a similar issue with version 7. Basically I've pretty much imported all annotations from our production pipeline. Our pipeline also uses Prodigal v2.6.3 to call genes and these gene calls were imported into our contigs db. anvi-db-info ./contigs.db

DB Info (no touch)
===============================================
Database Path ................................: ./contigs.db
Description ..................................: [Not found, but it's OK]
Type .........................................: contigs
Version ......................................: 20

DB Info (no touch also)
===============================================
project_name .................................: megahit_bwa
contigs_db_hash ..............................: hash41683284
split_length .................................: 20000
kmer_size ....................................: 4
num_contigs ..................................: 1345272
total_length .................................: 1899811878
num_splits ...................................: 1347502
genes_are_called .............................: 1
external_gene_calls ..........................: 1
external_gene_amino_acid_seqs ................: 1
skip_predict_frame ...........................: 1
splits_consider_gene_calls ...................: 1
scg_taxonomy_was_run .........................: 0
scg_taxonomy_database_version ................: None
trna_taxonomy_was_run ........................: 0
trna_taxonomy_database_version ...............: None
creation_date ................................: 1621362887.25042
gene_level_taxonomy_source ...................: default_matrix
gene_function_sources ........................: KEGG_KO,KEGG_pathway,COG,COG_category,COG_name,KEGG_module

* Please remember that it is never a good idea to change these values. But in some
cases it may be absolutely necessary to update something here, and a programmer
may ask you to run this program and do it. But even then, you should be
extremely careful.

AVAILABLE GENE CALLERS
===============================================
* 'prodigal' (2,693,420 gene calls)
* 'barrnap' (1,895 gene calls)
* 'Ribosomal_RNA_28S' (93 gene calls)
* 'Ribosomal_RNA_23S' (344 gene calls)
* 'Ribosomal_RNA_18S' (35 gene calls)
* 'Ribosomal_RNA_16S' (184 gene calls)

AVAILABLE HMM SOURCES
===============================================
* 'Archaea_76' (type: singlecopy; num genes: 76)
* 'Bacteria_71' (type: singlecopy; num genes: 71)
* 'Protista_83' (type: singlecopy; num genes: 83)
* 'Ribosomal_RNA_12S' (type: Ribosomal_RNA_12S; num genes: 1)
* 'Ribosomal_RNA_16S' (type: Ribosomal_RNA_16S; num genes: 3)
* 'Ribosomal_RNA_18S' (type: Ribosomal_RNA_18S; num genes: 1)
* 'Ribosomal_RNA_23S' (type: Ribosomal_RNA_23S; num genes: 2)
* 'Ribosomal_RNA_28S' (type: Ribosomal_RNA_28S; num genes: 1)
* 'Ribosomal_RNA_5S' (type: Ribosomal_RNA_5S; num genes: 5)

However when I do : anvi-gen-genomes-storage --gene-caller prodigal -i ./internal_genomes_test.txt -o DEPTH-GENOMES.db It still tells me the following error msg:

Config Error: None of your genomes seem to have a gene call, which is a typical error you get
              if you are working with contigs databases with external gene calls. You can    
              solve it by looking at the output of the program `anvi-db-info` for a given    
              contigs database in your collection, and use one of the gene caller sources    
              listed in the output using the `--gene-caller` parameter.  

Is this because I used prodigal as a gene caller, but "externally" to anvio, that the --gene-caller argument does not seem to work? Would there be an (easy) way around this? Thanks very much,

meren commented 3 years ago

Hi @jtremblay,

I wonder if we addressed this problem in our active repository. Would you mind trying this after switching to the active branch? Otherwise you can privately send me two example FASTA files with external gene calls for them so I can run anvi-gen-contigs-database and anvi-gen-genomes-storage to see if I can reproduce the problem :/

jtremblay commented 3 years ago

Hi @meren , Thanks for the input - okay I've finally tried with the latest master branch and I now get an error when generating my profiles (anvi-profile). I've regenerated the complete workflow (including the contigs.db file) from scratch.

^M                                                                                ^M^[[48;5;6m^[[38;5;0m^M[05 Oct 21 21:15:35 Loading the contigs DB] ...                                ^[[0m^M    
^M                                                                                ^M^[[48;5;6m^[[38;5;0m^M[05 Oct 21 21:18:45 Loading contig sequences] Reading **ALL** data from ` (...)^[[0m^M    
WARNING
=====================================
Amino acid linkmer frequencies will not be characterized for this profile.

anvio ..............................: 7-dev
profiler_version ...................: 37
sample_id ..........................: s395100
description ........................: None
profile_db .........................: /gpfs/fs1/nrc/eme/jut001/projects/DFO/gulley-haddock/megahit_bwa/anvio/profiles/395100/PROFILE.db
contigs_db .........................: True
contigs_db_hash ....................: hash3f59231d
cmd_line ...........................: /space/project/grdi/eco/bioinfo-tools/nrc/software/anvio/anvio-dev/bin/anvi-profile -T 12 -i contig_abundance/395100/395100.bam -c anvio/contigs.db -o anvio/profiles/395100
merged .............................: False
blank ..............................: False
split_length .......................: 20,000
min_contig_length ..................: 1,000
max_contig_length ..................: 9,223,372,036,854,775,807
min_mean_coverage ..................: 0
clustering_performed ...............: False
min_coverage_for_variability .......: 10
skip_SNV_profiling .................: False
skip_INDEL_profiling ...............: False
profile_SCVs .......................: False
min_percent_identity ...............: None
report_variability_full ............: False

WARNING
=====================================
Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take
into consideration anything below that. If you need to kill this an restart your
analysis with another minimum contig length value, feel free to press CTRL+C.

^M                                                                                ^M^[[48;5;6m^[[38;5;0m^M[05 Oct 21 21:19:02 Init] Reading BAM File                                     ^[[0m^M
output_dir .........................: /gpfs/fs1/nrc/eme/jut001/projects/DFO/gulley-haddock/megahit_bwa/anvio/profiles/395100
num_reads_in_bam ...................: 27,881,769
num_contigs ........................: 1,345,272
num_contigs_after_M ................: 480,000
num_splits .........................: 482,230
total_length .......................: 1,310,007,814
^M                                                                                ^M^[[48;5;6m^[[38;5;0m^M[05 Oct 21 21:20:13 Profiling w/12 threads] initializing threads ... ETA: ∞:∞:∞^[[0m^M
✖ anvi-profile encountered an error after 0:21:36.790592
Traceback (most recent call last):
  File "/space/project/grdi/eco/bioinfo-tools/nrc/software/anvio/anvio-dev/bin/anvi-profile", line 4, in <module>
    __import__('pkg_resources').run_script('anvio==7.dev0', 'anvi-profile')
  File "/fs/isi-nas1/grdi/grdi_eco/bioinfo-tools/nrc/software/python/python-3.6.5/venv_anvio_7.0_dev/lib/python3.6/site-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/fs/isi-nas1/grdi/grdi_eco/bioinfo-tools/nrc/software/python/python-3.6.5/venv_anvio_7.0_dev/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/fs/isi-nas1/grdi/grdi_eco/bioinfo-tools/nrc/software/anvio/anvio-dev/lib/python3.6/site-packages/anvio-7.dev0-py3.6.egg/EGG-INFO/scripts/anvi-profile", line 102, in <module>
    main(args)
  File "/space/project/grdi/eco/bioinfo-tools/nrc/software/anvio/anvio-dev/lib/python3.6/site-packages/anvio-7.dev0-py3.6.egg/anvio/terminal.py", line 875, in wrapper
    program_method(*args, **kwargs)
  File "/fs/isi-nas1/grdi/grdi_eco/bioinfo-tools/nrc/software/anvio/anvio-dev/lib/python3.6/site-packages/anvio-7.dev0-py3.6.egg/EGG-INFO/scripts/anvi-profile", line 39, in main
    profiler.BAMProfiler(args)._run()
  File "/space/project/grdi/eco/bioinfo-tools/nrc/software/anvio/anvio-dev/lib/python3.6/site-packages/anvio-7.dev0-py3.6.egg/anvio/profiler.py", line 520, in _run
    self.profile_multi_thread()
  File "/space/project/grdi/eco/bioinfo-tools/nrc/software/anvio/anvio-dev/lib/python3.6/site-packages/anvio-7.dev0-py3.6.egg/anvio/profiler.py", line 1162, in profile_multi_thread
    self.store_contigs_buffer()
  File "/space/project/grdi/eco/bioinfo-tools/nrc/software/anvio/anvio-dev/lib/python3.6/site-packages/anvio-7.dev0-py3.6.egg/anvio/profiler.py", line 1240, in store_contigs_buffer
    append_mode=True)
  File "/space/project/grdi/eco/bioinfo-tools/nrc/software/anvio/anvio-dev/lib/python3.6/site-packages/anvio-7.dev0-py3.6.egg/anvio/tables/views.py", line 94, in create_new_view
    self.sanity_check(view_data)
  File "/space/project/grdi/eco/bioinfo-tools/nrc/software/anvio/anvio-dev/lib/python3.6/site-packages/anvio-7.dev0-py3.6.egg/anvio/tables/views.py", line 43, in sanity_check
    if len(view_data[0]) != 3:
IndexError: list index out of range

Any ideas? Thank you!

meren commented 3 years ago

Hey @jtremblay,

Looking at the code, I have no idea how can the code end up failing that sanity check. I just generated a contigs database with external gene calls, profiled a BAM file using it with and without multi-threading. Everything worked as expected :/

I am sorry I am not useful, but please let me know if you realize something important and/or if you have specific questions.

jtremblay commented 3 years ago

Well that's odd, I've lowered the num threads to 1 and it seems to work fine so far. Maybe a compute node threading related issue...idk. I'll try with additional -T values and let you know if I ever notice something useful. Thank you for your time!

jtremblay commented 2 years ago

Hi @meren, Well its not the number of threads after all. When I use a subsample of the same dataset it works fine though regardless of the number of threads. So I've finally went back with anvio 7.0 on another cluster (fresh install) and this time it worked fine...(!) - both anvi-profile and anvi-gen-genomes-storage. I just don't know what to say... But thanks for your time!