merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
439 stars 145 forks source link

[BUG] HMMs that span through multiple splits explodes anvi'o if some splits are excluded during binning / refinement #1593

Closed meren closed 3 years ago

meren commented 3 years ago

Jon Sanders run into the following issue during anvi-summarize and sent the following traceback to anvi'o Slack:

[09 Dec 20 09:32:25 Summarizing 104 of 159: 'Bin_3_32'] Creating the FASTA file ...                                      ETA: 1m18sTraceback (most recent call last):
  File "/workdir/lab_envs/anvio/bin/anvi-summarize", line 118, in <module>
    main(args)
  File "/workdir/lab_envs/anvio/bin/anvi-summarize", line 63, in main
    summary.process()
  File "/workdir/lab_envs/anvio/lib/python3.6/site-packages/anvio/summarizer.py", line 874, in process
    self.summary['collection'][bin_id] = bin.create()
  File "/workdir/lab_envs/anvio/lib/python3.6/site-packages/anvio/summarizer.py", line 1273, in create
    self.store_sequences_for_hmm_hits()
  File "/workdir/lab_envs/anvio/lib/python3.6/site-packages/anvio/summarizer.py", line 1533, in store_sequences_for_hmm_hits
    hmm_sequences_dict = s.get_sequences_dict_for_hmm_hits_in_splits({self.bin_id: self.split_names})
  File "/workdir/lab_envs/anvio/lib/python3.6/site-packages/anvio/hmmops.py", line 336, in get_sequences_dict_for_hmm_hits_in_splits
    hmm_hit = self.hmm_hits[split_entry['hmm_hit_entry_id']]
KeyError: 17702

He mentioned the following:

This goes away if I anvi-delete-hmms and run anvi-summarize again, but comes back if I re-run anvi-run-hmms. It always seems to fail on the same bin. So curious! Any ideas?

Switching to the main branch did not solve the problem, either. He kindly sent the files through e-mail for me to reproduce this bug. I will document my journey with these data here.


I am now looking at these databases, and while I can see at least one thing that is quite wrong here, I can’t figure out how it could have happened without destroying everything else: the information in collections_info table is incompatible with the actual data in the concoct collection. When I run the following command,

anvi-show-collections-and-bins -p PROFILE.db

this is the output I get for the collection concoct:

Collection: "concoct"
===============================================
Collection ID ................................: concoct
Number of bins ...............................: 103
Number of splits described ...................: 25,506
Bin names ....................................: Bin_0, Bin_1, Bin_10, Bin_100, Bin_101, Bin_102, Bin_11, Bin_12, Bin_13, Bin_14, Bin_15, Bin_16, Bin_17, Bin_18, Bin_19, Bin_2, Bin_20, Bin_21, Bin_22,
                                                Bin_23, Bin_24, Bin_25, Bin_26, Bin_27, Bin_28, Bin_29, Bin_3, Bin_30, Bin_31, Bin_32, Bin_33, Bin_34, Bin_35, Bin_36, Bin_37, Bin_38, Bin_39, Bin_4,
                                                Bin_40, Bin_41, Bin_42, Bin_43, Bin_44, Bin_45, Bin_46, Bin_47, Bin_48, Bin_49, Bin_5, Bin_50, Bin_51, Bin_52, Bin_53, Bin_54, Bin_55, Bin_56, Bin_57,
                                                Bin_58, Bin_59, Bin_6, Bin_60, Bin_61, Bin_62, Bin_63, Bin_64, Bin_65, Bin_66, Bin_67, Bin_68, Bin_69, Bin_7, Bin_70, Bin_71, Bin_72, Bin_73, Bin_74,
                                                Bin_75, Bin_76, Bin_77, Bin_78, Bin_79, Bin_8, Bin_80, Bin_81, Bin_82, Bin_83, Bin_84, Bin_85, Bin_86, Bin_87, Bin_88, Bin_89, Bin_9, Bin_90, Bin_91,
                                                Bin_92, Bin_93, Bin_94, Bin_95, Bin_96, Bin_97, Bin_98, Bin_99

But then when I try to get the actual number of bins in this collection directly from the data table, this is what I learn:

sqlite3 PROFILE.db "select count(distinct(bin_name)) from collections_of_splits where collection_name='concoct'"
159

159 != 103. It seems, none of the refined bins were updated in the info table. But it is an easy fix, and this actually solves it:

anvi-export-collection -p PROFILE.db -C concoct -O concoct-collection
anvi-delete-collection -p PROFILE.db -C concoct
anvi-import-collection concoct-collection.txt -C concoct -c CONTIGS.db -p PROFILE.db

When I run the following command now,

anvi-show-collections-and-bins -p PROFILE.db

I get this, which makes more sense:

Collection: "concoct"
===============================================
Collection ID ................................: concoct
Number of bins ...............................: 159
Number of splits described ...................: 17,310
Bin names ....................................: Bin_0, Bin_10, Bin_100, Bin_101, Bin_102, Bin_11, Bin_12, Bin_13, Bin_14_1, Bin_14_2, Bin_14_3, Bin_15, Bin_16, Bin_17, Bin_18, Bin_19, Bin_1_1, Bin_1_2,
                                                Bin_20, Bin_21_1, Bin_21_2, Bin_23_1, Bin_24, Bin_25, Bin_26, Bin_27_13, Bin_28, Bin_29, Bin_2_1, Bin_30, Bin_31, Bin_32_2, Bin_33, Bin_34, Bin_35, Bin_36,
                                                Bin_37, Bin_38_1, Bin_38_2, Bin_39, Bin_3_1, Bin_3_10, Bin_3_13, Bin_3_14, Bin_3_16, Bin_3_18, Bin_3_19, Bin_3_2, Bin_3_20, Bin_3_21, Bin_3_22, Bin_3_23,
                                                Bin_3_24, Bin_3_25, Bin_3_26, Bin_3_27, Bin_3_28, Bin_3_29, Bin_3_3, Bin_3_30, Bin_3_31_1, Bin_3_32, Bin_3_34, Bin_3_36, Bin_3_37, Bin_3_4, Bin_3_5,
                                                Bin_3_6, Bin_3_7, Bin_3_8, Bin_3_9, Bin_40, Bin_41, Bin_42_1, Bin_43, Bin_44, Bin_45_3, Bin_45_4, Bin_46_1, Bin_47, Bin_48, Bin_49_1, Bin_4_1, Bin_5,
                                                Bin_50_1, Bin_51, Bin_52, Bin_53_1, Bin_53_2, Bin_53_3, Bin_53_4, Bin_53_5, Bin_54, Bin_55, Bin_56, Bin_57, Bin_58, Bin_59, Bin_6, Bin_60, Bin_61, Bin_62_1,
                                                Bin_62_2, Bin_62_3, Bin_62_4, Bin_62_5, Bin_63_1, Bin_63_2, Bin_63_3, Bin_64, Bin_65, Bin_66_1, Bin_66_13, Bin_66_4, Bin_66_5, Bin_66_6, Bin_66_7, Bin_67,
                                                Bin_68_1, Bin_68_2, Bin_68_3, Bin_69, Bin_7, Bin_70_1, Bin_71_1, Bin_71_2, Bin_72, Bin_73, Bin_74, Bin_75, Bin_76, Bin_77, Bin_78, Bin_79, Bin_8, Bin_80,
                                                Bin_81, Bin_82, Bin_83_1, Bin_83_2, Bin_83_3, Bin_84_2, Bin_84_4, Bin_85, Bin_86, Bin_87, Bin_88, Bin_89, Bin_9, Bin_90, Bin_91, Bin_92, Bin_93, Bin_94,
                                                Bin_95, Bin_96, Bin_97, Bin_98, Bin_99_1

A lot of refined bins here. So it seems the new bins after refinement did not update the collections_info table. But this is another bug all by itself, because correcting it by re-importing the collection did not solve the original problem: Bin_3_32 continues to fail during summary.

First I created a collection only with this bin to better understand the problem:

grep Bin_3_32 concoct-collection.txt  > bin332.txt
anvi-import-collection bin332.txt -p PROFILE.db -c CONTIGS.db -C bin332

Everything seems to be working:

 :: anvi'o v6 master ::  ~/Downloads/bug >>> anvi-estimate-genome-completeness -p PROFILE.db -c CONTIGS.db -C bin332

Bins in collection "bin332"
===============================================
╒════════════╤══════════╤══════════════╤════════════════╤════════════════╤══════════════╤════════════════╕
│ bin name   │ domain   │   confidence │   % completion │   % redundancy │   num_splits │   total length │
╞════════════╪══════════╪══════════════╪════════════════╪════════════════╪══════════════╪════════════════╡
│ Bin_3_32   │ BACTERIA │          0.8 │           83.1 │           4.23 │          105 │        2109765 │
╘════════════╧══════════╧══════════════╧════════════════╧════════════════╧══════════════╧════════════════╛

* The 'domain' shown in this table is the domain anvi'o predicted for your contigs
in a given bin with the amount of confidence for that call in the 'domain'
column. If the domain is 'mixed', it means it is very likely you have contigs in
your bin that spans accross multiple domains of life (maybe it is worth a Nobel,
but more likely it is just garbage). If the domain is 'blank', it means anvi'o
did not find enough signal from single-copy core genes to determine what domain
it should use to estimate the completion and redundancy of this bin accurately.
You can get much more information about these estimations by running the same
command line with the additinal flag `--debug`.

 :: anvi'o v6 master ::  ~/Downloads/bug >>> anvi-estimate-scg-taxonomy -p PROFILE.db -c CONTIGS.db -C bin332

WARNING
===============================================
The SCG taxonomy database on your computer has a different version (v95.0) than
the SCG taxonomy information stored in your contigs database (v89). This is not
a problem and things will most likely continue to work fine, but we wanted to
let you know. You can get rid of this warning by re-running `anvi-run-scg-
taxonomy` on your database.

Contigs DB ...................................: CONTIGS.db
Profile DB ...................................: PROFILE.db
Metagenome mode ..............................: False

* 105 split names associated with 1 bins of in collection 'bin332' have been
successfully recovered 🎊

Estimated taxonomy for collection "bin332"
===============================================
╒══════════╤══════════════╤═══════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│          │   total_scgs │   supporting_scgs │ taxonomy                                                                                               │
╞══════════╪══════════════╪═══════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Bin_3_32 │           16 │                14 │ Bacteria / Bacteroidota / Bacteroidia / Bacteroidales / Muribaculaceae / CAG-873 / CAG-873 sp001689665 │
╘══════════╧══════════════╧═══════════════════╧════════════════════════════════════════════════════════════════════════════════════════════════════════╛

Even the stupid HMMs:

anvi-get-sequences-for-hmm-hits -p PROFILE.db -c CONTIGS.db -C bin332 -b Bin_3_32 -o HMMs.fa
Init .........................................: 105 splits in 1 bin(s)
Hits .........................................: 102 hits for 9 source(s)
Mode .........................................: DNA sequences
Genes are concatenated .......................: False
Output .......................................: HMMs.fa

I could even split this guy into its own split project:

anvi-split -p PROFILE.db -c CONTIGS.db -C bin332 -b Bin_3_32 -o SPLIT

But running the summary on this collection still explodes, both in the main project files:

anvi-summarize -p PROFILE.db -c CONTIGS.db -C bin332 -o SUMMARY
(...)

Contigs DB ...................................: CONTIGS.db
Profile DB ...................................: None
Metagenome mode ..............................: False
[10 Dec 20 19:18:28 Summarizing 1 of 1: 'Bin_3_32'] Creating the FASTA file ...                                                                                                                   ETA: NoneTraceback (most recent call last):
  File "/Users/meren/github/anvio/bin/anvi-summarize", line 123, in <module>
    main(args)
  File "/Users/meren/github/anvio/bin/anvi-summarize", line 69, in main
    summary.process()
  File "/Users/meren/github/anvio/anvio/summarizer.py", line 889, in process
    self.summary['collection'][bin_id] = bin.create()
  File "/Users/meren/github/anvio/anvio/summarizer.py", line 1314, in create
    self.store_sequences_for_hmm_hits()
  File "/Users/meren/github/anvio/anvio/summarizer.py", line 1574, in store_sequences_for_hmm_hits
    hmm_sequences_dict = s.get_sequences_dict_for_hmm_hits_in_splits({self.bin_id: self.split_names})
  File "/Users/meren/github/anvio/anvio/hmmops.py", line 337, in get_sequences_dict_for_hmm_hits_in_splits
    hmm_hit = self.hmm_hits[split_entry['hmm_hit_entry_id']]
KeyError: 392

AND in the split project files:

cd SPLIT/Bin_3_32/
anvi-summarize -p PROFILE.db -c CONTIGS.db -C DEFAULT -o SUMMARY
(...)

[10 Dec 20 19:24:36 Initializing non-single-copy HMM sources] ...                                                                                                                                          Traceback (most recent call last):
  File "/Users/meren/github/anvio/bin/anvi-summarize", line 123, in <module>
    main(args)
  File "/Users/meren/github/anvio/bin/anvi-summarize", line 69, in main
    summary.process()
  File "/Users/meren/github/anvio/anvio/summarizer.py", line 870, in process
    self.init()
  File "/Users/meren/github/anvio/anvio/summarizer.py", line 776, in init
    self.init_non_singlecopy_gene_hmm_sources()
  File "/Users/meren/github/anvio/anvio/dbops.py", line 454, in init_non_singlecopy_gene_hmm_sources
    hmm_hit = hmm_hits_table[e['hmm_hit_entry_id']]
KeyError: 392

But in different locations:

File "/Users/meren/github/anvio/anvio/hmmops.py", line 337, in get_sequences_dict_for_hmm_hits_in_splits

vs

File "/Users/meren/github/anvio/anvio/dbops.py", line 454, in init_non_singlecopy_gene_hmm_sources

Very interesting bug (and one of the first ones I see in dbops during like the last 2 years :) SO THERE IS SOMETHING FUNNY GOING ON HERE for sure.

So what's up with key 392?

It certainly does not show up in hmm_hits_table where the error is coming from:

{
  "390": {
    "source": "Ribosomal_RNA_23S",
    "gene_unique_identifier": "9dce5ecde7956a2b5aaf12837ef2e0cbc34c8b0f7c37ad0c4125a312",
    "gene_callers_id": 670058,
    "gene_name": "23S_rRNA_arc",
    "gene_hmm_id": "-",
    "e_value": 0
  },
  "391": {
    "source": "Ribosomal_RNA_23S",
    "gene_unique_identifier": "3c7edb2b450de6ae9d0e3d5fc580397e804cba4e540214e321b102ec",
    "gene_callers_id": 670059,
    "gene_name": "23S_rRNA_arc",
    "gene_hmm_id": "-",
    "e_value": 0
  },
  "393": {
    "source": "Ribosomal_RNA_23S",
    "gene_unique_identifier": "bd7cd7622ee6fd246db515baae3df4e739e79fe7905e373df877c1d4",
    "gene_callers_id": 670061,
    "gene_name": "23S_rRNA_arc",
    "gene_hmm_id": "-",
    "e_value": 0
  },
  "11955": {
    "source": "Ribosomal_RNA_16S",
    "gene_unique_identifier": "cf477c3b1d293f5c3ca8c1ea7af5e344849b5d63e08bf1537774f551",
    "gene_callers_id": 670607,
    "gene_name": "16S_rRNA_bac",
    "gene_hmm_id": "RF00177_bac",
    "e_value": 0
  },
  "11956": {
    "source": "Ribosomal_RNA_16S",
    "gene_unique_identifier": "3337bd21faf66da2a7d17425aeff636d785fd2036ab5591db353e41d",
    "gene_callers_id": 670608,
    "gene_name": "16S_rRNA_bac",
    "gene_hmm_id": "RF00177_bac",
    "e_value": 0
  },
  "11957": {
    "source": "Ribosomal_RNA_16S",
    "gene_unique_identifier": "338c6a6fd9fb4b061377548fe53ded244eb44fa6a7d67b0c97c217b2",
    "gene_callers_id": 670609,
    "gene_name": "16S_rRNA_bac",
    "gene_hmm_id": "RF00177_bac",
    "e_value": 0
  },
  "11958": {
    "source": "Ribosomal_RNA_16S",
    "gene_unique_identifier": "5bc8475124c903655d303f17155c10fb5f75024cb3e2361d548d8a05",
    "gene_callers_id": 670610,
    "gene_name": "16S_rRNA_bac",
    "gene_hmm_id": "RF00177_bac",
    "e_value": 0
  }
}

But then it IS in the non_singlecopy_gene_hmm_results_dict:

{
  "1": {
    "hmm_hit_entry_id": 393,
    "split": "Ppolionotus_3F_000000007193_split_00043",
    "percentage_in_split": 100,
    "source": "Ribosomal_RNA_23S"
  },
  "2": {
    "hmm_hit_entry_id": 390,
    "split": "Ppolionotus_3F_000000007193_split_00061",
    "percentage_in_split": 100,
    "source": "Ribosomal_RNA_23S"
  },
  "3": {
    "hmm_hit_entry_id": 391,
    "split": "Ppolionotus_3F_000000007193_split_00084",
    "percentage_in_split": 100,
    "source": "Ribosomal_RNA_23S"
  },
  "4": {
    "hmm_hit_entry_id": 392,
    "split": "Ppolionotus_3F_000000007193_split_00105",
    "percentage_in_split": 45.27449617790132,
    "source": "Ribosomal_RNA_23S"
  },
  "68": {
    "hmm_hit_entry_id": 11958,
    "split": "Ppolionotus_3F_000000007193_split_00043",
    "percentage_in_split": 100,
    "source": "Ribosomal_RNA_16S"
  },
  "69": {
    "hmm_hit_entry_id": 11955,
    "split": "Ppolionotus_3F_000000007193_split_00061",
    "percentage_in_split": 100,
    "source": "Ribosomal_RNA_16S"
  },
  "70": {
    "hmm_hit_entry_id": 11956,
    "split": "Ppolionotus_3F_000000007193_split_00084",
    "percentage_in_split": 100,
    "source": "Ribosomal_RNA_16S"
  },
  "71": {
    "hmm_hit_entry_id": 11957,
    "split": "Ppolionotus_3F_000000007193_split_00105",
    "percentage_in_split": 100,
    "source": "Ribosomal_RNA_16S"
  }

Do you see what is unique for hmm_hit_entry_id 392?

Less than half of this gene is in Ppolionotus_3F_000000007193_split_00105. The remainder is in Ppolionotus_3F_000000007193_split_00106, which is not binned into Bin_3_32.

Just for posterity, this is the hmm_hits for Ribosomal_RNA_23S:

image

And this is the hmm_hits_in_splits for Ribosomal_RNA_23S:

image

I will continue with this investigation. The first thing I want to test is to see whether it would have changed anything if the this bin included Ppolionotus_3F_000000007193_split_00106 instead of Ppolionotus_3F_000000007193_split_00105.


PS: removing the Ribosomal_RNA_23S solves the issue (obvi) so Jon can move on with his investigation:

anvi-delete-hmms -c CONTIGS.db --hmm-source Ribosomal_RNA_23S
tanaes commented 3 years ago

Ahhh ok I think I know how this happened then! There were a couple bins that improved dramatically if I actually split internally to a contig. Probably a misassembly -- i suspect it's no accident that it occurs within an rRNA operon!

meren commented 3 years ago

Ahhh ok I think I know how this happened then! There were a couple bins that improved dramatically if I actually split internally to a contig.

Yes, this happens only when (1) a gene is split between two splits, and (2) only a subset of those splits described in a bin (that's why it took so long to discover it and I still have to work to figure out the extent of the second rule).

Splitting splits in distinct bins is a legal thing to do in anvi'o. In fact the purpose of 'soft splits' is exactly that: so one can set the split size small and break contigs if there are chimera or regions that they wish to leave out for any reason.

I will work on this and make sure it is resolved before v7. Thank you for your patience, Jon.

tanaes commented 3 years ago

This feels like the same bug, but on my next bin refining venture, I ended up with the following error:

Contigs DB ...................................: 02_CONTIGS/Ppolionotus_5F-contigs.db
Profile DB ...................................: None
Metagenome mode ..............................: False
[12 Dec 20 12:09:27 Summarizing 1 of 214: 'Bin_101'] Sorting out functions ...                                                                         ETA: NoneTraceback (most recent call last):
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 123, in <module>
    main(args)
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 69, in main
    summary.process()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 889, in process
    self.summary['collection'][bin_id] = bin.create()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1326, in create
    self.store_genes_basic_info()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1513, in store_genes_basic_info
    d[gene_callers_id][header] = self.summary.genes_in_contigs_dict[gene_callers_id][header]
KeyError: 1293821

So it's running into problems with a gene call here, and deleting HMMs isn't sufficient to make it work! When I export gene calls, sure enough there is a gap in my ID list covering that range.

Is there an easy way to delete the gene calls table from a contigs db?

meren commented 3 years ago

These split genes are exclusively going to be associated with HMMs that are of type Ribosomal RNA :/

Can you please also remove the following ones to make sure they are out of your way?

anvi-delete-hmms -c CONTIGS.db --hmm-soure Ribosomal_RNA_28S
anvi-delete-hmms -c CONTIGS.db --hmm-soure Ribosomal_RNA_5S
anvi-delete-hmms -c CONTIGS.db --hmm-soure Ribosomal_RNA_16S
anvi-delete-hmms -c CONTIGS.db --hmm-soure Ribosomal_RNA_12S
anvi-delete-hmms -c CONTIGS.db --hmm-soure Ribosomal_RNA_18S

If this doesn't work I will offer my resignation from science.

tanaes commented 3 years ago

I'm so glad there is no one able to accept your offer of resignation, because I already tried deleting the HMMs! Didn't work. :(

Anything below offer a hint?

(/workdir/lab_envs/anvio-main) 🐭  cbsumoeller:anvio $ anvi-delete-hmms -c 02_CONTIGS/Ppolionotus_5F-contigs.db --hmm-source Ribosomal_RNA_18S

WARNING
===============================================
The HMM tables in your contigs databse is empty. Now anvi'o will quit and go
back to sleep.

(/workdir/lab_envs/anvio-main) 🐭  cbsumoeller:anvio $anvi-summarize -p 06_MERGED/Ppolionotus_5F/PROFILE.db -c 02_CONTIGS/Ppolionotus_5F-contigs.db -C concoct -o 07_SUMMARY/Ppolionotus_5F/concoct-refine2
Contigs DB ...................................: Initialized: 02_CONTIGS/Ppolionotus_5F-contigs.db (v. 20)                                                        

WARNING
===============================================
ProfileSuperClass found a collection focus, which means it will be initialized
using only the splits in the profile database that are affiliated with the
collection concoct and all bins it describes.

Auxiliary Data ...............................: Found: 06_MERGED/Ppolionotus_5F/AUXILIARY-DATA.db (v. 2)                                                         
Profile Super ................................: Initialized with 24731 of 44762 splits: 06_MERGED/Ppolionotus_5F/PROFILE.db (v. 35)

THE MORE YOU KNOW 🌈
===============================================
Someone asked the Contigs Superclass to initialize only a subset of contig
sequences. Usually this is a good thing and means that some good code somewhere
is looking after you. Just FYI, this class will only know about 4263 contig
sequences instead of all the things in the database.

* FYI: A subset of split sequences are being initialized (24731 of 45440 the
contigs database knows about, to be precise). Nothing to worry about. Probably.

WARNING
===============================================
Things are not quite OK. It seems 3 of the domains that are known to the
classifier anvi'o uses to predict domains for completion estimation are missing
from your contigs database. This means, you didn't run the program `anvi-run-
hmms` with default parameters, or you removed some essential SCG domains from it
later. Or you did something else. Who knows. Here is the list of domains that
are making us upset here: "". We hope you are happy. If you want to get rid of
this warning you can run `anvi-run-hmms` on this your contigs database whenever
it is convenient to you, so anvi'o can make sure you have everything in the
right place.

WARNING
===============================================
OK. We have a VERY interesting problem. You have all the SCG domains necessary
to run the predictor covered in your contigs database, however, 3 HMM sources
that are used during the training of the domain predictor does not seem to occur
in your contigs database :/ Here is the list of HMM sources that are making us
upset here: "Protista_83, Bacteria_71, Archaea_76". This most likely means you
are using a new version of anvi'o with older single-copy core gene sources, or
you are exploring new single-copy core gene sources to see how they behave.
That's all good and very exciting, but unfortunately anvi'o will not be able to
predict domains due to this incompatibility here. You could solve this problem
by running `anvi-run-hmms` on your contigs database, but you can also live
without solving it as anvi'o will continue running by not utilizing domain-
specific HMMs for completion/redundancy estimates, but giving you all the
results all at once.

WARNING
===============================================
The SCG taxonomy database on your computer has a different version (v95.0) than
the SCG taxonomy information stored in your contigs database (v89). This is not
a problem and things will most likely continue to work fine, but we wanted to
let you know. You can get rid of this warning by re-running `anvi-run-scg-
taxonomy` on your database.

Contigs DB ...................................: 02_CONTIGS/Ppolionotus_5F-contigs.db
Profile DB ...................................: None
Metagenome mode ..............................: False
[12 Dec 20 12:35:26 Summarizing 1 of 214: 'Bin_101'] Sorting out functions ...                                                                         ETA: NoneTraceback (most recent call last):
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 123, in <module>
    main(args)
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 69, in main
    summary.process()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 889, in process
    self.summary['collection'][bin_id] = bin.create()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1326, in create
    self.store_genes_basic_info()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1513, in store_genes_basic_info
    d[gene_callers_id][header] = self.summary.genes_in_contigs_dict[gene_callers_id][header]
KeyError: 1293821
(/workdir/lab_envs/anvio-main) 🐭  cbsumoeller:anvio $ 
meren commented 3 years ago

Ah 🤦

These are still the same files you sent me, right, Jon?

I will look into this today or tomorrow and will let you know once the main can summarize your files properly.

meren commented 3 years ago

Interesting, I can't reproduce this one. Can you please run these commands and let me know if they work for you first:

anvi-split -p PROFILE.db -c CONTIGS.db -C concoct -b Bin_101 -o SPLIT
cd SPLIT/Bin_101/
anvi-summarize -p PROFILE.db -c CONTIGS.db -C DEFAULT -o SUMMARY

then you can remove the SPLIT directory:

cd ../..
rm -rf SPLIT
tanaes commented 3 years ago

This is actually a new contigs and profile database, from a separate set of samples altogether!

Same issue with the split bin:

(/workdir/lab_envs/anvio-main) 🐭  cbsumoeller:Bin_101 $ anvi-summarize -p PROFILE.db -c CONTIGS.db -C DEFAULT -o SUMMARY
Contigs DB ...................................: Initialized: CONTIGS.db (v. 20)                                                                                  

WARNING
===============================================
ProfileSuperClass found a collection focus, which means it will be initialized
using only the splits in the profile database that are affiliated with the
collection DEFAULT and all bins it describes.

Auxiliary Data ...............................: Found: AUXILIARY-DATA.db (v. 2)                                                                                  
Profile Super ................................: Initialized with 557 of 557 splits: PROFILE.db (v. 35)

THE MORE YOU KNOW 🌈
===============================================
Someone asked the Contigs Superclass to initialize only a subset of contig
sequences. Usually this is a good thing and means that some good code somewhere
is looking after you. Just FYI, this class will only know about 411 contig
sequences instead of all the things in the database.

* FYI: A subset of split sequences are being initialized (557 of 557 the contigs
database knows about, to be precise). Nothing to worry about. Probably.

WARNING
===============================================
Things are not quite OK. It seems 3 of the domains that are known to the
classifier anvi'o uses to predict domains for completion estimation are missing
from your contigs database. This means, you didn't run the program `anvi-run-
hmms` with default parameters, or you removed some essential SCG domains from it
later. Or you did something else. Who knows. Here is the list of domains that
are making us upset here: "". We hope you are happy. If you want to get rid of
this warning you can run `anvi-run-hmms` on this your contigs database whenever
it is convenient to you, so anvi'o can make sure you have everything in the
right place.

WARNING
===============================================
OK. We have a VERY interesting problem. You have all the SCG domains necessary
to run the predictor covered in your contigs database, however, 3 HMM sources
that are used during the training of the domain predictor does not seem to occur
in your contigs database :/ Here is the list of HMM sources that are making us
upset here: "Protista_83, Bacteria_71, Archaea_76". This most likely means you
are using a new version of anvi'o with older single-copy core gene sources, or
you are exploring new single-copy core gene sources to see how they behave.
That's all good and very exciting, but unfortunately anvi'o will not be able to
predict domains due to this incompatibility here. You could solve this problem
by running `anvi-run-hmms` on your contigs database, but you can also live
without solving it as anvi'o will continue running by not utilizing domain-
specific HMMs for completion/redundancy estimates, but giving you all the
results all at once.

WARNING
===============================================
The SCG taxonomy database on your computer has a different version (v95.0) than
the SCG taxonomy information stored in your contigs database (v89). This is not
a problem and things will most likely continue to work fine, but we wanted to
let you know. You can get rid of this warning by re-running `anvi-run-scg-
taxonomy` on your database.

Contigs DB ...................................: CONTIGS.db
Profile DB ...................................: None
Metagenome mode ..............................: False
[12 Dec 20 13:02:04 Summarizing 1 of 1: 'ALL_SPLITS'] Sorting out functions ...                                                                          ETA: 0sTraceback (most recent call last):
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 123, in <module>
    main(args)
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 69, in main
    summary.process()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 889, in process
    self.summary['collection'][bin_id] = bin.create()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1326, in create
    self.store_genes_basic_info()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1513, in store_genes_basic_info
    d[gene_callers_id][header] = self.summary.genes_in_contigs_dict[gene_callers_id][header]
KeyError: 1293816
(/workdir/lab_envs/anvio-main) 🐭  cbsumoeller:Bin_101 $ 
meren commented 3 years ago

I see. This looks like an entirely different problem. A gene caller id (1293816) not found in genes_in_contigs_dict is unheard of :) I am not sure how did that happen. the only possible way for that to happen is ... well I really have no clue. I will first focus on the first problem, and I will perhaps ask your help to get my hands on this new contigs db to look for the other.

tanaes commented 3 years ago

You bet, and thanks again! Sorry for causing so much trouble!

meren commented 3 years ago

No, please, this is golden (to address some critical bugs), and I am very thankful for your patient :)

meren commented 3 years ago

Hey @tanaes, if you pull from the main branch, the summary of the first dataset you've sent me should be working without any issues with HMMs.

Can you please let me know if it is the case?

tanaes commented 3 years ago

Yay!!! It totally worked! Thanks a million!

meren commented 3 years ago

Thank you for your help to figure this one out :)

jarrodscott commented 1 month ago

Howdy! I realize this issue is closed but I wanted to add that I just had a somewhat similar error running anvi-summarize on bin collections.

Traceback (most recent call last):
  File "/home/scottjj/github/anvio/bin/anvi-summarize", line 122, in <module>
    main(args)
  File "/home/scottjj/github/anvio/bin/anvi-summarize", line 68, in main
    summary.process()
  File "/home/scottjj/github/anvio/anvio/summarizer.py", line 864, in process
    self.summary['collection'][bin_id] = bin.create()
  File "/home/scottjj/github/anvio/anvio/summarizer.py", line 1289, in create
    self.store_sequences_for_hmm_hits()
  File "/home/scottjj/github/anvio/anvio/summarizer.py", line 1556, in store_sequences_for_hmm_hits
    hmm_sequences_dict = s.get_sequences_dict_for_hmm_hits_in_splits({self.bin_id: self.split_names})
  File "/home/scottjj/github/anvio/anvio/hmmops.py", line 468, in get_sequences_dict_for_hmm_hits_in_splits
    gene_call = self.genes_in_contigs[gene_callers_id]

I ran through some of the tests described above but nothing stood out as an issue. When I originally ran anvi-run-hmms I added the flag --add-to-functions-table. So I removed all HMMs using anvi-delete-hmms and anvi-delete-functions and then reran anvi-run-hmms without --add-to-functions-table and then anvi-summarize and it worked great. I haven't tried removing and adding HMMs back in using the --add-to-functions-table and then trying to summarize. Maybe something go wonky with all of my messing around...

meren commented 4 weeks ago

Hey @jarrodscott, thank you for mentioning this. It is crazy that there is still a sinister bug there. Do you happen to have the files that led to this error and would you be willing to privately share them with me along with the exact command line you're usng to get this error so I can try to find out more about its roots? :)

jarrodscott commented 3 weeks ago

Hello @meren ! Of course. I will send you a dropbox link to the contigs.db and profile.db. In the meantime, here is what I did.

  1. Snakemake mg workflow (workflow json file attached)
anvi-run-workflow -w metagenomics -c default_mg.json --additional-params --jobs 28 --resources nodes=28 --keep-going --rerun-incomplete --unlock
anvi-run-workflow -w metagenomics -c default_mg.json --additional-params --jobs 28 --resources nodes=28 --keep-going --rerun-incomplete

Now the original workflow run I had anvi-run-hmm configure like so:

    "anvi_run_hmms": {
        "run": true,
        "threads": 14,
        "--also-scan-trnas": true,
        "--installed-hmm-profile": "",
        "--hmm-profile-dir": "",
        "--add-to-functions-table": ""
    },

After the workflow finish I thought, wouldn't it be nice to add the HMM hits to the functions table, so I changed the config file to set "--add-to-functions-table": true. So I ran the following:

anvi-delete-hmms  -c PAN-contigs.db 
anvi-run-hmms -c PAN-contigs.db --add-to-functions-table

After automatic binning I ran

for collection in `cat collections.txt`

do
    anvi-summarize -p 06_MERGED/PAN/PROFILE.db -c PAN-contigs.db -C $collection -o 09_AUTO_BINNING_SUMMARY/$collection --cog-data-dir /pool/genomics/stri_istmobiome/dbs/cog_db/
done

collections.txt is just a list of collections. This is where the error popped up. So I ran these two commands:

anvi-delete-hmms  -c PAN-contigs.db 
anvi-delete-functions -c 03_CONTIGS/PAN-contigs.db --annotation-sources Transfer_RNAs,Archaea_76,Bacteria_71,Protista_83,Ribosomal_RNA_12S,Ribosomal_RNA_16S,Ribosomal_RNA_18S,Ribosomal_RNA_23S,Ribosomal_RNA_28S,Ribosomal_RNA_5S

And then reran anvi-run-hmms -c PAN-contigs.db without the --add-to-functions-table flag and the anvi-summarize worked without issue. Apologies for how convoluted this is :)

meren commented 3 weeks ago

Thank you for this explanation, @jarrodscott. Do you happen to remember which collection gave you the error? There are 13 collections in your merged profile-db, and I want to avoid summarizing each one of them if possible :p For rapid testing I will identify the offending bin in a collection, anvi-split it out, and then see if there is any fix I can offer to address this issue. So knowing the collection would be a good start :)

meren commented 3 weeks ago

Oh, wait. I just realized I need to first run

anvi-run-hmms -c PAN-contigs.db --add-to-functions-table

👍

jarrodscott commented 3 weeks ago

Hey @meren . I paired down the number of collections in the new PROFILE.db I sent along with the new CONTIGS.db. I was able to reproduce the error with these databases. It looks like summarize goes through the first few bins on the CONCOCT collection, then fails, processes all of the METABAT2 bins, and then fail on the remaining collections.