Closed meren closed 3 years ago
Ahhh ok I think I know how this happened then! There were a couple of bins that improved dramatically if I actually split internally to a contig. Probably a misassembly -- I suspect it's no accident that it occurs within an rRNA operon!
Yes, this happens only when (1) a gene is split between two splits, and (2) only a subset of those splits is described in a bin (that's why it took so long to discover it, and I still have to work out the extent of the second rule).
Placing the splits of a single contig in distinct bins is a legal thing to do in anvi'o. In fact the purpose of 'soft splits' is exactly that: so one can set the split size small and break contigs if there are chimeras or regions that they wish to leave out for any reason.
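To make the two conditions above concrete, here is a minimal Python sketch. It is not anvi'o code: the function name `genes_partially_in_bin`, the tuple layout, and the toy coordinates are all hypothetical, and it only illustrates how a gene can span two splits of which just one belongs to a bin.

```python
def genes_partially_in_bin(genes, splits, bin_splits):
    """Return IDs of genes that overlap at least one split inside the bin
    and at least one split outside of it.

    genes:      {gene_id: (contig, start, stop)}, stop exclusive
    splits:     {split_name: (contig, start, stop)}, stop exclusive
    bin_splits: set of split names that belong to the bin
    """
    partial = []
    for gene_id, (g_contig, g_start, g_stop) in sorted(genes.items()):
        # Names of all splits this gene overlaps on its own contig.
        overlapping = {
            name
            for name, (s_contig, s_start, s_stop) in splits.items()
            if s_contig == g_contig and g_start < s_stop and s_start < g_stop
        }
        inside = overlapping & bin_splits
        outside = overlapping - bin_splits
        if inside and outside:
            partial.append(gene_id)
    return partial


# Toy data (hypothetical): one contig cut into two splits at position 1000.
splits = {
    "c1_split_00001": ("c1", 0, 1000),
    "c1_split_00002": ("c1", 1000, 2000),
}
genes = {
    1: ("c1", 100, 400),    # entirely in split_00001
    2: ("c1", 900, 1300),   # crosses the split boundary
    3: ("c1", 1500, 1800),  # entirely in split_00002
}

# Only the first split is binned, so gene 2 is only partially in the bin.
print(genes_partially_in_bin(genes, splits, {"c1_split_00001"}))  # -> [2]
```

If both splits are in the bin, the same call returns an empty list, which matches the observation that the problem only appears when a subset of the splits is binned.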
I will work on this and make sure it is resolved before v7. Thank you for your patience, Jon.
This feels like the same bug, but on my next bin refining venture, I ended up with the following error:
Contigs DB ...................................: 02_CONTIGS/Ppolionotus_5F-contigs.db
Profile DB ...................................: None
Metagenome mode ..............................: False
[12 Dec 20 12:09:27 Summarizing 1 of 214: 'Bin_101'] Sorting out functions ... ETA: None
Traceback (most recent call last):
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 123, in <module>
    main(args)
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 69, in main
    summary.process()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 889, in process
    self.summary['collection'][bin_id] = bin.create()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1326, in create
    self.store_genes_basic_info()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1513, in store_genes_basic_info
    d[gene_callers_id][header] = self.summary.genes_in_contigs_dict[gene_callers_id][header]
KeyError: 1293821
So it's running into problems with a gene call here, and deleting HMMs isn't sufficient to make it work! When I export gene calls, sure enough there is a gap in my ID list covering that range.
Is there an easy way to delete the gene calls table from a contigs db?
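The gap in the gene call ID list mentioned above can be located programmatically. The following is a minimal sketch, assuming the IDs are integers that are normally contiguous; `find_id_gaps` is a hypothetical helper and the example IDs are made up (real ones would come from the exported gene calls).

```python
def find_id_gaps(ids):
    """Return (first_missing, last_missing) ranges absent from a list of integer IDs."""
    ordered = sorted(set(ids))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            # Everything strictly between two neighbouring IDs is missing.
            gaps.append((previous + 1, current - 1))
    return gaps


# Made-up IDs for illustration only.
example_ids = [10, 11, 12, 15, 16, 20]
print(find_id_gaps(example_ids))  # -> [(13, 14), (17, 19)]
```

If the ID reported in the `KeyError` falls inside one of the returned ranges, that would be consistent with the gene call being absent from the exported table.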
These split genes are exclusively going to be associated with HMMs that are of type Ribosomal RNA :/
Can you please also remove the following ones to make sure they are out of your way?
anvi-delete-hmms -c CONTIGS.db --hmm-source Ribosomal_RNA_28S
anvi-delete-hmms -c CONTIGS.db --hmm-source Ribosomal_RNA_5S
anvi-delete-hmms -c CONTIGS.db --hmm-source Ribosomal_RNA_16S
anvi-delete-hmms -c CONTIGS.db --hmm-source Ribosomal_RNA_12S
anvi-delete-hmms -c CONTIGS.db --hmm-source Ribosomal_RNA_18S
If this doesn't work I will offer my resignation from science.
I'm so glad there is no one able to accept your offer of resignation, because I already tried deleting the HMMs! Didn't work. :(
Anything below offer a hint?
(/workdir/lab_envs/anvio-main) 🐭 cbsumoeller:anvio $ anvi-delete-hmms -c 02_CONTIGS/Ppolionotus_5F-contigs.db --hmm-source Ribosomal_RNA_18S
WARNING
===============================================
The HMM tables in your contigs databse is empty. Now anvi'o will quit and go
back to sleep.
(/workdir/lab_envs/anvio-main) 🐭 cbsumoeller:anvio $ anvi-summarize -p 06_MERGED/Ppolionotus_5F/PROFILE.db -c 02_CONTIGS/Ppolionotus_5F-contigs.db -C concoct -o 07_SUMMARY/Ppolionotus_5F/concoct-refine2
Contigs DB ...................................: Initialized: 02_CONTIGS/Ppolionotus_5F-contigs.db (v. 20)
WARNING
===============================================
ProfileSuperClass found a collection focus, which means it will be initialized
using only the splits in the profile database that are affiliated with the
collection concoct and all bins it describes.
Auxiliary Data ...............................: Found: 06_MERGED/Ppolionotus_5F/AUXILIARY-DATA.db (v. 2)
Profile Super ................................: Initialized with 24731 of 44762 splits: 06_MERGED/Ppolionotus_5F/PROFILE.db (v. 35)
THE MORE YOU KNOW 🌈
===============================================
Someone asked the Contigs Superclass to initialize only a subset of contig
sequences. Usually this is a good thing and means that some good code somewhere
is looking after you. Just FYI, this class will only know about 4263 contig
sequences instead of all the things in the database.
* FYI: A subset of split sequences are being initialized (24731 of 45440 the
contigs database knows about, to be precise). Nothing to worry about. Probably.
WARNING
===============================================
Things are not quite OK. It seems 3 of the domains that are known to the
classifier anvi'o uses to predict domains for completion estimation are missing
from your contigs database. This means, you didn't run the program `anvi-run-
hmms` with default parameters, or you removed some essential SCG domains from it
later. Or you did something else. Who knows. Here is the list of domains that
are making us upset here: "". We hope you are happy. If you want to get rid of
this warning you can run `anvi-run-hmms` on this your contigs database whenever
it is convenient to you, so anvi'o can make sure you have everything in the
right place.
WARNING
===============================================
OK. We have a VERY interesting problem. You have all the SCG domains necessary
to run the predictor covered in your contigs database, however, 3 HMM sources
that are used during the training of the domain predictor does not seem to occur
in your contigs database :/ Here is the list of HMM sources that are making us
upset here: "Protista_83, Bacteria_71, Archaea_76". This most likely means you
are using a new version of anvi'o with older single-copy core gene sources, or
you are exploring new single-copy core gene sources to see how they behave.
That's all good and very exciting, but unfortunately anvi'o will not be able to
predict domains due to this incompatibility here. You could solve this problem
by running `anvi-run-hmms` on your contigs database, but you can also live
without solving it as anvi'o will continue running by not utilizing domain-
specific HMMs for completion/redundancy estimates, but giving you all the
results all at once.
WARNING
===============================================
The SCG taxonomy database on your computer has a different version (v95.0) than
the SCG taxonomy information stored in your contigs database (v89). This is not
a problem and things will most likely continue to work fine, but we wanted to
let you know. You can get rid of this warning by re-running `anvi-run-scg-
taxonomy` on your database.
Contigs DB ...................................: 02_CONTIGS/Ppolionotus_5F-contigs.db
Profile DB ...................................: None
Metagenome mode ..............................: False
[12 Dec 20 12:35:26 Summarizing 1 of 214: 'Bin_101'] Sorting out functions ... ETA: None
Traceback (most recent call last):
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 123, in <module>
    main(args)
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 69, in main
    summary.process()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 889, in process
    self.summary['collection'][bin_id] = bin.create()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1326, in create
    self.store_genes_basic_info()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1513, in store_genes_basic_info
    d[gene_callers_id][header] = self.summary.genes_in_contigs_dict[gene_callers_id][header]
KeyError: 1293821
(/workdir/lab_envs/anvio-main) 🐭 cbsumoeller:anvio $
Ah 🤦
These are still the same files you sent me, right, Jon?
I will look into this today or tomorrow and will let you know once the main branch can summarize your files properly.
Interesting, I can't reproduce this one. Can you please run these commands and let me know if they work for you first:
anvi-split -p PROFILE.db -c CONTIGS.db -C concoct -b Bin_101 -o SPLIT
cd SPLIT/Bin_101/
anvi-summarize -p PROFILE.db -c CONTIGS.db -C DEFAULT -o SUMMARY
Then you can remove the SPLIT directory:
cd ../..
rm -rf SPLIT
This is actually a new contigs and profile database, from a separate set of samples altogether!
Same issue with the split bin:
(/workdir/lab_envs/anvio-main) 🐭 cbsumoeller:Bin_101 $ anvi-summarize -p PROFILE.db -c CONTIGS.db -C DEFAULT -o SUMMARY
Contigs DB ...................................: Initialized: CONTIGS.db (v. 20)
WARNING
===============================================
ProfileSuperClass found a collection focus, which means it will be initialized
using only the splits in the profile database that are affiliated with the
collection DEFAULT and all bins it describes.
Auxiliary Data ...............................: Found: AUXILIARY-DATA.db (v. 2)
Profile Super ................................: Initialized with 557 of 557 splits: PROFILE.db (v. 35)
THE MORE YOU KNOW 🌈
===============================================
Someone asked the Contigs Superclass to initialize only a subset of contig
sequences. Usually this is a good thing and means that some good code somewhere
is looking after you. Just FYI, this class will only know about 411 contig
sequences instead of all the things in the database.
* FYI: A subset of split sequences are being initialized (557 of 557 the contigs
database knows about, to be precise). Nothing to worry about. Probably.
WARNING
===============================================
Things are not quite OK. It seems 3 of the domains that are known to the
classifier anvi'o uses to predict domains for completion estimation are missing
from your contigs database. This means, you didn't run the program `anvi-run-
hmms` with default parameters, or you removed some essential SCG domains from it
later. Or you did something else. Who knows. Here is the list of domains that
are making us upset here: "". We hope you are happy. If you want to get rid of
this warning you can run `anvi-run-hmms` on this your contigs database whenever
it is convenient to you, so anvi'o can make sure you have everything in the
right place.
WARNING
===============================================
OK. We have a VERY interesting problem. You have all the SCG domains necessary
to run the predictor covered in your contigs database, however, 3 HMM sources
that are used during the training of the domain predictor does not seem to occur
in your contigs database :/ Here is the list of HMM sources that are making us
upset here: "Protista_83, Bacteria_71, Archaea_76". This most likely means you
are using a new version of anvi'o with older single-copy core gene sources, or
you are exploring new single-copy core gene sources to see how they behave.
That's all good and very exciting, but unfortunately anvi'o will not be able to
predict domains due to this incompatibility here. You could solve this problem
by running `anvi-run-hmms` on your contigs database, but you can also live
without solving it as anvi'o will continue running by not utilizing domain-
specific HMMs for completion/redundancy estimates, but giving you all the
results all at once.
WARNING
===============================================
The SCG taxonomy database on your computer has a different version (v95.0) than
the SCG taxonomy information stored in your contigs database (v89). This is not
a problem and things will most likely continue to work fine, but we wanted to
let you know. You can get rid of this warning by re-running `anvi-run-scg-
taxonomy` on your database.
Contigs DB ...................................: CONTIGS.db
Profile DB ...................................: None
Metagenome mode ..............................: False
[12 Dec 20 13:02:04 Summarizing 1 of 1: 'ALL_SPLITS'] Sorting out functions ... ETA: 0s
Traceback (most recent call last):
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 123, in <module>
    main(args)
  File "/home/jgs286/git_sw/anvio/bin/anvi-summarize", line 69, in main
    summary.process()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 889, in process
    self.summary['collection'][bin_id] = bin.create()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1326, in create
    self.store_genes_basic_info()
  File "/home/jgs286/git_sw/anvio/anvio/summarizer.py", line 1513, in store_genes_basic_info
    d[gene_callers_id][header] = self.summary.genes_in_contigs_dict[gene_callers_id][header]
KeyError: 1293816
(/workdir/lab_envs/anvio-main) 🐭 cbsumoeller:Bin_101 $
I see. This looks like an entirely different problem. A gene caller id (1293816) not found in genes_in_contigs_dict is unheard of :) I am not sure how that happened. The only possible way for that to happen is ... well, I really have no clue. I will first focus on the first problem, and I will perhaps ask for your help to get my hands on this new contigs db to look into the other.
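For narrowing down which gene caller ids are affected, a defensive lookup can collect the missing ids instead of stopping at the first `KeyError`. This is only a sketch under stated assumptions: `collect_gene_info` is a hypothetical helper, the dictionary layout is a toy stand-in for genes_in_contigs_dict, and it is not the fix that went into anvi'o.

```python
def collect_gene_info(gene_callers_ids, genes_in_contigs_dict, headers):
    """Copy selected fields for each gene call, recording IDs that are missing
    from genes_in_contigs_dict instead of raising KeyError."""
    info, missing = {}, []
    for gene_callers_id in gene_callers_ids:
        entry = genes_in_contigs_dict.get(gene_callers_id)
        if entry is None:
            missing.append(gene_callers_id)
            continue
        info[gene_callers_id] = {header: entry[header] for header in headers}
    return info, missing


# Toy stand-in for the real dictionary (hypothetical values).
genes_in_contigs_dict = {1: {"contig": "c1", "start": 0, "stop": 300}}

info, missing = collect_gene_info(
    [1, 1293816], genes_in_contigs_dict, ["contig", "start", "stop"]
)
print(info)     # -> {1: {'contig': 'c1', 'start': 0, 'stop': 300}}
print(missing)  # -> [1293816]
```

The list of missing ids gives a complete picture in one pass, which is more useful for diagnosis than a traceback that names only the first offending id.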
You bet, and thanks again! Sorry for causing so much trouble!
No, please, this is golden (to address some critical bugs), and I am very thankful for your patience :)
Hey @tanaes, if you pull from the main branch, the summary of the first dataset you've sent me should be working without any issues with HMMs.
Can you please let me know if it is the case?
Yay!!! It totally worked! Thanks a million!
Thank you for your help to figure this one out :)
Howdy! I realize this issue is closed but I wanted to add that I just had a somewhat similar error running anvi-summarize on bin collections.
Traceback (most recent call last):
  File "/home/scottjj/github/anvio/bin/anvi-summarize", line 122, in <module>
    main(args)
  File "/home/scottjj/github/anvio/bin/anvi-summarize", line 68, in main
    summary.process()
  File "/home/scottjj/github/anvio/anvio/summarizer.py", line 864, in process
    self.summary['collection'][bin_id] = bin.create()
  File "/home/scottjj/github/anvio/anvio/summarizer.py", line 1289, in create
    self.store_sequences_for_hmm_hits()
  File "/home/scottjj/github/anvio/anvio/summarizer.py", line 1556, in store_sequences_for_hmm_hits
    hmm_sequences_dict = s.get_sequences_dict_for_hmm_hits_in_splits({self.bin_id: self.split_names})
  File "/home/scottjj/github/anvio/anvio/hmmops.py", line 468, in get_sequences_dict_for_hmm_hits_in_splits
    gene_call = self.genes_in_contigs[gene_callers_id]
I ran through some of the tests described above but nothing stood out as an issue. When I originally ran anvi-run-hmms I added the flag --add-to-functions-table. So I removed all HMMs using anvi-delete-hmms and anvi-delete-functions, then reran anvi-run-hmms without --add-to-functions-table, and then anvi-summarize worked great. I haven't tried removing the HMMs, adding them back in with --add-to-functions-table, and then trying to summarize. Maybe something got wonky with all of my messing around...
Hey @jarrodscott, thank you for mentioning this. It is crazy that there is still a sinister bug there. Do you happen to have the files that led to this error, and would you be willing to privately share them with me along with the exact command line you're using to get this error so I can try to find out more about its roots? :)
Hello @meren! Of course. I will send you a dropbox link to the contigs.db and profile.db. In the meantime, here is what I did.
(json file attached)
anvi-run-workflow -w metagenomics -c default_mg.json --additional-params --jobs 28 --resources nodes=28 --keep-going --rerun-incomplete --unlock
anvi-run-workflow -w metagenomics -c default_mg.json --additional-params --jobs 28 --resources nodes=28 --keep-going --rerun-incomplete
Now, for the original workflow run I had anvi-run-hmms configured like so:
"anvi_run_hmms": {
"run": true,
"threads": 14,
"--also-scan-trnas": true,
"--installed-hmm-profile": "",
"--hmm-profile-dir": "",
"--add-to-functions-table": ""
},
After the workflow finished I thought, wouldn't it be nice to add the HMM hits to the functions table, so I changed the config file to set "--add-to-functions-table": true. So I ran the following:
anvi-delete-hmms -c PAN-contigs.db
anvi-run-hmms -c PAN-contigs.db --add-to-functions-table
After automatic binning I ran
for collection in `cat collections.txt`
do
anvi-summarize -p 06_MERGED/PAN/PROFILE.db -c PAN-contigs.db -C $collection -o 09_AUTO_BINNING_SUMMARY/$collection --cog-data-dir /pool/genomics/stri_istmobiome/dbs/cog_db/
done
collections.txt is just a list of collections. This is where the error popped up. So I ran these two commands:
anvi-delete-hmms -c PAN-contigs.db
anvi-delete-functions -c 03_CONTIGS/PAN-contigs.db --annotation-sources Transfer_RNAs,Archaea_76,Bacteria_71,Protista_83,Ribosomal_RNA_12S,Ribosomal_RNA_16S,Ribosomal_RNA_18S,Ribosomal_RNA_23S,Ribosomal_RNA_28S,Ribosomal_RNA_5S
And then I reran anvi-run-hmms -c PAN-contigs.db without the --add-to-functions-table flag, and anvi-summarize worked without issue. Apologies for how convoluted this is :)
Thank you for this explanation, @jarrodscott. Do you happen to remember which collection gave you the error? There are 13 collections in your merged profile-db, and I want to avoid summarizing each one of them if possible :p For rapid testing I will identify the offending bin in a collection, anvi-split it out, and then see if there is any fix I can offer to address this issue. So knowing the collection would be a good start :)
Oh, wait. I just realized I need to first run
anvi-run-hmms -c PAN-contigs.db --add-to-functions-table
👍
Hey @meren. I pared down the number of collections in the new PROFILE.db I sent along with the new CONTIGS.db. I was able to reproduce the error with these databases. It looks like summarize goes through the first few bins of the CONCOCT collection, then fails, processes all of the METABAT2 bins, and then fails on the remaining collections.
Jon Sanders ran into the following issue during anvi-summarize and sent the following traceback to anvi'o Slack:
He mentioned the following:
Switching to the main branch did not solve the problem, either. He kindly sent the files through e-mail for me to reproduce this bug. I will document my journey with these data here.
I am now looking at these databases, and while I can see at least one thing that is quite wrong here, I can’t figure out how it could have happened without destroying everything else: the information in the collections_info table is incompatible with the actual data in the concoct collection. When I run the following command,
this is the output I get for the collection concoct:
But then when I try to get the actual number of bins in this collection directly from the data table, this is what I learn:
159 != 103. It seems none of the refined bins were updated in the info table. But it is an easy fix, and this actually solves it:
When I run the following command now,
I get this, which makes more sense:
A lot of refined bins here. So it seems the new bins after refinement did not update the collections_info table. But this is another bug all by itself, because correcting it by re-importing the collection did not solve the original problem: Bin_3_32 continues to fail during summary.
First I created a collection only with this bin to better understand the problem:
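As an aside on the 159 != 103 mismatch: the snippet below is not one of the commands used in this investigation, only a minimal sketch of how such an inconsistency between an info table and a data table could be detected. It builds a toy in-memory SQLite database; the table name collections_of_splits and all column names are assumptions modeled on the table names mentioned in this thread, not a verified copy of the anvi'o schema.

```python
import sqlite3


def find_bin_count_mismatches(connection):
    """Compare num_bins recorded in collections_info with the number of
    distinct bin names actually present in collections_of_splits."""
    query = """
        SELECT i.collection_name, i.num_bins, COUNT(DISTINCT s.bin_name)
        FROM collections_info AS i
        LEFT JOIN collections_of_splits AS s
               ON s.collection_name = i.collection_name
        GROUP BY i.collection_name, i.num_bins
        HAVING i.num_bins != COUNT(DISTINCT s.bin_name)
    """
    return connection.execute(query).fetchall()


# Toy in-memory database standing in for a profile database (assumed schema).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE collections_info (collection_name TEXT, num_bins INTEGER)"
)
connection.execute(
    "CREATE TABLE collections_of_splits (collection_name TEXT, split TEXT, bin_name TEXT)"
)
connection.execute("INSERT INTO collections_info VALUES ('concoct', 2)")
connection.executemany(
    "INSERT INTO collections_of_splits VALUES (?, ?, ?)",
    [
        ("concoct", "c1_split_00001", "Bin_1"),
        ("concoct", "c2_split_00001", "Bin_2"),
        ("concoct", "c3_split_00001", "Bin_2_1"),  # refined bin missing from the info table
    ],
)

# Each row is (collection, bins recorded in info table, bins found in data table).
print(find_bin_count_mismatches(connection))  # -> [('concoct', 2, 3)]
```

An empty result would mean the recorded and actual bin counts agree for every collection.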
Everything seems to be working:
Even the stupid HMMs:
I could even split this guy into its own split project:
But running the summary on this collection still explodes, both in the main project files:
AND in the split project files:
But in different locations:
Very interesting bug (and one of the first ones I have seen in dbops during like the last 2 years :) SO THERE IS SOMETHING FUNNY GOING ON HERE for sure.
So what's up with key 392?
It certainly does not show up in the hmm_hits_table where the error is coming from:
But then it IS in the non_singlecopy_gene_hmm_results_dict:
Do you see what is unique for hmm_hit_entry_id 392?
Less than half of this gene is in Ppolionotus_3F_000000007193_split_00105. The remainder is in Ppolionotus_3F_000000007193_split_00106, which is not binned into Bin_3_32.
Just for posterity, this is the hmm_hits for Ribosomal_RNA_23S:
And this is the hmm_hits_in_splits for Ribosomal_RNA_23S:
I will continue with this investigation. The first thing I want to test is whether it would have changed anything if this bin included Ppolionotus_3F_000000007193_split_00106 instead of Ppolionotus_3F_000000007193_split_00105.
PS: removing the Ribosomal_RNA_23S solves the issue (obvi) so Jon can move on with his investigation: