Closed xvazquezc closed 7 years ago
Hi Xabi,
Interesting point. To decide which one to show, anvi'o is using completion - redundancy. In your case this value is -50 and -121 for Rinke et al and Campbell et al, respectively, which makes it look like Rinke et al is more suitable to make sense of this bin. Which is clearly not correct :)
I am not sure how to address this one. Do you or others have a suggestion?
Best,
I see...
I guess that completion + redundancy could work even in the most redundant bins, based on my results.
In fact this sounds like a good idea. But I assume most of the stuff you have is bacterial. I am curious whether this approach would work equally well even when you have a mixture of bacterial and archaeal populations.
I have quite a few archaeal bins. I have refined most of my bins and I found two Bacteria-Archaea mixed bins so far. Bin_217:
Source          SCG domain  Percent completeness  Percent redundancy  Completeness + redundancy
Rinke et al     archaea     94.44%                61.11%              155.55%
Campbell et al  bacteria    87.77%                94.96%              182.73%

It is reported as Archaea, but after refining I got 2 bacterial (54% and 76% comp) and 1 archaeal (>90% comp) bins. With the new formula it would be reported as Bacteria.
This one is different: Bin_230
Source          SCG domain  Percent completeness  Percent redundancy  Completeness + redundancy
Rinke et al     archaea     76.54%                99.38%              175.92%
Campbell et al  bacteria    92.81%                81.29%              174.10%

This was reported as Bacteria, but after refining it contained 2 archaeal bins (both >60% comp) and 1 bacterial bin (>90% comp). With the new formula it would have been reported as Archaea by a narrow margin.
So this would work better but these are just two examples...
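The two scoring rules can be checked against the numbers above. This is a minimal sketch in Python, not anvi'o's actual code: the function and variable names are made up for illustration, and only the four completion/redundancy pairs come from this thread.

```python
# Completion and redundancy estimates (percent) quoted in this thread.
BINS = {
    "Bin_217": {"archaea": (94.44, 61.11), "bacteria": (87.77, 94.96)},
    "Bin_230": {"archaea": (76.54, 99.38), "bacteria": (92.81, 81.29)},
}


def pick_domain(estimates, score):
    """Return the domain whose (completion, redundancy) pair scores highest."""
    return max(estimates, key=lambda domain: score(*estimates[domain]))


def minus(completion, redundancy):
    # Rule described at the top of the thread: completion - redundancy.
    return completion - redundancy


def plus(completion, redundancy):
    # Proposed alternative: completion + redundancy.
    return completion + redundancy


for name, estimates in BINS.items():
    print(name, pick_domain(estimates, minus), pick_domain(estimates, plus))
```

For both bins the two rules pick opposite domains: Bin_217 goes from archaea to bacteria, and Bin_230 from bacteria to archaea.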
Hi Xabi,
Very good points. What about reporting as "mixed" bins that are more than 10% redundant using both collections, and displaying the score for the collection with smallest redundancy value?
This way, going from one domain to another would not happen... It would still be a possible problem for bins with very low completion. Maybe we have to be stringent before assigning domain to bins. >50% completion and <10% redundant would be an acceptable solution, from my perspective.
Tom
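One way to read the rule proposed above, as a rough Python sketch. It is not anvi'o code. The function name, the "undefined" fallback label, and the tie-break (highest completion among the collections that pass both thresholds) are assumptions made for illustration; only the 10% and 50% cutoffs come from the comment.

```python
def assign_domain(estimates, min_completion=50.0, max_redundancy=10.0):
    """estimates maps a domain name to (completion, redundancy) in percent.

    Returns (label, domain whose score would be displayed).
    """
    # More than max_redundancy with every collection: call the bin "mixed"
    # and display the collection with the smallest redundancy.
    if all(r > max_redundancy for _, r in estimates.values()):
        shown = min(estimates, key=lambda d: estimates[d][1])
        return "mixed", shown

    # Only name a domain when a collection passes both thresholds.
    confident = [
        d for d, (c, r) in estimates.items()
        if c > min_completion and r < max_redundancy
    ]
    if confident:
        # Assumption: break ties by the highest completion estimate.
        best = max(confident, key=lambda d: estimates[d][0])
        return best, best

    # Assumption: anything else gets no domain label at all.
    return "undefined", None
```

With the Bin_217 numbers quoted earlier in the thread this returns "mixed" and displays the archaeal score, since 61.11% is the smaller redundancy.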
Hi Tom, based on my very limited experience, bins that are very redundant usually produce bins of the same type after refining. So, the predicted domain is right... although the bin is crap... However, the completeness values of Rinke and Campbell are much closer in mixed bins than in single-domain ones... regardless of how good they are. In single-domain bins, either good or bad, the completeness value of one SCG collection is usually about 1/2 to 2/3 of the other, while in mixed bins the values are more similar.
In any case, I think that if it could be done, highlighting the "good bins" (>50%C/<10%R) would be a good idea. Going through a few hundred bins can be a bit of a pain. You could even highlight the "useless bins", given that even refining won't give you anything useful, e.g. <50%C for both SCG collections
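The observations above could be turned into simple flags, sketched here in Python. Nothing in this block is anvi'o code. The function name, the flag labels, and the 0.75 ratio cutoff are assumptions; the cutoff is only chosen to sit above the 1/2 to 2/3 range described for single-domain bins.

```python
def flag_bin(estimates, ratio_cutoff=0.75):
    """estimates maps an SCG collection to (completion, redundancy) in percent."""
    completions = [c for c, _ in estimates.values()]
    flags = []

    # "Good bins": >50% completion and <10% redundancy with some collection.
    if any(c > 50 and r < 10 for c, r in estimates.values()):
        flags.append("good")

    # Below 50% completion with every collection.
    if all(c < 50 for c in completions):
        flags.append("low completion")

    # Completeness values that are close to each other hint at a mixed bin.
    high = max(completions)
    if high > 0 and min(completions) / high > ratio_cutoff:
        flags.append("possibly mixed")

    return flags
```

For Bin_217 the ratio is 87.77 / 94.44, about 0.93, and for Bin_230 it is 76.54 / 92.81, about 0.82, so both would be flagged as possibly mixed under this assumed cutoff.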
Hi, Regarding the useless bins comment, I think that we never know until you refine them (i.e., when below 10% redundant). "Useless" is a bit of a risky term, as many bins do not have a good score but might still be very relevant (eukaryotic, or viral maybe). But it is clear that we should not provide a domain name without a certain level of confidence, which is the case only when bins look good to a certain extent. Let's see what Meren says to that. It could be as simple as providing three categories: bacteria/Archaea/undefined
Thanks again for all the discussion Xabi!
Tom
I will respond to the rest in a bit, but I wanted to make a comment about this one:
In any case, I think that if it could be done, highlighting the "good bins" (>50%C/<10%R) would be a good idea.
You know, there is a quick way to highlight good bins. You should try the program anvi-rename-bins before summary :) See the --call-MAGs flag and other parameters under the "MAG options" section:
$ anvi-rename-bins -h
usage: anvi-rename-bins [-h] -c CONTIGS_DB -p PROFILE_DB
[--collection-to-read COLLECTION_TO_READ]
[--collection-to-write COLLECTION_TO_WRITE]
[--prefix PREFIX] [--report-file REPORT_FILE_PATH]
[--list-collections] [--dry-run] [--call-MAGs]
[--min-completion-for-MAG [0-100]]
[--max-redundancy-for-MAG [0-100]]
[--size-for-MAG 0.1-10 Mbp] [--use-SCG-averages]
[--use-highest-completion-score]
Rename all bins in a given collection (so they have pretty names).
optional arguments:
-h, --help show this help message and exit
DEFAULT INPUTS:
Standard stuff
-c CONTIGS_DB, --contigs-db CONTIGS_DB
Anvi'o contigs database generated by 'anvi-gen-
contigs'
-p PROFILE_DB, --profile-db PROFILE_DB
Anvi'o profile database
--collection-to-read COLLECTION_TO_READ
Collection name to read from. Anvi'o will not
overwrite an existing collection, instead, it will
create a copy of your collection with new bin names.
--collection-to-write COLLECTION_TO_WRITE
The new collection name. Give it a nice, fancy name.
OUTPUT AND TESTING:
a.k.a, sweet parameters of convenience
--prefix PREFIX Prefix for the bin names. Must be a single word,
composed of digits and numbers. The use of the
underscore character is OK, but that's about it (fine,
the use of the dash character is OK, too but no
more!). If the prefix is 'PREFIX', each bin will be
renamed as 'PREFIX_XXX_00001, PREFIX_XXX_00002', and
so on, in the order of percent completion minus
percent redundancy (what we call, 'substantive
completion'). The 'XXX' part will either be 'Bin', or
'MAG depending on other parameters you use. Keep
reading.
--report-file REPORT_FILE_PATH
This file will report each name change event, so you
can trace back the original names of renamed bins
later.
--list-collections Show available collections and exit.
--dry-run When used does NOT update the profile database, just
creates the report file so you can view how things
will be renamed.
MAG OPTIONS:
If you want to call some bins 'MAGs' because you are so cool
--call-MAGs This program by default rename your bins as
'PREFIX_Bin_00001', 'PREFIX_Bin_00002' and so on. If
you use this flag, it will name the ones that meet the
criteria described by MAG-related flags as
'PREFIX_MAG_00001', 'PREFIX_MAG_00002', and so on. The
ones that do not get to be named as MAGs will remain
as bins.
--min-completion-for-MAG [0-100]
If --call-MAGs flag is used, call any bin a 'MAG' if
their completion estimate is above this (the default
is 70), and the redundancy estimate is less than
--max-redundancy-for-MAG.
--max-redundancy-for-MAG [0-100]
If --call-MAGs flag is used, call any bin a 'MAG' if
their redundancy estimate is below this (the default
is 10) and the completion estimate is above --min-
completion-for-MAG.
--size-for-MAG 0.1-10 Mbp
If --call-MAGs flag is used, call any bin a 'MAG' if
their redundancy estimate is less than --max-
redundancy-for-MAG, and the size is larger than this
(the default is 2 Mbp), regarldless of the completion.
SCG OPTIONS:
Options related to single-copy genes. How should SCGs utilized to say what
should be considered a MAG?
--use-SCG-averages If you use this flag, anvi'o will use all avialble
single-copy core gene collections, will average their
independent completion and redundancy estimates, will
select the best matching domain (i.e. arcaheal SCGs vs
bacterial SCG), and use the resulting estimates
numbers for renaming purposes.
--use-highest-completion-score
If you use this flag, instead of `--use-SCG-averages`,
anvi'o will take the SCG collection that estimates the
highest completion for a given bin. This will affect
how MAGs are called. For instance, if you have HMM
hits in your contigs database for both bacterial and
archaeal single-copy gene collections, the completion
score for a given archaeal bin with a bacterial
single-copy genes will be very low. This flag will
select the collection with the highest completion
estimate, and use that instead of taking the average.
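Read literally, the MAG-related flags in the help text above amount to something like the following Python sketch. It is an illustration of the help text, not the program's source. The function names are hypothetical, the strict comparisons follow the wording "above", "below" and "larger than", and the single running counter shared by MAGs and bins is an assumption.

```python
def is_mag(completion, redundancy, size_mbp,
           min_completion=70.0, max_redundancy=10.0, size_for_mag=2.0):
    """Defaults mirror those stated in the help text (70, 10, 2 Mbp)."""
    # Redundancy must be below the cutoff in either case.
    if redundancy >= max_redundancy:
        return False
    # Then either the completion or the size criterion is enough.
    return completion > min_completion or size_mbp > size_for_mag


def rename_bins(bins, prefix):
    """bins maps an old name to (completion, redundancy, size_mbp)."""
    # Order by completion minus redundancy ("substantive completion").
    ordered = sorted(bins, key=lambda n: bins[n][0] - bins[n][1], reverse=True)
    new_names = {}
    for i, old in enumerate(ordered, start=1):
        kind = "MAG" if is_mag(*bins[old]) else "Bin"
        new_names[old] = f"{prefix}_{kind}_{i:05d}"
    return new_names
```

For example, a bin with 60% completion, 5% redundancy and 2.5 Mbp would be called a MAG through the size criterion alone, while a 40% complete, 0.8 Mbp bin would stay a bin.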
Oh! I didn't know about this one...
I tried a --dry-run but it didn't create any report file:
WARNING
===============================================
As per your request, this run will use the completion and redundancy estimate
recovered from the single-copy core gene collection that provides the highest
completion estimate.
Auxiliary Data ...............................: Found: contigs.h5 (v. 1)
Contigs DB ...................................: Initialized: contigs.db (v. 8)
WARNING
===============================================
This was a dry run, which means nothing is updated in the profile database.
Please take a look at the report filen and see whether things worked out the way
you wanted them to. If all looks alright, you will need to run the previous
commandline without the --dry-run flag.
PS: you got a couple of typos in the last warning message
I tried a --dry-run but it didn't create any report file
Crap. Does nothing work properly in this codebase! I will look into that in a minute.
you got a couple of typos in the last warning message
We have typos everywhere, Xabi. That is the sad truth.
Alright. Same story regarding the program anvi-rename-bins. The report file is now stored, but the fix will be in the master only when we merge our up-to-date branch to it.
Meanwhile you can add new collections into the profile with a different name, and then use anvi-delete-collection to remove them later.
Thanks. Sorry for bothering again, but I think the STDOUT from anvi-rename-bins might be reporting a wrong number of renamed bins. I have 290 bins but it says 291. The report file does list the correspondences for 290 bins.
Thank you, Xabi :) Fixed now.
Hi Xabi,
To address the original complaint you made in this issue report I made a change (bc2350ed7c50c8318d6eaf184e702de3fe557ceb) in the new-version branch, and it will be available in the next version :)
Best,
Hi Meren, How does anvi-summarize choose the completeness value to report in the index.html file? It does a pretty good job with the bins that are fairly complete with a certain level of redundancy, e.g.
Reports
However, when the bin is much worse, e.g. high redundancy regardless of the completeness levels, it usually reports the Rinke et al value by default, or at least in many of them.
Reports
I guess it doesn't matter too much considering that in the second case the Bin is rubbish at this point, but I was wondering if this behavior was expected or not.
Thanks, Xabi