merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

anvi-summarize in 2.1.0 #444

Closed xvazquezc closed 7 years ago

xvazquezc commented 7 years ago

Hi Meren,

How does anvi-summarize choose the completeness value to report in the index.html file? It does a pretty good job with bins that are fairly complete and have some redundancy, e.g.

| Source | SCG domain | Percent completeness | Percent redundancy |
|---|---|---|---|
| Rinke et al | archaea | 51.23% | 27.78% |
| Campbell et al | bacteria | 95.68% | 23.02% |

Reports

95.68% 23.02%

However, when the bin is much worse, e.g. highly redundant regardless of its completeness, it reports the Rinke et al value in many cases.

| Source | SCG domain | Percent completeness | Percent redundancy |
|---|---|---|---|
| Rinke et al | archaea | 51.23% | 101.23% |
| Campbell et al | bacteria | 100.00% | 221.58% |

Reports

51.23% 101.23%

I guess it doesn't matter too much, considering that in the second case the bin is rubbish at this point, but I was wondering whether this behavior is expected.

Thanks, Xabi

meren commented 7 years ago

Hi Xabi,

Interesting point. To decide which one to show, anvi'o uses completion minus redundancy. In your case this value is -50 for Rinke et al and -121.58 for Campbell et al, which makes it look like Rinke et al is more suitable to make sense of this bin. Which is clearly not correct :)
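A minimal sketch of the selection rule described above, assuming it simply keeps the SCG collection with the largest completion minus redundancy. The function name `pick_by_difference` and the dictionary layout are illustrative and are not taken from the anvi'o code; the numbers are the two bins reported at the top of this issue.

```python
def pick_by_difference(estimates):
    """Return the estimate with the largest (completion - redundancy).

    Hypothetical helper: mirrors the rule described in the comment,
    not the actual anvi'o implementation.
    """
    return max(estimates, key=lambda e: e["completion"] - e["redundancy"])


# First bin from the report: fairly complete, some redundancy.
good_bin = [
    {"source": "Rinke et al", "domain": "archaea", "completion": 51.23, "redundancy": 27.78},
    {"source": "Campbell et al", "domain": "bacteria", "completion": 95.68, "redundancy": 23.02},
]

# Second bin from the report: highly redundant.
redundant_bin = [
    {"source": "Rinke et al", "domain": "archaea", "completion": 51.23, "redundancy": 101.23},
    {"source": "Campbell et al", "domain": "bacteria", "completion": 100.00, "redundancy": 221.58},
]

chosen_good = pick_by_difference(good_bin)            # 23.45 vs 72.66
chosen_redundant = pick_by_difference(redundant_bin)  # -50.00 vs -121.58

print(chosen_good["source"])       # Campbell et al
print(chosen_redundant["source"])  # Rinke et al
```

Because redundancy is subtracted without any cap, a collection that finds many extra copies is penalized more heavily than one that barely matches the bin, which is why the archaeal estimate wins for the second bin.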

I am not sure how to address this one. Do you or others have a suggestion?

Best,

xvazquezc commented 7 years ago

I see... I guess that completion + redundancy could work even in the most redundant bins based on my results

meren commented 7 years ago

In fact this sounds like a good idea. But I assume most of the stuff you have is bacterial. I am curious whether this approach would work equally well even when you have a mixture of bacterial and archaeal populations.

xvazquezc commented 7 years ago

I have quite a few archaeal bins. I have refined most of my bins and I found two Bacteria-Archaea mixed bins so far. Bin_217:

| Source | SCG domain | Percent completeness | Percent redundancy | Completeness + redundancy |
|---|---|---|---|---|
| Rinke et al | archaea | 94.44% | 61.11% | 155.55% |
| Campbell et al | bacteria | 87.77% | 94.96% | 182.73% |

It is reported as Archaea, but after refining I got 2 bacterial bins (54% and 76% completion) and 1 archaeal bin (>90% completion). With the new formula it would be reported as Bacteria.

This one is different: Bin_230

| Source | SCG domain | Percent completeness | Percent redundancy | Completeness + redundancy |
|---|---|---|---|---|
| Rinke et al | archaea | 76.54% | 99.38% | 175.92% |
| Campbell et al | bacteria | 92.81% | 81.29% | 174.10% |

This was reported as Bacteria, but it contained 2 archaea (both >60% comp) and 1 bacteria (>90% comp) after refining. It would have been reported as Archaea by a narrow margin.

So this would work better but these are just two examples...
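To make the comparison concrete, here is a small sketch that applies both scores (completion minus redundancy, and completion plus redundancy) to the two mixed bins above. The `pick` helper and the data layout are assumptions made for illustration only.

```python
# (completion, redundancy) per SCG domain, taken from the two bins above.
BINS = {
    "Bin_217": {"archaea": (94.44, 61.11), "bacteria": (87.77, 94.96)},
    "Bin_230": {"archaea": (76.54, 99.38), "bacteria": (92.81, 81.29)},
}


def pick(estimates, score):
    """Return the domain whose (completion, redundancy) maximizes `score`."""
    return max(estimates, key=lambda domain: score(*estimates[domain]))


# Rule described earlier in the thread: completion - redundancy.
by_difference = {name: pick(est, lambda c, r: c - r) for name, est in BINS.items()}

# Suggested alternative: completion + redundancy.
by_sum = {name: pick(est, lambda c, r: c + r) for name, est in BINS.items()}

print(by_difference)  # {'Bin_217': 'archaea', 'Bin_230': 'bacteria'}
print(by_sum)         # {'Bin_217': 'bacteria', 'Bin_230': 'archaea'}
```

For both bins the two rules disagree, and the margin for Bin_230 under the sum is under two percentage points, so the outcome there is sensitive to small changes in either estimate.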

tdelmont commented 7 years ago

Hi Xabi,

Very good points. What about reporting as "mixed" bins that are more than 10% redundant using both collections, and displaying the score for the collection with smallest redundancy value?

This way, going from one domain to another would not happen... It would still be a possible problem for bins with very low completion. Maybe we have to be stringent before assigning domain to bins. >50% completion and <10% redundant would be an acceptable solution, from my perspective.
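One way to read this proposal is sketched below. It is only an interpretation: the function name `classify_bin`, the "undefined" fallback label, and the tie-breaking choice (highest completion among collections that pass both thresholds) are assumptions, and the last two example bins are hypothetical values made up for illustration.

```python
def classify_bin(estimates, min_completion=50.0, max_redundancy=10.0):
    """Label a bin from per-domain (completion, redundancy) estimates.

    Returns (label, completion, redundancy). Hypothetical sketch of the
    proposal in this comment, not anvi'o behavior.
    """
    # "Mixed": every collection reports more than max_redundancy.
    if all(r > max_redundancy for _, r in estimates.values()):
        # Display the score of the collection with the smallest redundancy.
        domain = min(estimates, key=lambda d: estimates[d][1])
        return ("mixed",) + estimates[domain]

    # Assign a domain only when a collection passes both thresholds.
    confident = {
        d: (c, r)
        for d, (c, r) in estimates.items()
        if c > min_completion and r < max_redundancy
    }
    if confident:
        # Assumption: prefer the passing collection with the highest completion.
        domain = max(confident, key=lambda d: confident[d][0])
        return (domain,) + confident[domain]

    # Assumption: anything else gets no domain label.
    return ("undefined", None, None)


# Bin_217 from this thread: redundant with both collections.
bin_217 = {"archaea": (94.44, 61.11), "bacteria": (87.77, 94.96)}
# Hypothetical clean bacterial bin and hypothetical low-completion bin.
clean_bin = {"archaea": (30.0, 2.0), "bacteria": (92.0, 3.0)}
sparse_bin = {"archaea": (20.0, 1.0), "bacteria": (35.0, 4.0)}

print(classify_bin(bin_217))     # ('mixed', 94.44, 61.11)
print(classify_bin(clean_bin))   # ('bacteria', 92.0, 3.0)
print(classify_bin(sparse_bin))  # ('undefined', None, None)
```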

Tom

xvazquezc commented 7 years ago

Hi Tom, based on my very limited experience, bins that are very redundant usually produce bins of the same type after refining. So the predicted domain is right... although the bin is crap... However, the completeness values of Rinke and Campbell are much closer in mixed bins than in single-domain ones, regardless of how good they are. Usually the completeness value of one SCG collection is about 1/2 to 2/3 of the other in single-domain bins, whether good or bad, while in mixed bins the values are more similar.

In any case, I think that if it could be done, highlighting the "good bins" (>50%C/<10%R) would be a good idea. Going through a few hundred bins can be a bit of a pain. You could even highlight the "useless bins", given that even refining won't give you anything useful, e.g. <50%C for both SCG collections.
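The ratio observation can be checked against the four bins quoted in this thread. The sketch below is illustrative only: the helper names and the 0.75 cutoff are assumptions (any value between roughly 2/3 and the ratios seen in the mixed bins would separate these four examples), and four bins are far too few to validate a threshold.

```python
def completeness_ratio(completion_a, completion_b):
    """Ratio of the smaller completeness estimate to the larger one (0..1)."""
    low, high = sorted((completion_a, completion_b))
    return low / high if high else 0.0


def looks_mixed(completion_a, completion_b, cutoff=0.75):
    """Flag a bin as possibly mixed when both estimates are similar.

    The cutoff is a hypothetical value chosen for this illustration.
    """
    return completeness_ratio(completion_a, completion_b) > cutoff


# (Rinke et al completeness, Campbell et al completeness) from this thread.
examples = {
    "first example bin": (51.23, 95.68),
    "second example bin": (51.23, 100.00),
    "Bin_217": (94.44, 87.77),
    "Bin_230": (76.54, 92.81),
}

flags = {name: looks_mixed(a, b) for name, (a, b) in examples.items()}
for name, (a, b) in examples.items():
    print(f"{name}: ratio={completeness_ratio(a, b):.2f} mixed={flags[name]}")
```

With these inputs the two single-domain examples have ratios near 0.5, while Bin_217 and Bin_230 come out above 0.8.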

tdelmont commented 7 years ago

Hi, Regarding the useless bins comment, I think that we never know until you refine them (i.e., when below 10% redundant). "Useless" is a bit of a risky term, as many bins do not have a good score but might still be very relevant (eukaryotic, or viral maybe). But it is clear that we should not provide a domain name without a certain level of confidence, which is the case only when bins look good to a certain extent. Let's see what Meren says to that. It could be as simple as providing three categories: bacteria/Archaea/undefined

Thanks again for all the discussion Xabi!

Tom

meren commented 7 years ago

I will respond to the rest in a bit, but I wanted to make a comment about this one:

> In any case, I think that if it could be done, highlighting the "good bins" (>50%C/<10%R) would be a good idea.

You know, there is a quick way to highlight good bins. You should try the program anvi-rename-bins before summary :) See --call-MAGs flag and other parameters under the "MAG options" section:

$ anvi-rename-bins -h
usage: anvi-rename-bins [-h] -c CONTIGS_DB -p PROFILE_DB
                        [--collection-to-read COLLECTION_TO_READ]
                        [--collection-to-write COLLECTION_TO_WRITE]
                        [--prefix PREFIX] [--report-file REPORT_FILE_PATH]
                        [--list-collections] [--dry-run] [--call-MAGs]
                        [--min-completion-for-MAG [0-100]]
                        [--max-redundancy-for-MAG [0-100]]
                        [--size-for-MAG 0.1-10 Mbp] [--use-SCG-averages]
                        [--use-highest-completion-score]

Rename all bins in a given collection (so they have pretty names).

optional arguments:
  -h, --help            show this help message and exit

DEFAULT INPUTS:
  Standard stuff

  -c CONTIGS_DB, --contigs-db CONTIGS_DB
                        Anvi'o contigs database generated by 'anvi-gen-
                        contigs'
  -p PROFILE_DB, --profile-db PROFILE_DB
                        Anvi'o profile database
  --collection-to-read COLLECTION_TO_READ
                        Collection name to read from. Anvi'o will not
                        overwrite an existing collection, instead, it will
                        create a copy of your collection with new bin names.
  --collection-to-write COLLECTION_TO_WRITE
                        The new collection name. Give it a nice, fancy name.

OUTPUT AND TESTING:
  a.k.a, sweet parameters of convenience

  --prefix PREFIX       Prefix for the bin names. Must be a single word,
                        composed of digits and numbers. The use of the
                        underscore character is OK, but that's about it (fine,
                        the use of the dash character is OK, too but no
                        more!). If the prefix is 'PREFIX', each bin will be
                        renamed as 'PREFIX_XXX_00001, PREFIX_XXX_00002', and
                        so on, in the order of percent completion minus
                        percent redundancy (what we call, 'substantive
                        completion'). The 'XXX' part will either be 'Bin', or
                        'MAG depending on other parameters you use. Keep
                        reading.
  --report-file REPORT_FILE_PATH
                        This file will report each name change event, so you
                        can trace back the original names of renamed bins
                        later.
  --list-collections    Show available collections and exit.
  --dry-run             When used does NOT update the profile database, just
                        creates the report file so you can view how things
                        will be renamed.

MAG OPTIONS:
  If you want to call some bins 'MAGs' because you are so cool

  --call-MAGs           This program by default rename your bins as
                        'PREFIX_Bin_00001', 'PREFIX_Bin_00002' and so on. If
                        you use this flag, it will name the ones that meet the
                        criteria described by MAG-related flags as
                        'PREFIX_MAG_00001', 'PREFIX_MAG_00002', and so on. The
                        ones that do not get to be named as MAGs will remain
                        as bins.
  --min-completion-for-MAG [0-100]
                        If --call-MAGs flag is used, call any bin a 'MAG' if
                        their completion estimate is above this (the default
                        is 70), and the redundancy estimate is less than
                        --max-redundancy-for-MAG.
  --max-redundancy-for-MAG [0-100]
                        If --call-MAGs flag is used, call any bin a 'MAG' if
                        their redundancy estimate is below this (the default
                        is 10) and the completion estimate is above --min-
                        completion-for-MAG.
  --size-for-MAG 0.1-10 Mbp
                        If --call-MAGs flag is used, call any bin a 'MAG' if
                        their redundancy estimate is less than --max-
                        redundancy-for-MAG, and the size is larger than this
                        (the default is 2 Mbp), regarldless of the completion.

SCG OPTIONS:
  Options related to single-copy genes. How should SCGs utilized to say what
  should be considered a MAG?

  --use-SCG-averages    If you use this flag, anvi'o will use all avialble
                        single-copy core gene collections, will average their
                        independent completion and redundancy estimates, will
                        select the best matching domain (i.e. arcaheal SCGs vs
                        bacterial SCG), and use the resulting estimates
                        numbers for renaming purposes.
  --use-highest-completion-score
                        If you use this flag, instead of `--use-SCG-averages`,
                        anvi'o will take the SCG collection that estimates the
                        highest completion for a given bin. This will affect
                        how MAGs are called. For instance, if you have HMM
                        hits in your contigs database for both bacterial and
                        archaeal single-copy gene collections, the completion
                        score for a given archaeal bin with a bacterial
                        single-copy genes will be very low. This flag will
                        select the collection with the highest completion
                        estimate, and use that instead of taking the average.
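
Read literally, the MAG-related flags in the help text above describe a simple predicate. The following is a minimal sketch of that reading, using the documented defaults (70, 10, and 2 Mbp); the function name `is_mag` and its parameter names are hypothetical, and the actual anvi-rename-bins code may combine the criteria differently.

```python
def is_mag(completion, redundancy, size_mbp,
           min_completion=70.0, max_redundancy=10.0, size_for_mag=2.0):
    """Decide whether a bin would be called a MAG, per the help text.

    A bin qualifies when its redundancy is below the maximum AND either
    its completion is above the minimum or its size is larger than the
    size threshold (regardless of completion). This is an interpretation
    of the help text, not the program's source code.
    """
    if redundancy >= max_redundancy:
        return False
    return completion > min_completion or size_mbp > size_for_mag


# Hypothetical bins used only to exercise each branch.
print(is_mag(completion=80.0, redundancy=5.0, size_mbp=1.0))   # True: completion
print(is_mag(completion=40.0, redundancy=5.0, size_mbp=3.0))   # True: size
print(is_mag(completion=40.0, redundancy=5.0, size_mbp=1.0))   # False
print(is_mag(completion=95.68, redundancy=23.02, size_mbp=3.0))  # False: redundancy
```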

xvazquezc commented 7 years ago

Oh! I didn't know about this one... I tried a --dry-run but it didn't create any report file


WARNING
===============================================
As per your request, this run will use the completion and redundancy estimate
recovered from the single-copy core gene collection that provides the highest
completion estimate.

Auxiliary Data ...............................: Found: contigs.h5 (v. 1)
Contigs DB ...................................: Initialized: contigs.db (v. 8)

WARNING
===============================================
This was a dry run, which means nothing is updated in the profile database.
Please take a look at the report filen and see whether things worked out the way
you wanted them to. If all looks alright, you will need to run the previous
commandline without the --dry-run flag.

PS: you got a couple of typos in the last warning message

meren commented 7 years ago

> I tried a --dry-run but it didn't create any report file

Crap. Does nothing work properly in this codebase! I will look into that in a minute.

> you got a couple of typos in the last warning message

We have typos everywhere, Xabi. That is the sad truth.

meren commented 7 years ago

Alright. Same story regarding the program anvi-rename-bins. The report file is now stored, but the fix will be in the master only when we merge our up-to-date branch into it.

Meanwhile you can add new collections into the profile with a different name, and then use anvi-delete-collection to remove them later.

xvazquezc commented 7 years ago

Thanks. Sorry for bothering again, but I think the STDOUT from anvi-rename-bins might be reporting a wrong number of renamed bins. I have 290 but it says 291. The report file has the correspondences for 290 bins indeed.

meren commented 7 years ago

Thank you, Xabi :) Fixed now.

meren commented 7 years ago

Hi Xabi,

To address the original complaint you made in this issue report I made a change (bc2350ed7c50c8318d6eaf184e702de3fe557ceb) in the new-version branch, and it will be available in the next version :)

Best,