hallamlab / metapathways2

MetaPathways v2.0: A master-worker model for environmental Pathway/Genome Database construction on grids and clouds
http://hallam.microbiology.ubc.ca/MetaPathways/
33 stars 14 forks source link

Incorrect LCA computation #58

Closed taltman closed 7 years ago

taltman commented 9 years ago

My understanding is that MetaPathways uses the results of the sequence similarity search for identified proteins against the RefSeq proteome to compute an LCA score, as found in the results/annotation_table/functional_and_taxonomic_table.txt.

For one of my genes, scaffolds_115_1, with LCA of "root (1)" the file scaffolds.refseq-protein-v69.faa.BLASTout.parsed.txt has the following entries:

scaffolds_115_1 gi|543944110|ref|YP_008536566.1| 64 117 0.393939393939 1e-30 64.0 90.62 cell wall-associated hydrolase [Vibrio alginolyticus NBRC 15630 = ATCC 17749] scaffolds_115_1 gi|529985814|ref|YP_007989200.1| 58 102 0.343434343434 8e-26 58.0 91.38 lgt [Klebsiella pneumoniae] scaffolds_115_1 gi|543952674|ref|YP_008544748.1| 64 102 0.343434343434 8e-25 64.0 93.75 cell wall-associated hydrolase [Haemophilus influenzae KR494] scaffolds_115_1 gi|543952632|ref|YP_008544706.1| 64 102 0.343434343434 8e-25 64.0 93.75 cell wall-associated hydrolase [Haemophilus influenzae KR494] scaffolds_115_1 gi|543952609|ref|YP_008544683.1| 64 102 0.343434343434 8e-25 64.0 93.75 cell wall-associated hydrolase [Haemophilus influenzae KR494]

Note that the correct LCA should be Gammaproteobacteria, not root (1). What's going on here? Is this a bug, or is my understanding of how the LCA is computed incorrect?

taltman commented 9 years ago

Also, out of 465 genes annotated as having an LCA of "root (1)", only 3 of them have any Blast hits! The rest have no blast hits. They definitely shouldn't be assigned a taxon of "root", implying that there is evidence of conservation across the tree of life. There's something seriously buggy here.

nielshanson commented 9 years ago

Okay got to the bottom of it. So the culprits are the default MEGAN thresholds for modifying the calculation of LCA. Although initially the column calculated vanilla LCA, we had some requests to make the column behave like MEGAN and so implemented these taxonomy thresholds.

Lets go through the above case. MEGAN has three thresholds on the annotation hits before calculating LCA:

Turns out the defaults were quite restrictive, particularly the top % default of 10%. In the above case only the top hit "Vibrio alginolyticus NBRC 15630 = ATCC 17749" at score 117 remains after this top 10% exclusion.

This final taxonomy is then eliminated by the min_support threshold:

Here's some debug statements after increasing the top 10% exclusion to top 20%:

Found 115_1 in compute_min_support_tree
[['Vibrio alginolyticus NBRC 15630   ATCC 17749'], ['Klebsiella pneumoniae'], ['Haemophilus influenzae KR494'], ['Haemophilus influenzae KR494'], ['Haemophilus influenzae KR494']]
Gammaproteobacteria

Which is the LCA that we expect for this annotation.

Since the current MP GUI has no way of setting or knowing about these MEGAN thresholds, its extremely unfair to execute on them. I'll make the defaults extremely liberal (min_score=0, top-percent=100, min-support=0), that way the people who specify these options explicitly on the command like won't see any change in behaviour, and the default behaviour of the GUI won't apply thresholds that one shouldn't be expected to know about.

nielshanson commented 9 years ago

functional_and_taxonomic_table.txt now reads:

scaffolds_115_1 440 1017    1457    scaffolds_115   1597    -       Gammaproteobacteria (1236)  Cell wall-associated hydrolase
taltman commented 9 years ago

Thanks Niels! I will tweak these parameters. If only I could figure out where they go! I don't seem them in the params.txt file, nor in the command-line option documentation. How do I pass them in via the command line?

nielshanson commented 9 years ago

Hi Tomer

Those parameters are in: https://github.com/hallamlab/metapathways2/blob/master/libs/python_scripts/MetaPathways_create_reports_fast.py

Around line 119:

lca_options_group =  OptionGroup(parser, 'LCA algorithm Options')
     lca_options_group.add_option("--lca-min-score", dest="lca_min_score",  type='float', default=50,
                       help='minimum BLAST/LAST score to consider as for LCA rule')
     lca_options_group.add_option("--lca-top-percent", dest="lca_top_percent",  type='float', default=10,
                       help='set of considered matches are within this percent of the highest score hit')
     lca_options_group.add_option("--lca-min-support", dest="lca_min_support",  type='int', default=2,
                       help='minimum number of reads that must be assigned to a taxon for ' +\
                            'that taxon to be present otherwise move up the tree until there ' +
                            'is a taxon that meets the requirement')

So these should be changed permissive values like 1, 100, and 1, respectively.

I’m going to push my changes tomorrow morning after doing a few tests.

Cheers, Niels

taltman commented 9 years ago

Thanks Niels! This should allow me to move forward.

taltman commented 7 years ago

To close this issue, these parameters should be exposed to the user via the config files, not by having them hand-tweak the source code.

taltman commented 7 years ago

Have the parameters been exposed via the config files?