Closed taltman closed 7 years ago
Also, out of 465 genes annotated as having an LCA of "root (1)", only 3 of them have any Blast hits! The rest have no blast hits. They definitely shouldn't be assigned a taxon of "root", implying that there is evidence of conservation across the tree of life. There's something seriously buggy here.
Okay got to the bottom of it. So the culprits are the default MEGAN thresholds for modifying the calculation of LCA. Although initially the column calculated vanilla LCA, we had some requests to make the column behave like MEGAN and so implemented these taxonomy thresholds.
Lets go through the above case. MEGAN has three thresholds on the annotation hits before calculating LCA:
min-score
(default 20): minimum annotation bitscore to consider for LCAtop-percent
(default 10%): within the top 10% of the top bitscore for ORFmin-support
(default 2): need at least two hits before bothering to calculate LCATurns out the defaults were quite restrictive, particularly the top % default of 10%. In the above case only the top hit "Vibrio alginolyticus NBRC 15630 = ATCC 17749" at score 117 remains after this top 10% exclusion.
This final taxonomy is then eliminated by the min_support
threshold:
Here's some debug statements after increasing the top 10% exclusion to top 20%:
Found 115_1 in compute_min_support_tree
[['Vibrio alginolyticus NBRC 15630 ATCC 17749'], ['Klebsiella pneumoniae'], ['Haemophilus influenzae KR494'], ['Haemophilus influenzae KR494'], ['Haemophilus influenzae KR494']]
Gammaproteobacteria
Which is the LCA that we expect for this annotation.
Since the current MP GUI has no way of setting or knowing about these MEGAN thresholds, its extremely unfair to execute on them. I'll make the defaults extremely liberal (min_score=0
, top-percent=100
, min-support=0
), that way the people who specify these options explicitly on the command like won't see any change in behaviour, and the default behaviour of the GUI won't apply thresholds that one shouldn't be expected to know about.
functional_and_taxonomic_table.txt now reads:
scaffolds_115_1 440 1017 1457 scaffolds_115 1597 - Gammaproteobacteria (1236) Cell wall-associated hydrolase
Thanks Niels! I will tweak these parameters. If only I could figure out where they go! I don't seem them in the params.txt file, nor in the command-line option documentation. How do I pass them in via the command line?
Hi Tomer
Those parameters are in: https://github.com/hallamlab/metapathways2/blob/master/libs/python_scripts/MetaPathways_create_reports_fast.py
Around line 119:
lca_options_group = OptionGroup(parser, 'LCA algorithm Options')
lca_options_group.add_option("--lca-min-score", dest="lca_min_score", type='float', default=50,
help='minimum BLAST/LAST score to consider as for LCA rule')
lca_options_group.add_option("--lca-top-percent", dest="lca_top_percent", type='float', default=10,
help='set of considered matches are within this percent of the highest score hit')
lca_options_group.add_option("--lca-min-support", dest="lca_min_support", type='int', default=2,
help='minimum number of reads that must be assigned to a taxon for ' +\
'that taxon to be present otherwise move up the tree until there ' +
'is a taxon that meets the requirement')
So these should be changed permissive values like 1, 100, and 1, respectively.
I’m going to push my changes tomorrow morning after doing a few tests.
Cheers, Niels
Thanks Niels! This should allow me to move forward.
To close this issue, these parameters should be exposed to the user via the config files, not by having them hand-tweak the source code.
Have the parameters been exposed via the config files?
My understanding is that MetaPathways uses the results of the sequence similarity search for identified proteins against the RefSeq proteome to compute an LCA score, as found in the results/annotation_table/functional_and_taxonomic_table.txt.
For one of my genes, scaffolds_115_1, with LCA of "root (1)" the file scaffolds.refseq-protein-v69.faa.BLASTout.parsed.txt has the following entries:
scaffolds_115_1 gi|543944110|ref|YP_008536566.1| 64 117 0.393939393939 1e-30 64.0 90.62 cell wall-associated hydrolase [Vibrio alginolyticus NBRC 15630 = ATCC 17749] scaffolds_115_1 gi|529985814|ref|YP_007989200.1| 58 102 0.343434343434 8e-26 58.0 91.38 lgt [Klebsiella pneumoniae] scaffolds_115_1 gi|543952674|ref|YP_008544748.1| 64 102 0.343434343434 8e-25 64.0 93.75 cell wall-associated hydrolase [Haemophilus influenzae KR494] scaffolds_115_1 gi|543952632|ref|YP_008544706.1| 64 102 0.343434343434 8e-25 64.0 93.75 cell wall-associated hydrolase [Haemophilus influenzae KR494] scaffolds_115_1 gi|543952609|ref|YP_008544683.1| 64 102 0.343434343434 8e-25 64.0 93.75 cell wall-associated hydrolase [Haemophilus influenzae KR494]
Note that the correct LCA should be Gammaproteobacteria, not root (1). What's going on here? Is this a bug, or is my understanding of how the LCA is computed incorrect?