Closed qiyunzhu closed 4 years ago
Wait, you're saying this always terminates when it reaches "__"? Ben raises a good point (although the toy example is hopefully unrealistic in that these bugs should both clearly be marked at the kingdom level).
I have seen blanks in taxonomies at lower levels (i.e. the class of a virus is unassigned so appears blank even though the taxonomy exists after that level). Maybe a bunch of placeholder species names without a genus yet having some strain info associated with them -- LCA should still be able to reconcile the species if neither has an associated genus.
If LCA would stop at all cases after the first encountered __;, we'd be regressing in accuracy for some taxids like this.
One way around this is to use the ncbi tids as the hierarchy, or node tips (like the TOL), or a different taxonomy with blanks reconciled with unique placeholders (GTDB).
On Mon, Dec 16, 2019, 2:06 PM Benjamin Hillmann notifications@github.com wrote:
@bhillmann requested changes on this pull request.
Awaiting comments from @danknights https://github.com/danknights for approval about how we want to treat the __.
In shogun/utils/last_common_ancestor.py https://github.com/knights-lab/SHOGUN/pull/28#discussion_r358409937:
taxa_set = [_.split(';') for _ in taxa_set]
- for i, level in enumerate(reversed(list(zip(*taxa_set)))):
- if len(set(level)) == 1:
- convert_i = lambda x: None if x == 0 else -x
- return ';'.join([taxon_level[0] for taxon_level in list(zip(*taxa_set))[:convert_i(i)]])
- return None
- for level in zip(*taxa_set):
- if len(set(level)) > 1 or any(_.endswith('_') for in level):
@danknights https://github.com/danknights I'm a bit concerned if this is the correct way we want to treat blanks, that is, taxonomy followed by two "__". Take the following code test example:
def test_blanks_kingdom_class(self): # normal scenario taxa = [ 'k__;p__Firmicutes;c__;o__Clostridiales;f__Clostridiaceae;g__Clostridium', 'k__;p__Firmicutes;c__;o__Clostridiales;f__Clostridiaceae;g__Clostridium', 'k__;p__Firmicutes;c__;o__Clostridiales;f__Clostridiaceae;g__Clostridium', 'k__;p__Firmicutes;c__;o__Clostridiales;f__Clostridiaceae;g__Caminicella' ] obs = least_common_ancestor(taxa) exp = 'k__;p__Firmicutes;c__;o__Clostridiales;f__Clostridiaceae' # self.assertEqual(obs, exp) self.assertEquals(obs, None)
Should this result in None or the exp = 'k;pFirmicutes;c;oClostridiales;f__Clostridiaceae'? I'm thinking exp.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/knights-lab/SHOGUN/pull/28?email_source=notifications&email_token=AB5NOBTILZSDEP3SRAUKHB3QY7GTRA5CNFSM4JZF3SVKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCPLA5YY#pullrequestreview-332795619, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5NOBTZGZ357D7XYP6FOZLQY7GTRANCNFSM4JZF3SVA .
@bhillmann @GabeAl Thanks for your comments! I think it is feasible to get the function work in the way you suggested: by iterating from right and checking the entire lineage string to the leftmost. But in that way if there are multiple __
on the right side they won't get trimmed.
In my own utils the LCA function was written in a way that resembles @GabeAl 's last comment, that is, depending on a taxonomy tree instead of string comparison. The code is provided below. Though I think for the case of SHOGUN it may be overkill.
def find_lcas(taxa, taxtree):
"""Find lowest common ancestor(s) (LCA(s)) for given taxa."""
lcas = []
for taxon in taxa:
# list ancestors from current taxon to highest ancestor
lineage = [taxon]
this = taxon
while True:
parent = taxtree[this]
if parent is None or parent == this:
break
lineage.append(parent)
this = parent
lineage.reverse()
l = len(lineage)
# attempt to find shared ancestry with any existing LCA
found = False
for i, lca in enumerate(lcas):
anc = None
for j, taxon_ in enumerate(lca):
if j >= l or taxon_ != lineage[j]:
break
anc = taxon_
# shared ancestry found between current taxon and LCA
if anc is not None:
found = True
idx = lca.index(anc) + 1
# update existing LCA
if idx < len(lca):
lcas[i] = lca[:idx]
# only one LCA may share ancestry with current taxon
break
# add to LCA list
if not found:
lcas.append(lineage)
return [x[-1] for x in lcas]
Right, I think a good compromise for SHOGUN would be to left to right, and keep track of __
and cut them if there is nothing lower on the tree.
It would require a lot more code to incorporate a tree structure within SHOGUN rather than the string matching we currently do. Long term, the tree is probably the right solution.
I updated the LCA funciton in #30. Does that work for everybody, to revert to the last known LCA classified string if there is a right tail of unclassified taxonomies.
@bhillmann Thank you for the thoughts and edits! I will read shortly after some busy matters. Also ping'ing @fedarko which originally helped me to review the function.
This was resolved with #30 being merged. Following up with SHOGUN release 1.0.7 #29.
Dear SHOGUN team, following our previous discussion, we provide a patch which makes the LCA (lowest common ancestor) function more accurate. This code is unit-tested and has been reviewed by @fedarko .