knights-lab / SHOGUN

SHallow shOtGUN profiler
GNU Affero General Public License v3.0
54 stars 19 forks source link

improve LCA function #28

Closed qiyunzhu closed 4 years ago

qiyunzhu commented 4 years ago

Dear SHOGUN team, following our previous discussion, we provide a patch which makes the LCA (lowest common ancestor) function more accurate. This code is unit-tested and has been reviewed by @fedarko .

GabeAl commented 4 years ago

Wait, you're saying this always terminates when it reaches "__"? Ben raises a good point (although the toy example is hopefully unrealistic in that these bugs should both clearly be marked at the kingdom level).

I have seen blanks in taxonomies at lower levels (i.e. the class of a virus is unassigned so appears blank even though the taxonomy exists after that level). Maybe a bunch of placeholder species names without a genus yet having some strain info associated with them -- LCA should still be able to reconcile the species if neither has an associated genus.

If LCA would stop at all cases after the first encountered __;, we'd be regressing in accuracy for some taxids like this.

One way around this is to use the ncbi tids as the hierarchy, or node tips (like the TOL), or a different taxonomy with blanks reconciled with unique placeholders (GTDB).

On Mon, Dec 16, 2019, 2:06 PM Benjamin Hillmann notifications@github.com wrote:

@bhillmann requested changes on this pull request.

Awaiting comments from @danknights https://github.com/danknights for approval about how we want to treat the __.

In shogun/utils/last_common_ancestor.py https://github.com/knights-lab/SHOGUN/pull/28#discussion_r358409937:

 taxa_set = [_.split(';') for _ in taxa_set]
  • for i, level in enumerate(reversed(list(zip(*taxa_set)))):
  • if len(set(level)) == 1:
  • convert_i = lambda x: None if x == 0 else -x
  • return ';'.join([taxon_level[0] for taxon_level in list(zip(*taxa_set))[:convert_i(i)]])
  • return None
  • for level in zip(*taxa_set):
  • if len(set(level)) > 1 or any(_.endswith('_') for in level):

@danknights https://github.com/danknights I'm a bit concerned if this is the correct way we want to treat blanks, that is, taxonomy followed by two "__". Take the following code test example:

def test_blanks_kingdom_class(self):
    # normal scenario
    taxa = [
        'k__;p__Firmicutes;c__;o__Clostridiales;f__Clostridiaceae;g__Clostridium',
        'k__;p__Firmicutes;c__;o__Clostridiales;f__Clostridiaceae;g__Clostridium',
        'k__;p__Firmicutes;c__;o__Clostridiales;f__Clostridiaceae;g__Clostridium',
        'k__;p__Firmicutes;c__;o__Clostridiales;f__Clostridiaceae;g__Caminicella'
    ]
    obs = least_common_ancestor(taxa)
    exp = 'k__;p__Firmicutes;c__;o__Clostridiales;f__Clostridiaceae'
    # self.assertEqual(obs, exp)
    self.assertEquals(obs, None)

Should this result in None or the exp = 'k;pFirmicutes;c;oClostridiales;f__Clostridiaceae'? I'm thinking exp.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/knights-lab/SHOGUN/pull/28?email_source=notifications&email_token=AB5NOBTILZSDEP3SRAUKHB3QY7GTRA5CNFSM4JZF3SVKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCPLA5YY#pullrequestreview-332795619, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5NOBTZGZ357D7XYP6FOZLQY7GTRANCNFSM4JZF3SVA .

qiyunzhu commented 4 years ago

@bhillmann @GabeAl Thanks for your comments! I think it is feasible to get the function work in the way you suggested: by iterating from right and checking the entire lineage string to the leftmost. But in that way if there are multiple __ on the right side they won't get trimmed.

In my own utils the LCA function was written in a way that resembles @GabeAl 's last comment, that is, depending on a taxonomy tree instead of string comparison. The code is provided below. Though I think for the case of SHOGUN it may be overkill.

def find_lcas(taxa, taxtree):
    """Find lowest common ancestor(s) (LCA(s)) for given taxa."""
    lcas = []
    for taxon in taxa:

        # list ancestors from current taxon to highest ancestor
        lineage = [taxon]
        this = taxon
        while True:
            parent = taxtree[this]
            if parent is None or parent == this:
                break
            lineage.append(parent)
            this = parent
        lineage.reverse()
        l = len(lineage)

        # attempt to find shared ancestry with any existing LCA
        found = False
        for i, lca in enumerate(lcas):
            anc = None
            for j, taxon_ in enumerate(lca):
                if j >= l or taxon_ != lineage[j]:
                    break
                anc = taxon_

            # shared ancestry found between current taxon and LCA
            if anc is not None:
                found = True
                idx = lca.index(anc) + 1

                # update existing LCA
                if idx < len(lca):
                    lcas[i] = lca[:idx]

                # only one LCA may share ancestry with current taxon
                break

        # add to LCA list
        if not found:
            lcas.append(lineage)
    return [x[-1] for x in lcas]
bhillmann commented 4 years ago

Right, I think a good compromise for SHOGUN would be to left to right, and keep track of __ and cut them if there is nothing lower on the tree.

It would require a lot more code to incorporate a tree structure within SHOGUN rather than the string matching we currently do. Long term, the tree is probably the right solution.

bhillmann commented 4 years ago

I updated the LCA funciton in #30. Does that work for everybody, to revert to the last known LCA classified string if there is a right tail of unclassified taxonomies.

qiyunzhu commented 4 years ago

@bhillmann Thank you for the thoughts and edits! I will read shortly after some busy matters. Also ping'ing @fedarko which originally helped me to review the function.

bhillmann commented 4 years ago

This was resolved with #30 being merged. Following up with SHOGUN release 1.0.7 #29.