mobiusklein / glypy

Glycan Analysis and Glycoinformatics Library for Python
Apache License 2.0
27 stars 14 forks source link

Bug in subtree_of #15

Closed bobaoai closed 6 years ago

bobaoai commented 6 years ago

Hi I have two glycans.

gly_a = """RES 1b:b-dman-HEX-1:5 2b:a-dman-HEX-1:5 3b:a-dman-HEX-1:5 4b:b-dglc-HEX-1:5 5b:b-dglc-HEX-1:5 6s:n-acetyl 7b:b-dgal-HEX-1:5 8s:n-acetyl 9b:a-dgro-dgal-NON-2:6|1:a|2:keto|3:d 10s:n-acetyl LIN 1:1o(3+1)2d 2:1o(6+1)3d 3:2o(4+1)4d 4:3o(6+1)5d 5:4d(2+1)6n 6:4o(4+1)7d 7:5d(2+1)8n 8:7o(3+2)9d 9:9d(5+1)10n""" and gly_b = """RES 1b:b-dman-HEX-1:5 2b:a-dman-HEX-1:5 3b:a-dman-HEX-1:5 4b:b-dglc-HEX-1:5 5b:b-dglc-HEX-1:5 6s:n-acetyl 7s:n-acetyl 8b:b-dgal-HEX-1:5 9b:a-dgro-dgal-NON-2:6|1:a|2:keto|3:d 10s:n-acetyl LIN 1:1o(3+1)2d 2:1o(6+1)3d 3:2o(4+1)4d 4:3o(6+1)5d 5:4d(2+1)6n 6:5d(2+1)7n 7:5o(4+1)8d 8:8o(3+2)9d 9:9d(5+1)10n"""

They have the same length and should have the same topology. However, subtree_of(gly_a, gly_b,False) returns 1 but subtree_of(gly_b, gly_a,False) returns nothing. I expect the second one should also return 1. Could you help me check it? Thanks!

mobiusklein commented 6 years ago

Thank you for reporting this. It looks like the result of a tricky case of branch ambiguity. Short answer - I need to change the matching process from first-matching child to best-matching child. I'll patch this within the next two days.

Here's a labeled drawing of gly_a image

and the same for gly_b image

The problem appears to be because the subtree search algorithm looks for inclusion, meaning it can match a branch by ignoring children. This means that if the query tree has two branches with a common base unit (GlcNAc here) but one is extended and the other is not, order begins to matter. If the structure that is being tested for inclusion on also has two branches, one that matches the extended branch and one that matches the short branch. If the short branch in the query tree is searched first, it will match the first branch it meets in this case, which could be either the long branch or the short branch. If it matches the long branch in the target, the long branch in the query won't be able to match and the query will fail to find any matches and subtree_of will return None.

On that note, does returning None make sense in this case? I reasoned that the function is used to test for numerical index of a subtree, and that if the subtree is absent, it's not an exceptional circumstance. That said, there are plenty of cases where that is the case (e.g. list.index raises ValueError). Since subtree_of returns a number when it succeeds I can't make returning a boolean signal the failure.

bobaoai commented 6 years ago

Hi, Thank you for the prompt reply. The algorithm you mentioned makes sense to me. Yeah, and I totally understand that it should return None in this circumstance.

These two structures are generated by breaking down a big N-glycan. We are trying to come up with the smallest set of topologyical-substructure in one N-glycan. This case was found when I tried to remove all the duplicates in one set.

`def _duplicate_cleaning_wrapper(idex, motif_list, cleaned_motif_dic):

ldex = 0
_check_list = motif_list
_ = []

while ldex < len(_check_list):
    jdex = ldex + 1
    _.append(ldex)
    while jdex < len(_check_list):
        try:
            if not subtree_of(_check_list[jdex], _check_list[ldex]) is None:
                del _check_list[jdex]
            # elif not subtree_of(_check_list[ldex], _check_list[jdex]) is None:
                # _check_list[ldex] = _check_list[ldex]
                # del _check_list[jdex]
            else:
                jdex += 1
        except AttributeError:
            print(ldex, jdex)
    # if not find_same:
    ldex += 1
# NBT_plot_glycan_utilities.plot_glycan_list(_check_list, _)
cleaned_motif_dic[idex] = _check_list`

I also add some wrapper to the plot function which can help add the plot title or plot a list of glycans. Hope you can find them helpful. NBT_plot_glycan_utilities.txt

mobiusklein commented 6 years ago

I have rewritten subtree_search.topological_inclusion, converting it into a more complex combinatorial paired tree traversal which should be guaranteed to find the optimal topological embedding, which fixes the problem you encountered with subtree_search.subtree_of. Changes in commit 3c631c4.

bobaoai commented 6 years ago

Hi Joshua, Thanks for updating the subtree_of function! I have one question about the plot you showed earlier. image In this graph, you have the order number for each glycan. How can I add my own customized label for each monosaccharide? Is okay that I just modify the enumerate_tree function? Thanks!

mobiusklein commented 6 years ago

The enumerate_tree function was originally written for debugging purposes, so it won't work if you use any orientation other than vertical, but it was designed to take a pluggable labeler:

def enumerate_tree(tree, ax, labeler=None):
    """Label each node of `tree` on `ax` with the node's :attr:`id`

    Parameters
    ----------
    tree: :class:`DrawTreeNode`
        A drawn :class:`DrawTreeNode` object
    ax: matplotlib.Axes
        The axes on which `tree` has been drawn

    Returns
    -------
    tree, ax
    """
    if labeler is None:
        labeler = operator.attrgetter('id')
    nodes = list(breadth_first_traversal(tree))
    for node in nodes:
        x, y = node.coords()
        ax.text(x, y + 0.2, labeler(node.tree), color='red')
    # Update the plot
    ax.get_figure().canvas.draw()
    return tree, ax

The labeler must be a callable that takes a Monosaccharide instance and returns the label.

If you need something more sophisticated, that could be trickier. Each node knows roughly where it should reside in unit scale geometry in the vertical orientation, by calling their coords() method. Each node also carries an instance of matplotlib.transform.TransformBase instance which can convert the values returned by coords(): node.transform.transform(node.coords()) to get the actual drawn coordinates.

Has the actual problem with subtree_search been resolved as far as you can tell? If so, we can close this issue.

bobaoai commented 6 years ago

Hi Joshua,

Thanks! The subtree_of problem is solved. We can close this issue.