mobiusklein / glypy

Glycan Analysis and Glycoinformatics Library for Python
Apache License 2.0
27 stars 14 forks source link

Support for undertermined topology #13

Open edwardsnj opened 6 years ago

edwardsnj commented 6 years ago

Thank you for embarking on this scary and uncertain path!

I took a look at the GlycoCT parsing implementation and how the resulting ambiguous parent links are represented in the Glycan/Monosaccharide/Link classes. Pretty similar to how I've prototyped this previously. I like how ambiguity in the the parent node and position are handled consistently.

First, a bug: AmbiguousLink.find_open_position fails when there are two undetermined sections with the same list of possible parents, but a single parent_position specified for each. The method only checks the parent data member (first parent from the list of possibilities), so the call to find_open_position for the second AmbiguousLink fails. This happens at GlycoCT parse time.

The GlyTouCan accession that showed me this issue is: G41857GD

Stack trace:

Traceback (most recent call last):
  File "test.py", line 95, in <module>
    g = glycoct.loads(s)
  File "/.../GlyPy/glypy/io/glycoct.py", line 1064, in loads
    first = next(g)
  File "/.../GlyPy/glypy/io/glycoct.py", line 671, in next
    return next(self._iter)
  File "/.../GlyPy/glypy/io/glycoct.py", line 1024, in parse
    self._complete_structure()
  File "/.../GlyPy/glypy/io/glycoct.py", line 937, in _complete_structure
    result = self.postprocess()
  File "/.../GlyPy/glypy/io/glycoct.py", line 957, in postprocess
    level.postprocess()
  File "/.../GlyPy/glypy/io/glycoct.py", line 572, in postprocess
    link_obj.find_open_position()
  File "/.../GlyPy/glypy/structure/link.py", line 545, in find_open_position
    raise ValueError("Could not find a valid configurations on current parent/child pair")
ValueError: Could not find a valid configurations on current parent/child pair

Second, a feature request - implementation of an AmbiguousLink.iterconfiguration type method for Glycans, which does the combinatorics for all AmbiguousLink configurations (which interact with each other, as per the issue above). Perhaps the find_open_position logic will be sufficient, but I'm not sure.

Thanks...

mobiusklein commented 6 years ago

I traced the first issue to a problem with the configuration resolution algorithm where instead of failing because of redundant parents, the error was because the parent position was -1, which I hadn't handled properly at the time of writing. Simply adding permission for the unknown position case solved that problem.

Commit 1b67dc39ba6757f23bd49ff210baf9c8ea842da2 (master) has the fixed resolution algorithm.

As for the feature request, the algorithm is relatively straight-forward when unknown positions aren't present.

import itertools

def iterconfigurations_glycan(glycan):
    ambiguous_links = []

    # find all ambiguous links
    for i, link in glycan.iterlinks():
        if link.is_ambiguous():
            ambiguous_links.append(link)

    # produce cross-product of all possible configurations
    combos = itertools.product(*[link.iterconfiguration() for link in ambiguous_links])
    for configs in combos:
        # configurations where the same parent id and position appears multiple
        # times should be skipped
        if len(set([(c[0], c[2]) for c in configs])) < len(configs):
            continue
        # yield the tuples of pairings of (link, configuration)
        yield tuple(zip(ambiguous_links, configs))

Dealing with unknown positions is going to be more difficult since we need to track the available sites for each putative parent, which I'll need to think more about.

Did you want this "iterconfiguration for glycans" function to be a method on the Glycan class? It seems like the logical place to put it to me, though the class is getting pretty complicated with all its various iteration methods.

edwardsnj commented 6 years ago

By eye, I think the issue with not checking multiple potential parents is still likely to be an issue, but I guess I'll have to find another example to demonstrate that. :-)

Yes, on the surface, merely doing the product of all the link iterconfigurations is sufficient - the meat of the problem lies in correctly handling their interactions. Your if statement looks pretty good, except for the wildcard parent positions. Perhaps there is a case to be made for turning missing positions into an enumerated list of all possible ring positions - then the special cases for -1 could be dropped.

Yes, I imagined this in the Glycan class - yielding each possible configuration of the entire structure. Note that the enumeration can be very large.

mobiusklein commented 6 years ago

I can see what you mean. The find_open_position method was written when I was just dealing with ambiguous positions on the same residue. Instead, it could loop over all possible combinations of parents and children and select the first valid configuration. I've modified the structure from the original report to force that to happen, and found an algorithm that works around it. Before I push it here though, I want to stress it a bit more before I trust it's correct.

edwardsnj commented 6 years ago

Its seems like the two pieces of code Glycan.iterconfiguration and AmbigousLink.find_open_position are kinda implementing the same logic, though in different ways. I didn't like the ambiguous position implementation or the links stored in a position key'ed dictionary initially, but I can see its advantages in the attempt to figure out the interactions between AmbigousLinks.

mobiusklein commented 6 years ago

I've pushed a draft implementation of Glycan.iterconfiguration. The theoretical organization works for variable parents. It doesn't discriminate between redundant ambiguous configurations yet.

I'm concerned there's more state that needs to be tracked, but I don't have a collection of examples to work on.