mobiusklein / glypy

Glycan Analysis and Glycoinformatics Library for Python
Apache License 2.0

Parsing glycan with UND structure #17

Open bobaoai opened 4 years ago

bobaoai commented 4 years ago

Hey Joshua,

I ran into a problem with a glycan that has an ambiguous (underdetermined) structure. The GlycoCT string I generated with GlycanBuilder is shown below. Reading through the code, it looks like the writer can generate a GlycoCT string that contains 'SubtreeLinkageID1', but the reader cannot load such a string. Am I correct? If so, what should I change in the string, or how else can I load or build a Glycan object that has the ambiguous structure? Is it possible to call Glycan.fragments on this kind of structure?

glycoct.loads(
"""RES
1b:b-dglc-HEX-1:5
2s:n-acetyl
3b:b-dglc-HEX-1:5
4s:n-acetyl
5b:b-dman-HEX-1:5
6b:a-dman-HEX-1:5
7b:a-dman-HEX-1:5
8b:a-lgal-HEX-1:5|6:d
LIN
1:1d(2+1)2n
2:1o(4+1)3d
3:3d(2+1)4n
4:3o(4+1)5d
5:5o(3+1)6d
6:5o(6+1)7d
7:1o(6+1)8d
UND
UND1:100.0:100.0
ParentIDs:1|3|5|6|7|8
SubtreeLinkageID1:u(2+1)u
RES
9b:b-dglc-HEX-1:5
10s:n-acetyl
11b:b-dgal-HEX-1:5
LIN
8:9d(2+1)10n
9:9o(3+1)11d
""")
---------------------------------------------------------------------------
GlycoCTError                              Traceback (most recent call last)
<ipython-input-10-b5e3c0a66dfc> in <module>
     38 10:12d(2+1)13n
     39 11:12o(3+1)14d
---> 40 """)

/anaconda3/lib/python3.7/site-packages/glypy/io/glycoct.py in loads(text, structure_class, allow_repeats, allow_multiple)
   1330 
   1331     text_buffer = StringIO(text)
-> 1332     return load(text_buffer, structure_class, allow_repeats, allow_multiple)
   1333 
   1334 

/anaconda3/lib/python3.7/site-packages/glypy/io/glycoct.py in load(stream, structure_class, allow_repeats, allow_multiple)
   1299     """
   1300     g = GlycoCTReader(stream, structure_class=structure_class, allow_repeats=allow_repeats)
-> 1301     first = next(g)
   1302     if not allow_multiple:
   1303         return first

/anaconda3/lib/python3.7/site-packages/glypy/io/glycoct.py in next(self)
    888         if self._iter is None:
    889             iter(self)
--> 890         return next(self._iter)
    891 
    892     #: Alias for next. Supports Py3 Iterator interface

/anaconda3/lib/python3.7/site-packages/glypy/io/glycoct.py in parse(self)
   1251                 self.handle_repeat_inner(line)
   1252             elif line.strip()[:3] == UND:
-> 1253                 self.handle_und_inner(line)
   1254             elif ALT == line.strip():
   1255                 raise GlycoCTSectionUnsupported(ALT)

/anaconda3/lib/python3.7/site-packages/glypy/io/glycoct.py in handle_und_inner(self, line)
   1151         if match is None:
   1152             raise GlycoCTError("Could not interpret UND SubtreeLinkage %r at line %d" % (
-> 1153                 subtree_linkage_line, self._source_line))
   1154         else:
   1155             link_dict = match.groupdict()

GlycoCTError: Could not interpret UND SubtreeLinkage 'SubtreeLinkageID1:u(2+1)u' at line 21

Thanks for your help in advance!

mobiusklein commented 4 years ago

No, the GlycoCTReader can read underdetermined glycans. The problem here is that the GlycoCT string you got doesn't match the specification from the GlycoCT manual (Page 17).

A linkage type may be specified using any of the letters o, d, h, n, x, r, or s. glypy can interpret all but r and s, because I have never seen either of those prochiral loss linkages. If I had to guess, you/GlycanBuilder intended the linkage around the UND component to be unknown (going by the "u")? The appropriate way to denote that would be with an x.

I can patch the parser to support "u" in those positions and translate it to "x" in the next few days though if that is indeed the expected behavior.
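Until such a patch exists, one possible workaround is to rewrite the text before it reaches the parser. The following is a minimal sketch, not part of glypy: the helper name `normalize_unknown_linkage` and its regular expression are hypothetical, and it assumes that "u" appears only as the linkage type on `SubtreeLinkageID` lines, as in the string above.

```python
import re

# Hypothetical pattern for a UND sub-tree linkage line such as
# "SubtreeLinkageID1:u(2+1)u". Groups: prefix, parent linkage type,
# parenthesised positions, child linkage type.
SUBTREE_LINKAGE = re.compile(
    r"^(\s*SubtreeLinkageID\d+:)([a-z])(\(.+?\))([a-z])\s*$")


def normalize_unknown_linkage(glycoct_text):
    """Rewrite 'u' linkage types on SubtreeLinkageID lines as 'x'.

    All other lines are passed through unchanged. A trailing newline
    in the input is not preserved.
    """
    fixed = []
    for line in glycoct_text.splitlines():
        match = SUBTREE_LINKAGE.match(line)
        if match:
            prefix, parent_type, positions, child_type = match.groups()
            # Only the non-standard 'u' is translated; o, d, h, n, x stay as-is.
            parent_type = "x" if parent_type == "u" else parent_type
            child_type = "x" if child_type == "u" else child_type
            line = prefix + parent_type + positions + child_type
        fixed.append(line)
    return "\n".join(fixed)
```

The returned string could then be passed to glycoct.loads in place of the original text. Whether the resulting `x(2+1)x` line parses as intended has not been verified here.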

bobaoai commented 4 years ago

Thanks for the reply. Yeah, it is kind of strange, but I have seen a lot of 'u', and I have been treating 'u' as 'x' for a while by locally modifying the glypy code. I will let you know if I find any discussion of the use of 'u'.

However, when I manually changed the u to o and d, the parser worked, but the output glycan is not the same as the input: the UND structure is attached to the first possible parent node. Is this the expected behavior? Thanks for the help :)

a_glycan = glycoct.loads(
"""RES
1b:b-dglc-HEX-1:5
2s:n-acetyl
3b:b-dglc-HEX-1:5
4s:n-acetyl
5b:b-dman-HEX-1:5
6b:a-dman-HEX-1:5
7b:a-dman-HEX-1:5
8b:a-lgal-HEX-1:5|6:d
LIN
1:1d(2+1)2n
2:1o(4+1)3d
3:3d(2+1)4n
4:3o(4+1)5d
5:5o(3+1)6d
6:5o(6+1)7d
7:1o(6+1)8d
UND
UND1:100.0:100.0
ParentIDs:7|8
SubtreeLinkageID1:o(2+1)d
RES
9b:b-dglc-HEX-1:5
10s:n-acetyl
11b:b-dgal-HEX-1:5
LIN
8:9d(2+1)10n
9:9o(3+1)11d
""")
a_glycan

RES
1b:b-dglc-HEX-1:5
2s:n-acetyl
3b:b-dglc-HEX-1:5
4s:n-acetyl
5b:b-dman-HEX-1:5
6b:a-dman-HEX-1:5
7b:b-dglc-HEX-1:5
8s:n-acetyl
9b:b-dgal-HEX-1:5
10b:a-dman-HEX-1:5
11b:a-lgal-HEX-1:5|6:d
LIN
1:1d(2+1)2n
2:1o(4+1)3d
3:3d(2+1)4n
4:3o(4+1)5d
5:5o(3+1)6d
6:6o(2+1)7d
7:7d(2+1)8n
8:7o(3+1)9d
9:5o(6+1)10d
10:1o(6+1)11d

mobiusklein commented 4 years ago

Ah, right. glypy.io.glycoct supports reading UND sections, but doesn't know how to write them back out most of the time. It wasn't high on my priority list to support this at the time. The sub-tree linkage is created using an AmbiguousLink instead of a Link.

AmbiguousLink objects have a list of possible parents, parent positions, children, and child positions to choose from. When a Glycan has an undefined linkage or an ambiguous attachment site, you can iterate over the possible states using glycan.iterconfigurations(). See the iterconfigurations docstring for usage.
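To illustrate what "possible states" means here, the sketch below enumerates every combination of parent, parent position, child, and child position for a single ambiguous link. It is plain Python and does not use glypy: the function `enumerate_attachments` is hypothetical, and it is not how AmbiguousLink or iterconfigurations is actually implemented. The values are read from the second GlycoCT string in this thread (ParentIDs:7|8, linkage (2+1), sub-tree root residue 9).

```python
from itertools import product


def enumerate_attachments(parents, parent_positions, children, child_positions):
    """Yield every (parent, parent_position, child, child_position) choice.

    Each argument is a list of candidates, mirroring the four lists
    described above. This is a conceptual illustration only.
    """
    for combo in product(parents, parent_positions, children, child_positions):
        yield combo


# Candidates taken from the UND section:
#   ParentIDs:7|8           -> two candidate parent residues
#   SubtreeLinkageID1:(2+1) -> parent position 2, child position 1
#   first RES line of UND   -> sub-tree root is residue 9
configs = list(enumerate_attachments([7, 8], [2], [9], [1]))

for parent, parent_pos, child, child_pos in configs:
    print("residue %d attached at position %d of residue %d (child position %d)"
          % (child, parent_pos, parent, child_pos))
```

With two candidate parents and a single choice for everything else, this gives two configurations. A structure with several ambiguous links would have the product of the per-link counts.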

bobaoai commented 4 years ago

Thanks, Joshua! I will check it. I really appreciate that you are still developing glypy even though you graduated a while ago. What is the plan for improving glypy? I am willing to help if I can.

Best, Bokan Bao

mobiusklein commented 4 years ago

You're welcome.

Right now, my main concern with glypy is to improve the documentation. My inexperience with Sphinx when I first set it up may mean substantial re-organization there. Eventually, I may do some performance tuning, but I do not have any specific plans for that at this time.

If you would like to contribute, I'd be happy to review pull requests and discuss ideas and applications you might have.