Dump the glypy.Glycan to linear code

bobaoai commented 4 years ago

Hey Joshua,

I am afraid that the linearcode.dumps() function might have inconsistency with the linear code rule. For example, if we have a glycan as below:

RES
1b:x-dglc-HEX-1:5
2s:n-acetyl
3b:b-dglc-HEX-1:5
4s:n-acetyl
5b:b-dman-HEX-1:5
6b:a-dman-HEX-1:5
7b:a-dman-HEX-1:5
8b:a-dman-HEX-1:5
9b:a-dman-HEX-1:5
LIN
1:1d(2+1)2n
2:1o(4+1)3d
3:3d(2+1)4n
4:3o(4+1)5d
5:5o(3+1)6d
6:5o(6+1)7d
7:7o(3+1)8d
8:7o(6+1)9d

We should get 'Ma3(Ma3(Ma6)Ma6)Mb4GNb4GN?' But the dumps() returns 'Ma6(Ma3)Ma6(Ma3)Mb4GNb4GN?'. I believe it only needs a slight modification. When the code traverses the glycan, it need sort the glycan.index descendingly to make the monosaccharide with the highest linkage-index go first.

mobiusklein commented 4 years ago

This looks like a breakdown in how linear_code.glycan_to_linear_code is calling linear_code.priority. Right now, branches are selected in "priority" order, as defined in Table 1 of [1]. At some point a breaking change caused priority to return -1 for any "real" monosaccharide. Returning the correct priority should fix this, but I have to be a bit more careful about how I do that, because of the degenerate way in which LinearCode encodes monosaccharides.

Note that LinearCode derived structures are necessarily not canonicalized w.r.t. to the same structure as parsed from GlycoCT or WURCS, and if you intend to mix the two formats, you should explicitly call glycan.canonicalize().

[1] Banin, E., Neuberger, Y., Altshuler, Y., Halevi, A., Inbar, O., Nir, D., & Dukler, A. (2002). A Novel Linear Code Nomenclature for Complex Carbohydrates. Trends in Glycoscience and Glycotechnology, 14(77), 127–137. https://doi.org/10.4052/tigg.14.127

bobaoai commented 4 years ago

Thanks!

I totally agreed that LinearCode derived structures might not be canonicalized and we should be careful when using it.

Since my data analysis is only dealing with the glycan with common monosaccharides, the extreme case doesn't both me. The reason the LinearCode is used is that in my case if two glycans are the same, their linearcodes are the same. In this case, the str1==str2 will be faster to check the similarity among a set of glycans. Do you have a faster way to compare if two glycans have same structure? Thanks!

mobiusklein commented 4 years ago

The same uniqueness is applied to any comparison of canonicalized structures and formats. The GlycoCT serializer is 5 times faster and enforces the GlycoCT canonicalization sorting on the structure as it is converted to a string, so it should be better all around.

I'm not sure I'll have time to fix the LinearCode serialization issue this week.

bobaoai commented 4 years ago

Do you mean when we get the GlycoCT from the str(Glycan), it is already canonicalized? So do you imply the glypy guarantees that every Glycan object with the same structure will have the same str(Glycan)? I mean currently, I only deal with the glycans with clearly specified topology and linkages.

It's totally okay. There is no push to fix the LinearCode serialization. It all depends on your schedule.

mobiusklein commented 4 years ago

Yes, GlycoCTWriter traverses the glycan by traveling edges in the order specified by the publication [1] (impl). Of course, as we've discussed before, glypy doesn't write out UND sections at the moment, but will canonicalize the current configuration of an under-determined structure.

The glycoct module is well over 2k LOC, so it needs some serious refactoring before anyone else will be able to read it and retain anything about its organization.

[1] Herget, S., Ranzinger, R., Maass, K., & Lieth, C.-W. V. D. (2008). GlycoCT-a unifying sequence format for carbohydrates. Carbohydrate Research, 343(12), 2162–2171. https://doi.org/10.1016/j.carres.2008.03.011

mobiusklein / glypy

Dump the glypy.Glycan to linear code #20