Open bobaoai opened 4 years ago
This looks like a breakdown in how linear_code.glycan_to_linear_code is calling linear_code.priority. Right now, branches are selected in "priority" order, as defined in Table 1 of [1]. At some point a breaking change caused priority to return -1 for any "real" monosaccharide. Returning the correct priority should fix this, but I have to be a bit more careful about how I do that, because of the degenerate way in which LinearCode encodes monosaccharides.
Note that LinearCode-derived structures are not necessarily canonicalized in the same way as the same structure parsed from GlycoCT or WURCS, and if you intend to mix the two formats, you should explicitly call glycan.canonicalize().
[1] Banin, E., Neuberger, Y., Altshuler, Y., Halevi, A., Inbar, O., Nir, D., & Dukler, A. (2002). A Novel Linear Code Nomenclature for Complex Carbohydrates. Trends in Glycoscience and Glycotechnology, 14(77), 127–137. https://doi.org/10.4052/tigg.14.127
Thanks!
I totally agree that LinearCode-derived structures might not be canonicalized and that we should be careful when using them.
Since my data analysis only deals with glycans built from common monosaccharides, the extreme cases don't bother me. I use LinearCode because, in my case, if two glycans are the same, their linear codes are the same, so a plain str1 == str2 check is a fast way to test for identical structures across a set of glycans. Do you have a faster way to compare whether two glycans have the same structure? Thanks!
The same uniqueness applies to any comparison of canonicalized structures and formats. The GlycoCT serializer is 5 times faster and enforces the GlycoCT canonicalization sorting on the structure as it is converted to a string, so it should be better all around.
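To illustrate why a canonicalizing serializer makes string equality a valid structure check, here is a minimal sketch on a toy tree. It does not use glypy and is not the GlycoCT algorithm: the Node class and canonical function are hypothetical names, and the sort rule (plain lexicographic order of the child strings) is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy residue: a name, the linkage to its parent, and its children."""
    name: str
    link: str = ""
    children: List["Node"] = field(default_factory=list)


def canonical(node: Node) -> str:
    """Serialize a tree so that child insertion order does not matter.

    Children are serialized first and then sorted, so two trees with the
    same topology always produce the same string.
    """
    parts = sorted(canonical(child) for child in node.children)
    return f"{node.link}{node.name}[{','.join(parts)}]"


# The same structure built with its branches attached in a different order.
a = Node("Man", "b4", [Node("Man", "a3"), Node("Man", "a6")])
b = Node("Man", "b4", [Node("Man", "a6"), Node("Man", "a3")])
# A genuinely different structure (one branch missing).
c = Node("Man", "b4", [Node("Man", "a3")])

same = canonical(a) == canonical(b)           # True
unique = {canonical(g) for g in (a, b, c)}    # two distinct structures
```

Because the canonical string is hashable, it can also serve as a dictionary or set key when deduplicating a large collection of structures.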
I'm not sure I'll have time to fix the LinearCode serialization issue this week.
Do you mean that when we get the GlycoCT from str(Glycan), it is already canonicalized? In other words, does glypy guarantee that every Glycan object with the same structure will have the same str(Glycan)? Currently, I only deal with glycans with clearly specified topology and linkages.
It's totally okay. There is no rush to fix the LinearCode serialization. It all depends on your schedule.
Yes, GlycoCTWriter traverses the glycan by traveling edges in the order specified by the publication [1] (impl). Of course, as we've discussed before, glypy doesn't write out UND sections at the moment, but will canonicalize the current configuration of an under-determined structure.
The glycoct module is well over 2k LOC, so it needs some serious refactoring before anyone else will be able to read it and retain anything about its organization.
[1] Herget, S., Ranzinger, R., Maass, K., & Lieth, C.-W. V. D. (2008). GlycoCT-a unifying sequence format for carbohydrates. Carbohydrate Research, 343(12), 2162–2171. https://doi.org/10.1016/j.carres.2008.03.011
Hey Joshua,
I am afraid that the linearcode.dumps() function may be inconsistent with the Linear Code rules. For example, if we have a glycan as below:
We should get 'Ma3(Ma3(Ma6)Ma6)Mb4GNb4GN?', but dumps() returns 'Ma6(Ma3)Ma6(Ma3)Mb4GNb4GN?'. I believe it only needs a slight modification. When the code traverses the glycan, it needs to sort by glycan.index in descending order so that the monosaccharide with the highest linkage index goes first.
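The ordering proposed above can be sketched on a toy tree. This is not glypy's implementation and does not call linearcode.dumps(): Residue and to_linear are hypothetical names, the tree is assumed to be the structure implied by the two strings quoted above, and the sketch ignores the monosaccharide priority table entirely, breaking ties only on linkage position. Children are visited one by one and written right to left, so the first child visited lands in parentheses next to its parent and the last one becomes the unparenthesized main chain.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Residue:
    """Toy residue: Linear Code symbol, anomer, and parent linkage position."""
    code: str
    anomer: str = "?"
    position: Optional[int] = None  # None for the reducing-end residue
    children: List["Residue"] = field(default_factory=list)

    def label(self) -> str:
        pos = "" if self.position is None else str(self.position)
        return f"{self.code}{self.anomer}{pos}"


def to_linear(node: Residue, descending: bool = True) -> str:
    """Write a Linear Code-like string, ordering branches by linkage position."""
    kids = sorted(node.children, key=lambda k: k.position, reverse=descending)
    text = ""
    for i, kid in enumerate(kids):
        sub = to_linear(kid, descending)
        is_main_chain = i == len(kids) - 1  # last child visited is the main chain
        # Prepend: earlier children end up closer to the parent, in parentheses.
        text = (sub if is_main_chain else f"({sub})") + text
    return text + node.label()


# Assumed structure: a core with a 3-linked arm and a branched 6-linked arm.
arm6 = Residue("M", "a", 6, [Residue("M", "a", 3), Residue("M", "a", 6)])
core = Residue("M", "b", 4, [Residue("M", "a", 3), arm6])
root = Residue("GN", children=[Residue("GN", "b", 4, [core])])

expected = to_linear(root, descending=True)   # 'Ma3(Ma3(Ma6)Ma6)Mb4GNb4GN?'
reported = to_linear(root, descending=False)  # 'Ma6(Ma3)Ma6(Ma3)Mb4GNb4GN?'
```

With this toy model, flipping the sort direction is enough to switch between the string reported from dumps() and the expected one, which is consistent with the suggestion that only the child ordering needs to change.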