Open MasaakiMatsubara opened 1 year ago
Because there are so many aspects to consider when deciding which structural features can be used for substituent filtering, the discussion tends to become complicated.
To simplify it, we focus on the following structural features for now:
In general, chemical modifications make glycan structures diverse and complicated.
We therefore consider that chemical modifications should not be included in glycan substituents.
There is much debate about what counts as a chemical modification, but for now we define chemical modifications as those containing non-organic atoms.
Since the term "organic atoms" is also open to debate, we define the following elements as organic atoms:
These are basically the elements found in natural organic compounds.
Some elements, e.g. F, could be considered part of a substituent because many compounds registered in chemical compound databases have them, but in many cases they are introduced by chemical modification, e.g. labeling.
Therefore, we restrict the set of organic elements more strictly.
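As a rough illustration of this element check, here is a minimal sketch. It assumes a substituent is given simply as a list of element symbols; the function name `has_only_allowed_elements` and the set `EXAMPLE_ALLOWED` are hypothetical, and the set is only a placeholder, not the element list defined in this issue.

```python
def has_only_allowed_elements(elements, allowed):
    """Return True if every element symbol in `elements` is in `allowed`."""
    return all(symbol in allowed for symbol in elements)


# Placeholder set for illustration only (an assumption, NOT the list from this issue).
EXAMPLE_ALLOWED = {"C", "H", "N", "O"}

# An N-acetyl-like substituent passes; a fluorinated one is rejected.
acetyl_like = ["C", "C", "O", "N"]
fluorinated = ["C", "F", "F", "F"]
print(has_only_allowed_elements(acetyl_like, EXAMPLE_ALLOWED))
print(has_only_allowed_elements(fluorinated, EXAMPLE_ALLOWED))
```

The allowed set is passed in as a parameter so that the actual element list can be swapped in without changing the check itself.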
This feature measures the complexity of a chemical structure: the more branches, the more complex.
Here, a "branch" arises at an atom with three or more connections to heavy atoms. The number of branches is counted per branched atom: an atom with three connections contributes one branch, and an atom with four connections contributes two.
Note that the element of a branched atom is currently not limited to carbon. For example, the P of a phosphate and the S of a sulfate are also counted as branched atoms.
For now, we allow up to four branches in a substituent. This limit was chosen based on the list of substituents in the SNFG document, so that the major substituents are retained.
On the other hand, we deliberately do not use the number of atoms for filtering, because some substituents, e.g. lipids, can be large but have few branches.
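The branch counting described above can be sketched as follows. This is only an illustration under stated assumptions: the substituent is given as an adjacency dict over heavy atoms only (hydrogens omitted), the names `count_branches`, `within_branch_limit` and `MAX_BRANCHES` are hypothetical, and whether the bond to the glycan backbone should be counted is left open here.

```python
MAX_BRANCHES = 4  # "up to four branches are allowed", as stated above


def count_branches(adjacency):
    """Count branches in a heavy-atom graph.

    `adjacency` maps each heavy atom label to the list of its heavy-atom
    neighbours. An atom with three neighbours contributes one branch and
    an atom with four neighbours contributes two, i.e. (degree - 2).
    """
    return sum(max(0, len(neighbours) - 2) for neighbours in adjacency.values())


def within_branch_limit(adjacency, limit=MAX_BRANCHES):
    """Return True if the substituent has at most `limit` branches."""
    return count_branches(adjacency) <= limit


# Phosphate-like group: P bonded to four O atoms, so P counts as two branches.
phosphate = {
    "P": ["O1", "O2", "O3", "O4"],
    "O1": ["P"], "O2": ["P"], "O3": ["P"], "O4": ["P"],
}

# Unbranched chain of six carbons: large in atom count but zero branches.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

print(count_branches(phosphate), count_branches(chain))
```

The chain example shows why atom count is not used: a long lipid-like chain stays at zero branches regardless of its length.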
This feature filters out substituents with large rings.
Here, the "maximum ring size" means the size of the largest ring in the SSSR (smallest set of smallest rings).
Using the SSSR distinguishes single rings from fused rings. As mentioned above, the number of branches is used for filtering as feature 2. That feature also filters out polycyclic compounds, because they have many branches too. Therefore, we do not need to consider the ring size of polycyclic compounds.
Returning to the "maximum ring size", we need to consider what ring size is too large.
Basically, we consider that at least macrocyclic compounds should be excluded from being part of a glycan.
The term "macrocycle" also has various definitions, but many of them place the borderline at a ring of ten or twelve atoms.
Thus, we decided that no substituent in a glycan may have a ring of ten or more atoms in its SSSR.
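A rough sketch of the ring-size check follows. Note that it does not compute a true SSSR: as a simplification, it takes the smallest ring passing through each bond and reports the largest of those, which gives the expected answer for simple and ortho-fused rings but is only a lower bound on the largest SSSR ring in general. A cheminformatics toolkit would be the better choice for real data. All names here (`graph_from_bonds`, `smallest_ring_through_bond`, `max_ring_size_estimate`, `ring_size_ok`, `RING_SIZE_CUTOFF`) are hypothetical.

```python
from collections import deque

RING_SIZE_CUTOFF = 10  # rings of ten or more atoms are rejected, as stated above


def graph_from_bonds(bonds):
    """Build an undirected adjacency dict from a list of (atom, atom) bonds."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency


def smallest_ring_through_bond(adjacency, u, v):
    """Size of the smallest ring containing bond u-v, or 0 if it is not in a ring.

    Breadth-first search from u to v that is not allowed to use the bond itself.
    """
    distance = {u: 0}
    queue = deque([u])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if atom == u and neighbour == v:
                continue  # skip the direct bond u-v
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                if neighbour == v:
                    # path edges plus the closing bond equals the ring atom count
                    return distance[neighbour] + 1
                queue.append(neighbour)
    return 0


def max_ring_size_estimate(adjacency):
    """Largest 'smallest ring through a bond' over all bonds (0 if acyclic)."""
    return max(
        (smallest_ring_through_bond(adjacency, u, v)
         for u in adjacency for v in adjacency[u] if u < v),
        default=0,
    )


def ring_size_ok(adjacency, cutoff=RING_SIZE_CUTOFF):
    """Return True if no ring of `cutoff` or more atoms is found."""
    return max_ring_size_estimate(adjacency) < cutoff


# Naphthalene-like skeleton: two six-membered rings sharing the bond 4-5.
naphthalene = graph_from_bonds(
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
     (4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]
)
# Single ten-membered ring.
cyclodecane = graph_from_bonds([(i, (i + 1) % 10) for i in range(10)])

print(max_ring_size_estimate(naphthalene), ring_size_ok(naphthalene))
print(max_ring_size_estimate(cyclodecane), ring_size_ok(cyclodecane))
```

The two examples both contain ten ring atoms, but the fused skeleton is measured by its six-membered rings rather than its ten-atom perimeter, which is the single-ring versus fused-ring distinction discussed above.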
WURCS has rules for representing various chemical structures.
However, it does not specify which chemical structures are suitable as part of a glycan.
For substituents in particular, there is often no clear rule for identifying them, and as a result they can become too large and complicated.
Thus, in this issue, we discuss how to filter out unsuitable substituents.
Through this discussion, we also try to determine which substituents are suitable as part of a glycan.