Open MasaakiMatsubara opened 1 year ago
A ResidueCode is a representation of a monosaccharide together with its substituents.
More precisely, it represents a monosaccharide backbone and only those substituents that are attached solely to that backbone.
In a WURCS string, the ResidueCodes represent the unique monosaccharide residues contained in the glycan; i.e., the ResidueCodes in a WURCS give the monosaccharide composition without the count of each residue in the glycan.
To extract only the ResidueCodes from a WURCS list, the grep command is useful.
grep -oE "\[[^]]*?]" WURCS_list.txt
This command extracts each ResidueCode (the string from [ to ]) from the WURCS strings listed in the text file WURCS_list.txt.
An example of input and output follows:
Input (Content of WURCS_list.txt):
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
Output:
[a2122h-1b_2*NCC/3=O]
[a1122h-1b]
[a1122h-1a]
If the list contains many WURCS strings, the same ResidueCode may be extracted more than once, as in the following:
Input (Content of WURCS_list.txt):
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
Output:
[a2122h-1b_2*NCC/3=O]
[a1122h-1b]
[a1122h-1a]
[a2122h-1b_2*NCC/3=O]
[a1122h-1b]
[a1122h-1a]
In this case, the sort and uniq commands are useful for extracting only the unique ones (here the grep output above is assumed to have been saved as ResidueCode_list.txt).
sort ResidueCode_list.txt | uniq
Input (Content of WURCS_list.txt):
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
Output:
[a1122h-1a]
[a1122h-1b]
[a2122h-1b_2*NCC/3=O]
The individual steps can be combined into a single pipeline:
cat WURCS_list.txt | grep -oE "\[[^]]*?]" | sort | uniq
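As a minor variation on the pipeline above, the following sketch lets grep read the file directly and uses sort -u in place of sort | uniq. It assumes a grep that supports the -o option (GNU or BSD grep); the function name unique_residue_codes is hypothetical. The pattern omits the ? after *, which should not be needed because [^]]* cannot run past the closing bracket.

```shell
# Hypothetical helper: print each distinct ResidueCode found in the given files.
unique_residue_codes() {
  # -o prints only the matched part; -h suppresses file names when several files are given.
  # LC_ALL=C makes the sort order byte-wise and reproducible across locales.
  grep -ohE '\[[^]]*]' "$@" | LC_ALL=C sort -u
}

# Demonstration on a throwaway copy of the sample data used in this comment.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
EOF
unique_residue_codes "$sample"
rm -f "$sample"
```

For a real list, the call would simply be unique_residue_codes WURCS_list.txt.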
To count how many times each unique ResidueCode occurs, the -c option of uniq can be used.
cat WURCS_list.txt | grep -oE "\[[^]]*?]" | sort | uniq -c
Input (Content of WURCS_list.txt):
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
Output:
2 [a1122h-1a]
2 [a1122h-1b]
2 [a2122h-1b_2*NCC/3=O]
Note that the count is not equal to the number of residues, because the ResidueCodes in a WURCS represent the monosaccharide composition without the count in the glycan, as mentioned above.
This command just counts how many WURCS strings in the list contain each ResidueCode.
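If actual residue counts are wanted, one possible approach is to also read the residue sequence that follows the ResidueCode list (1-1-2-3-3 in the example), where each number is taken to be an index into that list. The sketch below rests on that reading of the example strings and on the assumption that the sequence contains only plain indices separated by -; entries that are not valid indices are skipped. The function name count_residues is hypothetical.

```shell
# Hypothetical helper: count residues per ResidueCode using the residue sequence.
count_residues() {
  awk '
  {
    # Greedy match from the first "[" to the last "]" covers the ResidueCode list.
    if (!match($0, /\[.*]/)) next
    n = split(substr($0, RSTART + 1, RLENGTH - 2), code, /[]][[]/)

    # The residue sequence starts after the "/" that follows the list
    # and ends at the next "/".
    rest = substr($0, RSTART + RLENGTH + 1)
    slash = index(rest, "/")
    seq = (slash > 0) ? substr(rest, 1, slash - 1) : rest

    m = split(seq, idx, "-")
    for (j = 1; j <= m; j++)
      if (idx[j] in code) total["[" code[idx[j]] "]"]++
  }
  END { for (c in total) print total[c], c }
  ' "$@" | LC_ALL=C sort -k2
}

# Demonstration on a throwaway file holding the sample WURCS once.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
EOF
count_residues "$sample"
rm -f "$sample"
```

For the single sample string this reports counts of 2, 1 and 2, matching the two 1s, one 2 and two 3s in the sequence 1-1-2-3-3, whereas uniq -c would report 1 for every ResidueCode.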
A MAP is a representation of a substituent or of a linkage between two or more monosaccharides.
More precisely, it represents any atomic group other than the backbone carbon chains. However, since hydroxyl groups, carbonyl groups, ether rings and glycosidic bonds are omitted, they are not represented as MAPs in WURCS by default.
To extract the MAPs from a WURCS list, the following grep command can be used:
grep -oE "\*[^]_]*" WURCS_list.txt
This command extracts each MAP (the string from * up to, but not including, the next _ or ]) from the WURCS strings listed in the text file WURCS_list.txt.
An example of input and output follows:
Input (Content of WURCS_list.txt):
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
Output:
*NCC/3=O
To collect the unique MAPs from a WURCS list, a command similar to the one described in the comment above (https://github.com/glycoinfo/WURCS/issues/2#issuecomment-1558803741) can be used, as follows:
cat WURCS_list.txt | grep -oE "\*[^]_]*" | sort | uniq
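In the same way as for ResidueCodes, uniq -c can be applied to MAPs. The sketch below ranks MAPs by how often they are matched in the list; it assumes a grep that supports -o, and the function name map_frequency is hypothetical. As with ResidueCodes, the number reflects matches in the strings, not the number of substituents in the glycans.

```shell
# Hypothetical helper: list MAPs with their match counts, most frequent first.
map_frequency() {
  grep -ohE '\*[^]_]*' "$@" |
    LC_ALL=C sort |
    uniq -c |
    sed 's/^ *//' |              # drop the padding that uniq -c puts before the count
    LC_ALL=C sort -k1,1nr -k2    # by count (descending), then by MAP
}

# Demonstration on a throwaway copy of the sample data used in this issue.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
EOF
map_frequency "$sample"
rm -f "$sample"
```

For a real list, the call would be map_frequency WURCS_list.txt.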
WURCS is a notation that represents glycan structure information as a unique string. To decipher WURCS, it is usually necessary to parse the string with a dedicated program such as WURCSFramework and analyze it as meaningful structural data. On the other hand, since WURCS encodes various information, such as monosaccharide backbones and modifications, directly in the string, simple string manipulation is sufficient for simple analyses.
Thus, in this issue, we discuss methods for obtaining structural information from a WURCS list by string manipulation using shell scripts and the like.