How to collect structural information from a WURCS list by string manipulation

MasaakiMatsubara commented 1 year ago

WURCS is a notation that encodes glycan structure information as a unique string. To decipher a WURCS string, it is usually necessary to parse it with a dedicated program such as WURCSFramework and analyze the result as meaningful structural data. However, because WURCS carries information such as monosaccharide backbones and modifications directly in the string, plain string manipulation is sufficient for simple analyses.

This issue therefore discusses ways to obtain structural information from a list of WURCS strings by string manipulation, for example with shell scripts.

MasaakiMatsubara commented 1 year ago

Extraction of ResidueCodes

A ResidueCode represents a monosaccharide together with its substituents.
More precisely, it represents a monosaccharide backbone and only those substituents that are connected to that backbone alone.

In a WURCS string, the ResidueCodes are the unique monosaccharide residues contained in the glycan, i.e., they give the monosaccharide composition of the glycan without the number of each residue.

To extract only the ResidueCodes from a WURCS list, the grep command is useful.

grep -oE "\[[^]]*?]" WURCS_list.txt

This command extracts each ResidueCode (a string running from [ to ]) from the WURCS strings listed in the text file WURCS_list.txt.

An example of input and output follows:

Input (Content of WURCS_list.txt):
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1

Output:
[a2122h-1b_2*NCC/3=O]
[a1122h-1b]
[a1122h-1a]
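
To try the extraction without preparing a file by hand, the example above can be reproduced roughly as follows. This is only a sketch: the temporary file stands in for WURCS_list.txt, `extract_residue_codes` is a hypothetical helper name, and the pattern is written without the `?`, which should match the same strings here because `[^]]*` already stops at the first `]`. The `-n` option is added on the assumption that it is useful to see which input line each ResidueCode came from.

```shell
# Stand-in for WURCS_list.txt (temporary file, removed at the end).
list=$(mktemp)
cat > "$list" <<'EOF'
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
EOF

# Hypothetical helper: print every ResidueCode in the given file, one per line.
extract_residue_codes() {
  grep -oE '\[[^]]*]' "$1"
}

extract_residue_codes "$list"

# Same extraction, with each match prefixed by the line number of its WURCS.
grep -onE '\[[^]]*]' "$list"

rm -f "$list"
```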

Collection of unique ResidueCodes from WURCS list

If the list contains many WURCS strings, the same ResidueCode may appear more than once in the extracted output, as in the following:

Input (Content of WURCS_list.txt):
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1

Output:
[a2122h-1b_2*NCC/3=O]
[a1122h-1b]
[a1122h-1a]
[a2122h-1b_2*NCC/3=O]
[a1122h-1b]
[a1122h-1a]

In this case, the sort and uniq commands are useful for keeping only the unique ones. Here ResidueCode_list.txt is assumed to hold the extracted ResidueCodes shown above:

sort ResidueCode_list.txt | uniq

Input (Content of WURCS_list.txt):
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1

Output:
[a1122h-1a]
[a1122h-1b]
[a2122h-1b_2*NCC/3=O]

The individual steps can be combined into a single pipeline:

cat WURCS_list.txt | grep -oE "\[[^]]*?]" | sort | uniq
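
As a side note, `sort -u` should give the same result as `sort | uniq` here, and grep can read the file directly, so the pipeline can be shortened. A minimal sketch, wrapped in a hypothetical helper function `unique_residue_codes` and run against a temporary stand-in for WURCS_list.txt:

```shell
# Hypothetical helper: list each distinct ResidueCode in the given file once.
unique_residue_codes() {
  grep -oE '\[[^]]*]' "$1" | sort -u
}

# Two-line example list, written to a temporary file.
list=$(mktemp)
cat > "$list" <<'EOF'
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
EOF

unique_residue_codes "$list"

rm -f "$list"
```

The counting variant still needs `sort | uniq -c`, since `sort -u` discards the duplicates before they can be counted.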

To count the unique ResidueCodes in WURCS list

To count how often each unique ResidueCode occurs, the -c option of uniq can be used.

cat WURCS_list.txt | grep -oE "\[[^]]*?]" | sort | uniq -c

Input (Content of WURCS_list.txt):
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1

Output:
      2 [a1122h-1a]
      2 [a1122h-1b]
      2 [a2122h-1b_2*NCC/3=O]

Note that this count is not the number of residues: as mentioned above, the ResidueCodes in a WURCS give the monosaccharide composition without the number of each residue in the glycan.
The command only counts how many WURCS strings in the list contain each ResidueCode.
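
If the actual number of residues per ResidueCode is needed, one possible approach is to read the residue sequence that follows the ResidueCode list (1-1-2-3-3 in the example), in which each number refers to the position of a ResidueCode in that list. The awk sketch below assumes that the sequence is a plain hyphen-separated list of numbers; `residue_counts` is a hypothetical helper name, and repeating units or ambiguous structures are not considered.

```shell
# Hypothetical helper: for each WURCS read from stdin, print how many residues
# use each ResidueCode, based on the residue sequence section.
residue_counts() {
  awk '{
    rest = $0
    n = 0
    split("", count)                 # reset the counts for this line
    # Collect the ResidueCodes in the order they are listed.
    while (match(rest, /\[[^]]*]/)) {
      code[++n] = substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
    # rest now starts with "/<residue sequence>/<linkages>".
    split(rest, section, "/")
    m = split(section[2], seq, "-")
    for (i = 1; i <= m; i++) count[seq[i]]++
    for (i = 1; i <= n; i++) printf "%d %s\n", count[i], code[i]
  }'
}

printf '%s\n' 'WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1' | residue_counts
```

For the example WURCS this should print 2, 1 and 2 for the three ResidueCodes, which adds up to the 5 residues given in the 3,5,4 header.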

MasaakiMatsubara commented 1 year ago

Extraction of MAPs

A MAP represents a substituent or a linkage between two or more monosaccharides.
More precisely, it represents any atomic group other than the backbone carbon chains. However, hydroxyl groups, carbonyl groups, ether rings and glycosidic bonds are omitted, so by default they are not written as MAPs in WURCS.

To extract the MAPs from a WURCS list, the following grep command can be used:

grep -oE "\*[^]_]*" WURCS_list.txt

This command extracts each MAP (a string running from * up to, but not including, the next _ or ]) from the WURCS strings listed in the text file WURCS_list.txt.

An example of input and output follows:

Input (Content of WURCS_list.txt):
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1

Output:
*NCC/3=O

To collect the unique MAPs from a WURCS list, a command similar to the one described in the comment above (https://github.com/glycoinfo/WURCS/issues/2#issuecomment-1558803741) can be used:

cat WURCS_list.txt | grep -oE "\*[^]_]*" | sort | uniq
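
To see which MAPs occur most often across the list, the same pipeline can be extended with `uniq -c` and a reverse numeric sort. A minimal sketch: `map_frequency` is a hypothetical helper name, and the second line of the sample input is an illustrative WURCS written for this example (assumed to be a sulfated GlcNAc), not one taken from this thread.

```shell
# Hypothetical helper: count every MAP in the given file, most frequent first.
map_frequency() {
  grep -oE '\*[^]_]*' "$1" | sort | uniq -c | sort -rn
}

# Temporary stand-in for WURCS_list.txt; the second line is illustrative only.
list=$(mktemp)
cat > "$list" <<'EOF'
WURCS=2.0/3,5,4/[a2122h-1b_2*NCC/3=O][a1122h-1b][a1122h-1a]/1-1-2-3-3/a4-b1_b4-c1_c3-d1_c6-e1
WURCS=2.0/1,1,0/[a2122h-1b_1-5_2*NCC/3=O_6*OSO/3=O/3=O]/1/
EOF

map_frequency "$list"

rm -f "$list"
```

As with the ResidueCodes, each count is the number of ResidueCodes in the list that carry the MAP, not the number of residues that carry it.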
