volkamerlab / opencadd

A Python library for structural cheminformatics
https://opencadd.readthedocs.io
MIT License
91 stars 18 forks source link

superposer is insenstive to residues with insertion codes #56

Open dominiquesydow opened 3 years ago

dominiquesydow commented 3 years ago

Hi @jaimergp,

We - that is @AndreaVolkamer, @MicheleWichmann and I - bumped into a potential issue in the superposer implementation. I would like to check with you if this is really the case - and you like to suggest a fix.

The problem

We want to align 3SQH and 1C4V. This works without errors but I think some residues that should be included in the structural alignment are omitted. I suspect that happens because they are notated with insertion codes in the PDB file, e.g. residue IDs 60, 60A, ..., 60I. Note: These residues are part of the gene sequence, see UniProt positions 458+.

Tracking down the problem in opencadd

The sequence alignment in superposer looks great (residues 60, 60A, ... highlighted with ** - bolt does not work because of code environment):

3SQH
1C4V

EADCGLRPLFEKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDRWVL
---------------------------I----VEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDRWVL

TAAHCL**LYPPWDKNFT**ENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWRENLDRDIALMKLKKPV
TAAHCL**LYPPWDKNFT**ENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWRENLDRDIALMKLKKPV

AFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTAPSVLQVVNLPIVERPVCKDSTRIRITDNM
AFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETGQ-PSVLQVVNLPIVERPVCKDSTRIRITDNM

FCAGYKPDEGKRGDACEGDAGGPFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVID
FCAGYKPDEGKRGDACEGDSGGPFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVID

The structure residue IDs that get selected for the structural alignment are the following:

3SQH: 14 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 60 60 60 60 60 60 60 60 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 129 129 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 184 185 186 186 186 186 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 204 204 205 206 207 208 209 210 211 212 213 214 215 216 217 219 220 221 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247

1C4V: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 60 60 60 60 60 60 60 60 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 129 129 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 184 185 186 186 186 186 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 204 204 205 206 207 208 209 210 211 212 213 214 215 216 217 219 220 221 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247

Note: I took these numbers from the selection string ( resid 14 and segid E and ( name CA and not..., which is used to select residues from the MDA universe.

This means that only residue 60 is selected but not 60A, 60B, ..., 60I.

The solution (maybe)

Luckily, MDA does read in the insertion codes (icode). So we could add this to the selection string here.

I checked if all the insertion codes are as I would expect them (they do!), and will add this for completion here:

from opencadd.structure.core import Structure
s = Structure.from_pdbid("3sqh")
print(*[(i.resid, i.icode) for i in s.residues])

Output: (1, 'C') (1, 'B') (1, 'A') (1, '') (2, '') (3, '') (4, '') (5, '') (6, '') (7, '') (8, '') (9, '') (10, '') (11, '') (12, '') (13, '') (14, '') (14, 'A') (14, 'B') (14, 'C') (14, 'D') (14, 'E') (14, 'F') (14, 'G') (14, 'H') (14, 'I') (14, 'J') (14, 'K') (14, 'L') (14, 'M') (15, '') (16, '') (17, '') (18, '') (19, '') (20, '') (21, '') (22, '') (23, '') (24, '') (25, '') (26, '') (27, '') (28, '') (29, '') (30, '') (31, '') (32, '') (33, '') (34, '') (35, '') (36, '') (36, 'A') (37, '') (38, '') (39, '') (40, '') (41, '') (42, '') (43, '') (44, '') (45, '') (46, '') (47, '') (48, '') (49, '') (50, '') (51, '') (52, '') (53, '') (54, '') (55, '') (56, '') (57, '') (58, '') (59, '') (60, '') (60, 'A') (60, 'B') (60, 'C') (60, 'D') (60, 'E') (60, 'F') (60, 'G') (60, 'H') (60, 'I') (61, '') (62, '') (63, '') (64, '') (65, '') (66, '') (67, '') (68, '') (69, '') (70, '') (71, '') (72, '') (73, '') (74, '') (75, '') (76, '') (77, '') (77, 'A') (78, '') (79, '') (80, '') (81, '') (82, '') (83, '') (84, '') (85, '') (86, '') (87, '') (88, '') (89, '') (90, '') (91, '') (92, '') (93, '') (94, '') (95, '') (96, '') (97, '') (97, 'A') (98, '') (99, '') (100, '') (101, '') (102, '') (103, '') (104, '') (105, '') (106, '') (107, '') (108, '') (109, '') (110, '') (111, '') (112, '') (113, '') (114, '') (115, '') (116, '') (117, '') (118, '') (119, '') (120, '') (121, '') (122, '') (123, '') (124, '') (125, '') (126, '') (127, '') (128, '') (129, '') (129, 'A') (129, 'B') (129, 'C') (130, '') (131, '') (132, '') (133, '') (134, '') (135, '') (136, '') (137, '') (138, '') (139, '') (140, '') (141, '') (142, '') (143, '') (144, '') (145, '') (146, '') (147, '') (148, '') (149, '') (149, 'A') (152, '') (153, '') (154, '') (155, '') (156, '') (157, '') (158, '') (159, '') (160, '') (161, '') (162, '') (163, '') (164, '') (165, '') (166, '') (167, '') (168, '') (169, '') (170, '') (171, '') (172, '') (173, '') (174, '') (175, '') (176, '') (177, '') (178, '') (179, '') (180, '') (181, '') (182, '') (183, '') (184, '') (184, 'A') (185, '') (186, '') (186, 'A') (186, 'B') (186, 'C') (186, 'D') (187, '') (188, '') (189, '') (190, '') (191, '') (192, '') (193, '') (194, '') (195, '') (196, '') (197, '') (198, '') (199, '') (200, '') (201, '') (202, '') (203, '') (204, '') (204, 'A') (204, 'B') (205, '') (206, '') (207, '') (208, '') (209, '') (210, '') (211, '') (212, '') (213, '') (214, '') (215, '') (216, '') (217, '') (219, '') (220, '') (221, '') (221, 'A') (222, '') (223, '') (224, '') (225, '') (226, '') (227, '') (228, '') (229, '') (230, '') (231, '') (232, '') (233, '') (234, '') (235, '') (236, '') (237, '') (238, '') (239, '') (240, '') (241, '') (242, '') (243, '') (244, '') (245, '') (246, '') (247, '') (248, '') (249, '') (250, '') (251, '') (252, '') (253, '') (254, '') (255, '') (256, '') (257, '') (258, '') (259, '') (260, '') (261, '') (262, '') (263, '') (264, '') (265, '') (266, '') (267, '') (268, '') (269, '') (270, '') (271, '') (272, '') (273, '') (274, '') (275, '') (276, '') (277, '') (278, '') (279, '') (280, '') (281, '') (282, '') (283, '') (284, '') (285, '') (286, '') (287, '') (288, '') (289, '') (290, '') (291, '') (292, '') (293, '') (294, '') (295, '') (296, '') (297, '') (298, '') (299, '') (300, '') (301, '') (302, '') (303, '') (304, '') (305, '') (306, '') (307, '') (308, '') (309, '') (310, '') (311, '') (312, '') (313, '') (314, '') (315, '') (316, '') (317, '') (318, '') (319, '') (320, '') (321, '') (322, '') (323, '') (324, '') (325, '') (326, '') (327, '') (328, '') (329, '')

What do you think? Does including the insertion codes make sense to you? Or is there a reason to exclude them?

jaimergp commented 3 years ago

I can't think of any reason to not include them! To be honest this is the first example where I have seen insertion codes being used in the right way 😄 I don't know to what extent insertion codes are a PDB specific feature (or workaround...), but if MDanalysis exposes them we are fine. Most of the time they will be empty, apparently.

I would like to avoid running into a format problem though. We need to check that these icode fields are also there if the structure is loaded from a Mol2 or something that it's not a PDB file.

Once that's verified, we can open a PR to include this in the code as a generalization of the sequence. Hopefully it's not too complicated...

dominiquesydow commented 3 years ago

Oh, you are right. The icode fields might be handled differently in other file formats.

I suppose we will let this issue sit for now and look at it again when superposer is someone's main project (probably not end of 2020/beginning of 2021 :laughing:).