jyaacoub / MutDTA

Improving the precision oncology pipeline by providing binding affinity perturbation predictions on a priori identified cancer driver genes.

Find Correct Kiba Structures for edge weighting #34

Closed jyaacoub closed 11 months ago

jyaacoub commented 1 year ago

The KIBA structures downloaded via the UniProt ID mapping at https://www.uniprot.org/id-mapping do not work: they map to structures that do not match the sequences provided in the dataset.

This is an issue because the dataset comes with alignment files and predicted contact maps that depend on the structure having exactly the same sequence as the input.

Potential solutions:

  1. Go through each UniProt ID and identify structures that are a 100% match with the input sequence (instead of just taking the first structure from the mapping list).
  2. Redo the alignment and replace the contact maps with ones from real structures (essentially building a new dataset).
  3. Use structures predicted with AlphaFold.

(1) is ideal but the hardest to do, and it is unknown whether the correct structures are even available. (3) is the easiest but might have unforeseeable consequences during training. (2) is a middle ground, but it would require retraining previous models to get updated results and to confirm that performance on the dataset has not changed too much.
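
As a rough illustration of option (1), the selection step could look like the sketch below. It assumes the candidate sequences have already been extracted from the downloaded structures; `pick_exact_match` and the toy IDs and sequences are hypothetical and not part of the repository.

```python
from typing import Iterable, Optional, Tuple


def pick_exact_match(target_seq: str,
                     candidates: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the ID of the first candidate whose sequence equals target_seq.

    `candidates` yields (pdb_id, sequence) pairs in mapping-list order.
    Returns None when no candidate is an exact match.
    """
    for pdb_id, pdb_seq in candidates:
        if pdb_seq == target_seq:
            return pdb_id
    return None


# Toy data (made up): the first hit is only a fragment, the second is exact.
hits = [("1ABC", "MKT"), ("2XYZ", "MKTAYIAK")]
best = pick_exact_match("MKTAYIAK", hits)
print(best)
```

Taking the first hit from the mapping list would have picked the fragment here, which is the failure mode described above.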

Mismatch info:

Pretty much all of the available PDB structures are mismatched.

LENGTH MISMATCH FOR O00141 - /cluster/home/t122995uhn/projects/data/kiba/structures/O00141.pdb (Expected: 431, Actual: 284)
LENGTH MISMATCH FOR O00311 - /cluster/home/t122995uhn/projects/data/kiba/structures/O00311.pdb (Expected: 574, Actual: 312)
LENGTH MISMATCH FOR O00329 - /cluster/home/t122995uhn/projects/data/kiba/structures/O00329.pdb (Expected: 1044, Actual: 922)
LENGTH MISMATCH FOR O00418 - /cluster/home/t122995uhn/projects/data/kiba/structures/O00418.pdb (Expected: 725, Actual: 148)
LENGTH MISMATCH FOR O00444 - /cluster/home/t122995uhn/projects/data/kiba/structures/O00444.pdb (Expected: 970, Actual: 88)
LENGTH MISMATCH FOR O14757 - /cluster/home/t122995uhn/projects/data/kiba/structures/O14757.pdb (Expected: 476, Actual: 272)
LENGTH MISMATCH FOR O14920 - /cluster/home/t122995uhn/projects/data/kiba/structures/O14920.pdb (Expected: 756, Actual: 62)
LENGTH MISMATCH FOR O14965 - /cluster/home/t122995uhn/projects/data/kiba/structures/O14965.pdb (Expected: 403, Actual: 261)
LENGTH MISMATCH FOR O15075 - /cluster/home/t122995uhn/projects/data/kiba/structures/O15075.pdb (Expected: 740, Actual: 105)
LENGTH MISMATCH FOR O15111 - /cluster/home/t122995uhn/projects/data/kiba/structures/O15111.pdb (Expected: 745, Actual: 62)
LENGTH MISMATCH FOR O15264 - /cluster/home/t122995uhn/projects/data/kiba/structures/O15264.pdb (Expected: 365, Actual: 344)
LENGTH MISMATCH FOR O15530 - /cluster/home/t122995uhn/projects/data/kiba/structures/O15530.pdb (Expected: 556, Actual: 284)
LENGTH MISMATCH FOR O43293 - /cluster/home/t122995uhn/projects/data/kiba/structures/O43293.pdb (Expected: 454, Actual: 275)
LENGTH MISMATCH FOR O43741 - /cluster/home/t122995uhn/projects/data/kiba/structures/O43741.pdb (Expected: 272, Actual: 89)
LENGTH MISMATCH FOR O43781 - /cluster/home/t122995uhn/projects/data/kiba/structures/O43781.pdb (Expected: 588, Actual: 391)
LENGTH MISMATCH FOR O60674 - /cluster/home/t122995uhn/projects/data/kiba/structures/O60674.pdb (Expected: 1132, Actual: 287)
LENGTH MISMATCH FOR O75116 - /cluster/home/t122995uhn/projects/data/kiba/structures/O75116.pdb (Expected: 1388, Actual: 388)
LENGTH MISMATCH FOR O75582 - /cluster/home/t122995uhn/projects/data/kiba/structures/O75582.pdb (Expected: 802, Actual: 319)
LENGTH MISMATCH FOR O94806 - /cluster/home/t122995uhn/projects/data/kiba/structures/O94806.pdb (Expected: 890, Actual: 129)
LENGTH MISMATCH FOR O95819 - /cluster/home/t122995uhn/projects/data/kiba/structures/O95819.pdb (Expected: 1239, Actual: 297)
LENGTH MISMATCH FOR O96013 - /cluster/home/t122995uhn/projects/data/kiba/structures/O96013.pdb (Expected: 591, Actual: 274)
LENGTH MISMATCH FOR O96017 - /cluster/home/t122995uhn/projects/data/kiba/structures/O96017.pdb (Expected: 543, Actual: 116)
LENGTH MISMATCH FOR P00519 - /cluster/home/t122995uhn/projects/data/kiba/structures/P00519.pdb (Expected: 1130, Actual: 109)
LENGTH MISMATCH FOR P00533 - /cluster/home/t122995uhn/projects/data/kiba/structures/P00533.pdb (Expected: 1210, Actual: 511)
LENGTH MISMATCH FOR P04049 - /cluster/home/t122995uhn/projects/data/kiba/structures/P04049.pdb (Expected: 648, Actual: 167)
LENGTH MISMATCH FOR P04626 - /cluster/home/t122995uhn/projects/data/kiba/structures/P04626.pdb (Expected: 1255, Actual: 95)
LENGTH MISMATCH FOR P04629 - /cluster/home/t122995uhn/projects/data/kiba/structures/P04629.pdb (Expected: 796, Actual: 107)
LENGTH MISMATCH FOR P05129 - /cluster/home/t122995uhn/projects/data/kiba/structures/P05129.pdb (Expected: 697, Actual: 77)
LENGTH MISMATCH FOR P05771 - /cluster/home/t122995uhn/projects/data/kiba/structures/P05771.pdb (Expected: 671, Actual: 326)
LENGTH MISMATCH FOR P06213 - /cluster/home/t122995uhn/projects/data/kiba/structures/P06213.pdb (Expected: 1382, Actual: 300)
LENGTH MISMATCH FOR P06239 - /cluster/home/t122995uhn/projects/data/kiba/structures/P06239.pdb (Expected: 509, Actual: 105)
LENGTH MISMATCH FOR P06241 - /cluster/home/t122995uhn/projects/data/kiba/structures/P06241.pdb (Expected: 537, Actual: 58)
LENGTH MISMATCH FOR P06493 - /cluster/home/t122995uhn/projects/data/kiba/structures/P06493.pdb (Expected: 297, Actual: 292)
LENGTH MISMATCH FOR P07332 - /cluster/home/t122995uhn/projects/data/kiba/structures/P07332.pdb (Expected: 822, Actual: 114)
LENGTH MISMATCH FOR P07333 - /cluster/home/t122995uhn/projects/data/kiba/structures/P07333.pdb (Expected: 972, Actual: 303)
LENGTH MISMATCH FOR P07947 - /cluster/home/t122995uhn/projects/data/kiba/structures/P07947.pdb (Expected: 543, Actual: 59)
LENGTH MISMATCH FOR P07948 - /cluster/home/t122995uhn/projects/data/kiba/structures/P07948.pdb (Expected: 512, Actual: 60)
LENGTH MISMATCH FOR P07949 - /cluster/home/t122995uhn/projects/data/kiba/structures/P07949.pdb (Expected: 1114, Actual: 284)
LENGTH MISMATCH FOR P08069 - /cluster/home/t122995uhn/projects/data/kiba/structures/P08069.pdb (Expected: 1367, Actual: 471)
LENGTH MISMATCH FOR P08581 - /cluster/home/t122995uhn/projects/data/kiba/structures/P08581.pdb (Expected: 1390, Actual: 98)
LENGTH MISMATCH FOR P08631 - /cluster/home/t122995uhn/projects/data/kiba/structures/P08631.pdb (Expected: 526, Actual: 437)
LENGTH MISMATCH FOR P08922 - /cluster/home/t122995uhn/projects/data/kiba/structures/P08922.pdb (Expected: 2347, Actual: 281)
LENGTH MISMATCH FOR P09619 - /cluster/home/t122995uhn/projects/data/kiba/structures/P09619.pdb (Expected: 1106, Actual: 91)
LENGTH MISMATCH FOR P09769 - /cluster/home/t122995uhn/projects/data/kiba/structures/P09769.pdb (Expected: 529, Actual: 71)
LENGTH MISMATCH FOR P10721 - /cluster/home/t122995uhn/projects/data/kiba/structures/P10721.pdb (Expected: 976, Actual: 290)
LENGTH MISMATCH FOR P11309 - /cluster/home/t122995uhn/projects/data/kiba/structures/P11309.pdb (Expected: 404, Actual: 277)
LENGTH MISMATCH FOR P11362 - /cluster/home/t122995uhn/projects/data/kiba/structures/P11362.pdb (Expected: 822, Actual: 278)
LENGTH MISMATCH FOR P11802 - /cluster/home/t122995uhn/projects/data/kiba/structures/P11802.pdb (Expected: 303, Actual: 267)
LENGTH MISMATCH FOR P12931 - /cluster/home/t122995uhn/projects/data/kiba/structures/P12931.pdb (Expected: 536, Actual: 105)
LENGTH MISMATCH FOR P15056 - /cluster/home/t122995uhn/projects/data/kiba/structures/P15056.pdb (Expected: 766, Actual: 264)
LENGTH MISMATCH FOR P15735 - /cluster/home/t122995uhn/projects/data/kiba/structures/P15735.pdb (Expected: 406, Actual: 284)
LENGTH MISMATCH FOR P16234 - /cluster/home/t122995uhn/projects/data/kiba/structures/P16234.pdb (Expected: 1089, Actual: 91)
LENGTH MISMATCH FOR P16591 - /cluster/home/t122995uhn/projects/data/kiba/structures/P16591.pdb (Expected: 822, Actual: 116)
LENGTH MISMATCH FOR P17252 - /cluster/home/t122995uhn/projects/data/kiba/structures/P17252.pdb (Expected: 672, Actual: 85)
LENGTH MISMATCH FOR P17612 - /cluster/home/t122995uhn/projects/data/kiba/structures/P17612.pdb (Expected: 351, Actual: 335)
LENGTH MISMATCH FOR P17948 - /cluster/home/t122995uhn/projects/data/kiba/structures/P17948.pdb (Expected: 1338, Actual: 98)
LENGTH MISMATCH FOR P19784 - /cluster/home/t122995uhn/projects/data/kiba/structures/P19784.pdb (Expected: 350, Actual: 334)
LENGTH MISMATCH FOR P21802 - /cluster/home/t122995uhn/projects/data/kiba/structures/P21802.pdb (Expected: 821, Actual: 202)
LENGTH MISMATCH FOR P22455 - /cluster/home/t122995uhn/projects/data/kiba/structures/P22455.pdb (Expected: 802, Actual: 275)
LENGTH MISMATCH FOR P22607 - /cluster/home/t122995uhn/projects/data/kiba/structures/P22607.pdb (Expected: 806, Actual: 213)
LENGTH MISMATCH FOR P23443 - /cluster/home/t122995uhn/projects/data/kiba/structures/P23443.pdb (Expected: 525, Actual: 263)
LENGTH MISMATCH FOR P23458 - /cluster/home/t122995uhn/projects/data/kiba/structures/P23458.pdb (Expected: 1154, Actual: 280)
LENGTH MISMATCH FOR P24723 - /cluster/home/t122995uhn/projects/data/kiba/structures/P24723.pdb (Expected: 683, Actual: 140)
LENGTH MISMATCH FOR P24941 - /cluster/home/t122995uhn/projects/data/kiba/structures/P24941.pdb (Expected: 298, Actual: 277)
LENGTH MISMATCH FOR P27361 - /cluster/home/t122995uhn/projects/data/kiba/structures/P27361.pdb (Expected: 379, Actual: 351)
LENGTH MISMATCH FOR P27448 - /cluster/home/t122995uhn/projects/data/kiba/structures/P27448.pdb (Expected: 753, Actual: 322)
LENGTH MISMATCH FOR P28482 - /cluster/home/t122995uhn/projects/data/kiba/structures/P28482.pdb (Expected: 360, Actual: 333)
LENGTH MISMATCH FOR P29317 - /cluster/home/t122995uhn/projects/data/kiba/structures/P29317.pdb (Expected: 976, Actual: 265)
LENGTH MISMATCH FOR P29323 - /cluster/home/t122995uhn/projects/data/kiba/structures/P29323.pdb (Expected: 1055, Actual: 77)
LENGTH MISMATCH FOR P29376 - /cluster/home/t122995uhn/projects/data/kiba/structures/P29376.pdb (Expected: 864, Actual: 301)
LENGTH MISMATCH FOR P29597 - /cluster/home/t122995uhn/projects/data/kiba/structures/P29597.pdb (Expected: 1187, Actual: 287)
LENGTH MISMATCH FOR P30291 - /cluster/home/t122995uhn/projects/data/kiba/structures/P30291.pdb (Expected: 646, Actual: 259)
LENGTH MISMATCH FOR P30530 - /cluster/home/t122995uhn/projects/data/kiba/structures/P30530.pdb (Expected: 894, Actual: 380)
LENGTH MISMATCH FOR P31749 - /cluster/home/t122995uhn/projects/data/kiba/structures/P31749.pdb (Expected: 480, Actual: 115)
LENGTH MISMATCH FOR P31751 - /cluster/home/t122995uhn/projects/data/kiba/structures/P31751.pdb (Expected: 481, Actual: 271)
LENGTH MISMATCH FOR P34947 - /cluster/home/t122995uhn/projects/data/kiba/structures/P34947.pdb (Expected: 590, Actual: 529)
LENGTH MISMATCH FOR P35916 - /cluster/home/t122995uhn/projects/data/kiba/structures/P35916.pdb (Expected: 1363, Actual: 213)
LENGTH MISMATCH FOR P35968 - /cluster/home/t122995uhn/projects/data/kiba/structures/P35968.pdb (Expected: 1356, Actual: 275)
LENGTH MISMATCH FOR P36507 - /cluster/home/t122995uhn/projects/data/kiba/structures/P36507.pdb (Expected: 400, Actual: 303)
LENGTH MISMATCH FOR P36888 - /cluster/home/t122995uhn/projects/data/kiba/structures/P36888.pdb (Expected: 993, Actual: 298)
LENGTH MISMATCH FOR P41240 - /cluster/home/t122995uhn/projects/data/kiba/structures/P41240.pdb (Expected: 450, Actual: 246)
LENGTH MISMATCH FOR P41279 - /cluster/home/t122995uhn/projects/data/kiba/structures/P41279.pdb (Expected: 467, Actual: 302)
LENGTH MISMATCH FOR P41743 - /cluster/home/t122995uhn/projects/data/kiba/structures/P41743.pdb (Expected: 596, Actual: 89)
LENGTH MISMATCH FOR P42336 - /cluster/home/t122995uhn/projects/data/kiba/structures/P42336.pdb (Expected: 1068, Actual: 158)
LENGTH MISMATCH FOR P42345 - /cluster/home/t122995uhn/projects/data/kiba/structures/P42345.pdb (Expected: 2549, Actual: 94)
LENGTH MISMATCH FOR P42679 - /cluster/home/t122995uhn/projects/data/kiba/structures/P42679.pdb (Expected: 507, Actual: 97)
LENGTH MISMATCH FOR P42684 - /cluster/home/t122995uhn/projects/data/kiba/structures/P42684.pdb (Expected: 1182, Actual: 119)
LENGTH MISMATCH FOR P43403 - /cluster/home/t122995uhn/projects/data/kiba/structures/P43403.pdb (Expected: 619, Actual: 388)
LENGTH MISMATCH FOR P43405 - /cluster/home/t122995uhn/projects/data/kiba/structures/P43405.pdb (Expected: 635, Actual: 254)
LENGTH MISMATCH FOR P45983 - /cluster/home/t122995uhn/projects/data/kiba/structures/P45983.pdb (Expected: 427, Actual: 321)
LENGTH MISMATCH FOR P45984 - /cluster/home/t122995uhn/projects/data/kiba/structures/P45984.pdb (Expected: 424, Actual: 342)
LENGTH MISMATCH FOR P48729 - /cluster/home/t122995uhn/projects/data/kiba/structures/P48729.pdb (Expected: 337, Actual: 790)
LENGTH MISMATCH FOR P48730 - /cluster/home/t122995uhn/projects/data/kiba/structures/P48730.pdb (Expected: 415, Actual: 286)
LENGTH MISMATCH FOR P48736 - /cluster/home/t122995uhn/projects/data/kiba/structures/P48736.pdb (Expected: 1102, Actual: 841)
LENGTH MISMATCH FOR P49137 - /cluster/home/t122995uhn/projects/data/kiba/structures/P49137.pdb (Expected: 400, Actual: 319)
LENGTH MISMATCH FOR P49336 - /cluster/home/t122995uhn/projects/data/kiba/structures/P49336.pdb (Expected: 464, Actual: 321)
LENGTH MISMATCH FOR P49674 - /cluster/home/t122995uhn/projects/data/kiba/structures/P49674.pdb (Expected: 416, Actual: 284)
LENGTH MISMATCH FOR P49759 - /cluster/home/t122995uhn/projects/data/kiba/structures/P49759.pdb (Expected: 484, Actual: 333)
LENGTH MISMATCH FOR P49760 - /cluster/home/t122995uhn/projects/data/kiba/structures/P49760.pdb (Expected: 499, Actual: 347)
LENGTH MISMATCH FOR P49841 - /cluster/home/t122995uhn/projects/data/kiba/structures/P49841.pdb (Expected: 420, Actual: 355)
LENGTH MISMATCH FOR P50613 - /cluster/home/t122995uhn/projects/data/kiba/structures/P50613.pdb (Expected: 346, Actual: 286)
LENGTH MISMATCH FOR P50750 - /cluster/home/t122995uhn/projects/data/kiba/structures/P50750.pdb (Expected: 372, Actual: 291)
LENGTH MISMATCH FOR P51617 - /cluster/home/t122995uhn/projects/data/kiba/structures/P51617.pdb (Expected: 712, Actual: 301)
LENGTH MISMATCH FOR P51812 - /cluster/home/t122995uhn/projects/data/kiba/structures/P51812.pdb (Expected: 740, Actual: 299)
LENGTH MISMATCH FOR P51813 - /cluster/home/t122995uhn/projects/data/kiba/structures/P51813.pdb (Expected: 675, Actual: 110)
LENGTH MISMATCH FOR P51955 - /cluster/home/t122995uhn/projects/data/kiba/structures/P51955.pdb (Expected: 445, Actual: 253)
LENGTH MISMATCH FOR P52333 - /cluster/home/t122995uhn/projects/data/kiba/structures/P52333.pdb (Expected: 1124, Actual: 288)
LENGTH MISMATCH FOR P52564 - /cluster/home/t122995uhn/projects/data/kiba/structures/P52564.pdb (Expected: 334, Actual: 339)
LENGTH MISMATCH FOR P53350 - /cluster/home/t122995uhn/projects/data/kiba/structures/P53350.pdb (Expected: 603, Actual: 224)
LENGTH MISMATCH FOR P53667 - /cluster/home/t122995uhn/projects/data/kiba/structures/P53667.pdb (Expected: 647, Actual: 290)
LENGTH MISMATCH FOR P53778 - /cluster/home/t122995uhn/projects/data/kiba/structures/P53778.pdb (Expected: 367, Actual: 327)
LENGTH MISMATCH FOR P53779 - /cluster/home/t122995uhn/projects/data/kiba/structures/P53779.pdb (Expected: 464, Actual: 346)
LENGTH MISMATCH FOR P54619 - /cluster/home/t122995uhn/projects/data/kiba/structures/P54619.pdb (Expected: 331, Actual: 143)
LENGTH MISMATCH FOR P54646 - /cluster/home/t122995uhn/projects/data/kiba/structures/P54646.pdb (Expected: 552, Actual: 256)
LENGTH MISMATCH FOR P54760 - /cluster/home/t122995uhn/projects/data/kiba/structures/P54760.pdb (Expected: 987, Actual: 185)
LENGTH MISMATCH FOR P67870 - /cluster/home/t122995uhn/projects/data/kiba/structures/P67870.pdb (Expected: 215, Actual: 328)
LENGTH MISMATCH FOR P68400 - /cluster/home/t122995uhn/projects/data/kiba/structures/P68400.pdb (Expected: 391, Actual: 336)
LENGTH MISMATCH FOR P78368 - /cluster/home/t122995uhn/projects/data/kiba/structures/P78368.pdb (Expected: 415, Actual: 290)
LENGTH MISMATCH FOR P80192 - /cluster/home/t122995uhn/projects/data/kiba/structures/P80192.pdb (Expected: 1104, Actual: 245)
LENGTH MISMATCH FOR Q00534 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q00534.pdb (Expected: 326, Actual: 269)
LENGTH MISMATCH FOR Q00535 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q00535.pdb (Expected: 292, Actual: 278)
LENGTH MISMATCH FOR Q02156 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q02156.pdb (Expected: 737, Actual: 227)
LENGTH MISMATCH FOR Q02750 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q02750.pdb (Expected: 393, Actual: 289)
LENGTH MISMATCH FOR Q02763 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q02763.pdb (Expected: 1124, Actual: 300)
LENGTH MISMATCH FOR Q02779 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q02779.pdb (Expected: 954, Actual: 62)
LENGTH MISMATCH FOR Q04759 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q04759.pdb (Expected: 706, Actual: 280)
LENGTH MISMATCH FOR Q04771 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q04771.pdb (Expected: 509, Actual: 312)
LENGTH MISMATCH FOR Q04912 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q04912.pdb (Expected: 1400, Actual: 298)
LENGTH MISMATCH FOR Q05397 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q05397.pdb (Expected: 1052, Actual: 142)
LENGTH MISMATCH FOR Q05655 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q05655.pdb (Expected: 676, Actual: 126)
LENGTH MISMATCH FOR Q06187 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q06187.pdb (Expected: 659, Actual: 67)
LENGTH MISMATCH FOR Q06418 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q06418.pdb (Expected: 890, Actual: 174)
LENGTH MISMATCH FOR Q07912 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q07912.pdb (Expected: 1038, Actual: 184)
LENGTH MISMATCH FOR Q08881 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q08881.pdb (Expected: 620, Actual: 245)
LENGTH MISMATCH FOR Q12866 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q12866.pdb (Expected: 999, Actual: 124)
LENGTH MISMATCH FOR Q13131 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13131.pdb (Expected: 559, Actual: 332)
LENGTH MISMATCH FOR Q13153 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13153.pdb (Expected: 545, Actual: 287)
LENGTH MISMATCH FOR Q13177 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13177.pdb (Expected: 524, Actual: 355)
LENGTH MISMATCH FOR Q13188 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13188.pdb (Expected: 491, Actual: 47)
LENGTH MISMATCH FOR Q13237 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13237.pdb (Expected: 762, Actual: 150)
LENGTH MISMATCH FOR Q13464 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13464.pdb (Expected: 1354, Actual: 179)
LENGTH MISMATCH FOR Q13554 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13554.pdb (Expected: 666, Actual: 289)
LENGTH MISMATCH FOR Q13555 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13555.pdb (Expected: 558, Actual: 137)
LENGTH MISMATCH FOR Q13557 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13557.pdb (Expected: 499, Actual: 301)
LENGTH MISMATCH FOR Q13627 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13627.pdb (Expected: 763, Actual: 346)
LENGTH MISMATCH FOR Q13882 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13882.pdb (Expected: 451, Actual: 100)
LENGTH MISMATCH FOR Q13976 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q13976.pdb (Expected: 671, Actual: 36)
LENGTH MISMATCH FOR Q14012 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q14012.pdb (Expected: 370, Actual: 250)
LENGTH MISMATCH FOR Q14289 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q14289.pdb (Expected: 1009, Actual: 135)
LENGTH MISMATCH FOR Q14680 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q14680.pdb (Expected: 651, Actual: 311)
LENGTH MISMATCH FOR Q15078 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q15078.pdb (Expected: 307, Actual: 278)
LENGTH MISMATCH FOR Q15118 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q15118.pdb (Expected: 436, Actual: 365)
LENGTH MISMATCH FOR Q15303 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q15303.pdb (Expected: 1308, Actual: 615)
LENGTH MISMATCH FOR Q15418 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q15418.pdb (Expected: 735, Actual: 285)
LENGTH MISMATCH FOR Q15759 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q15759.pdb (Expected: 364, Actual: 347)
LENGTH MISMATCH FOR Q16288 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q16288.pdb (Expected: 839, Actual: 105)
LENGTH MISMATCH FOR Q16512 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q16512.pdb (Expected: 942, Actual: 182)
LENGTH MISMATCH FOR Q16513 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q16513.pdb (Expected: 984, Actual: 331)
LENGTH MISMATCH FOR Q16539 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q16539.pdb (Expected: 360, Actual: 351)
LENGTH MISMATCH FOR Q16566 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q16566.pdb (Expected: 473, Actual: 278)
LENGTH MISMATCH FOR Q16584 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q16584.pdb (Expected: 847, Actual: 77)
LENGTH MISMATCH FOR Q16620 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q16620.pdb (Expected: 822, Actual: 121)
LENGTH MISMATCH FOR Q16644 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q16644.pdb (Expected: 382, Actual: 273)
LENGTH MISMATCH FOR Q5S007 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q5S007.pdb (Expected: 2527, Actual: 156)
LENGTH MISMATCH FOR Q7KZI7 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q7KZI7.pdb (Expected: 788, Actual: 313)
LENGTH MISMATCH FOR Q8IU85 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q8IU85.pdb (Expected: 385, Actual: 278)
LENGTH MISMATCH FOR Q96GD4 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q96GD4.pdb (Expected: 344, Actual: 253)
LENGTH MISMATCH FOR Q96KB5 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q96KB5.pdb (Expected: 322, Actual: 303)
LENGTH MISMATCH FOR Q96L34 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q96L34.pdb (Expected: 752, Actual: 304)
LENGTH MISMATCH FOR Q96RG2 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q96RG2.pdb (Expected: 1323, Actual: 114)
LENGTH MISMATCH FOR Q96RR4 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q96RR4.pdb (Expected: 588, Actual: 256)
LENGTH MISMATCH FOR Q96SB4 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q96SB4.pdb (Expected: 655, Actual: 353)
LENGTH MISMATCH FOR Q99683 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q99683.pdb (Expected: 1374, Actual: 263)
LENGTH MISMATCH FOR Q9BUB5 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9BUB5.pdb (Expected: 465, Actual: 242)
LENGTH MISMATCH FOR Q9BZL6 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9BZL6.pdb (Expected: 878, Actual: 125)
LENGTH MISMATCH FOR Q9H2G2 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9H2G2.pdb (Expected: 1235, Actual: 288)
LENGTH MISMATCH FOR Q9H2X6 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9H2X6.pdb (Expected: 1198, Actual: 331)
LENGTH MISMATCH FOR Q9H4B4 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9H4B4.pdb (Expected: 646, Actual: 281)
LENGTH MISMATCH FOR Q9HAZ1 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9HAZ1.pdb (Expected: 481, Actual: 329)
LENGTH MISMATCH FOR Q9HBH9 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9HBH9.pdb (Expected: 465, Actual: 277)
LENGTH MISMATCH FOR Q9HCP0 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9HCP0.pdb (Expected: 422, Actual: 294)
LENGTH MISMATCH FOR Q9NWZ3 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9NWZ3.pdb (Expected: 460, Actual: 294)
LENGTH MISMATCH FOR Q9NYL2 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9NYL2.pdb (Expected: 800, Actual: 287)
LENGTH MISMATCH FOR Q9P1W9 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9P1W9.pdb (Expected: 311, Actual: 249)
LENGTH MISMATCH FOR Q9P289 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9P289.pdb (Expected: 416, Actual: 275)
LENGTH MISMATCH FOR Q9UBF8 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9UBF8.pdb (Expected: 816, Actual: 80)
LENGTH MISMATCH FOR Q9UEE5 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9UEE5.pdb (Expected: 414, Actual: 266)
LENGTH MISMATCH FOR Q9UHD2 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9UHD2.pdb (Expected: 729, Actual: 89)
LENGTH MISMATCH FOR Q9UM73 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9UM73.pdb (Expected: 1620, Actual: 146)
LENGTH MISMATCH FOR Q9UQM7 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9UQM7.pdb (Expected: 478, Actual: 294)
LENGTH MISMATCH FOR Q9Y243 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9Y243.pdb (Expected: 479, Actual: 116)
LENGTH MISMATCH FOR Q9Y478 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9Y478.pdb (Expected: 270, Actual: 425)
LENGTH MISMATCH FOR Q9Y6M4 - /cluster/home/t122995uhn/projects/data/kiba/structures/Q9Y6M4.pdb (Expected: 447, Actual: 298)
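
To get a quick overview of a log like the one above, the lines can be parsed and split by whether the structure is shorter or longer than the dataset sequence. This is only a sketch using the standard library; `parse_mismatches` and `Mismatch` are hypothetical helpers. Most entries above are shorter than expected, and a few (e.g. P48729) are longer.

```python
import re
from typing import List, NamedTuple

# Matches the log format printed by the replication script in this issue.
LOG_LINE = re.compile(
    r"LENGTH MISMATCH FOR (?P<uniprot>\S+) - \S+ "
    r"\(Expected: (?P<expected>\d+), Actual: (?P<actual>\d+)\)"
)


class Mismatch(NamedTuple):
    uniprot: str
    expected: int  # length of the dataset sequence
    actual: int    # length of the sequence read from the PDB file

    @property
    def coverage(self) -> float:
        """Structure length as a fraction of the dataset sequence length."""
        return self.actual / self.expected


def parse_mismatches(log_text: str) -> List[Mismatch]:
    return [Mismatch(m["uniprot"], int(m["expected"]), int(m["actual"]))
            for m in LOG_LINE.finditer(log_text)]


sample = """\
LENGTH MISMATCH FOR O00141 - /cluster/home/t122995uhn/projects/data/kiba/structures/O00141.pdb (Expected: 431, Actual: 284)
LENGTH MISMATCH FOR P48729 - /cluster/home/t122995uhn/projects/data/kiba/structures/P48729.pdb (Expected: 337, Actual: 790)
"""
rows = parse_mismatches(sample)
shorter = [r.uniprot for r in rows if r.actual < r.expected]
longer = [r.uniprot for r in rows if r.actual > r.expected]
print("shorter than expected:", shorter)
print("longer than expected:", longer)
```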

Even some of the AlphaFold structures are mismatched (2 out of the 35 that were non-hits from the UniProt mapping search):

LENGTH MISMATCH FOR P78527 - .../P78527.pdb (Expected: 4128, Actual: 3551) 
LENGTH MISMATCH FOR Q9HBY8 - .../Q9HBY8.pdb (Expected: 427, Actual: 367)

Code to replicate above:


from src.data_processing.downloaders import Downloader
import pandas as pd
import json, shutil, os

root_dir = '/cluster/home/t122995uhn/projects/data/kiba'
save_dir = f'{root_dir}/structures'

unique_prots = json.load(open(f'{root_dir}/proteins.txt', 'r'))
# [...]
# send unique prots to uniprot for structure search

##### Map to PDB structural files
# downloaded from https://www.uniprot.org/id-mapping/bcf1665e2612ea050140888440f39f7df822d780/overview
df = pd.read_csv(f'{root_dir}/kiba_mapping_pdb.tsv', sep='\t')
# getting only the first PDB hit for each unique UniProt ID
df = df.drop_duplicates(subset='From')

# getting missing/unmapped prot ids
missing = [prot_id for prot_id in unique_prots.keys() if prot_id not in df['From'].values]

# %%
##### download pdb files
Downloader.download_PDBs(df['To'].values, save_dir=save_dir)

# retrieve missing structures from AlphaFold:
Downloader.download_predicted_PDBs(missing, save_dir=save_dir)

# NOTE: some UniProt IDs map to the same structure, so we use the df mapping to
# rename each mapped pdb to its UniProt ID file name, copying as necessary

# copying to new uniprot id file names
for i, row in df.iterrows():
    uniprot = row['From']
    pdb = row['To']
    # finding pdb file
    f_in = f'{save_dir}/{pdb}.pdb'
    f_out = f'{save_dir}/{uniprot}.pdb'
    if not os.path.isfile(f_in):
        print('Missing', f_in)
    elif not os.path.isfile(f_out):
        shutil.copy(f_in, f_out)

# removing old pdb files.
for i, row in df.iterrows():
    pdb = row['To']
    f_in = f'{save_dir}/{pdb}.pdb'
    if os.path.isfile(f_in):
        os.remove(f_in)

#%% Some downloaded pdbs dont match the provided input sequence
from src.utils.residue import Chain
# mismatch
mismatch = {}
for uniprot, seq in unique_prots.items():
    if uniprot in missing: continue
    f_in = f'{save_dir}/{uniprot}.pdb'
    if not os.path.isfile(f_in):
        print('Missing', f_in)
    else:
        c = Chain(f_in)
        pdb_sequence = c.getSequence()
        if len(seq) != len(pdb_sequence):
            mismatch[uniprot] = (c, seq, -1)
            print(f'LENGTH MISMATCH FOR {uniprot} - {f_in} (Expected: {len(seq)}, Actual: {len(pdb_sequence)})')
        else:
            # Calculate the number of mismatches
            mismatch_count = sum(1 for i in range(len(seq)) if seq[i] != pdb_sequence[i])

            if mismatch_count > 0:
                mismatch[uniprot] = (c, seq, mismatch_count)
                print(f'MISMATCH FOR {uniprot} - {f_in} (Mismatches: {mismatch_count})')

jyaacoub commented 1 year ago

Checking all pdb files for the correct sequence reduces the number of misses but we still have some that are missing:

Code to download pdb structures

import json, os
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
from tqdm.contrib.concurrent import thread_map  # Use tqdm for multithreading
from src.data_processing.downloaders import Downloader

root_dir = '/cluster/home/t122995uhn/projects/data/kiba_tmp'
save_dir = f'{root_dir}/structures'

# Contains protein sequences mapped to uniprotIDs
unique_prots = json.load(open(f'{root_dir}/proteins.txt', 'r'))
# [...] send to https://www.uniprot.org/id-mapping to get associated pdb files
# this returns a tsv containing all matching pdbs for each unique uniprotID
df = pd.read_csv(f'{root_dir}/kiba_mapping_pdb.tsv', sep='\t')

#%% Downloading pdbs
def download_pdb(pdb_id):
    try:
        Downloader.download_PDBs([pdb_id], save_dir=save_dir, tqdm_disable=True)
        return pdb_id  # Return the downloaded PDB ID for progress tracking
    except Exception as e:
        print(f"Error downloading {pdb_id}: {str(e)}")

# Get unique PDBs
pdbs = df['To'].unique()

# Number of concurrent threads (adjust as needed)
num_threads = 4

# thread_map creates its own thread pool and wraps it with a tqdm progress bar,
# so no separate ThreadPoolExecutor is needed
downloaded_pdbs = thread_map(download_pdb, pdbs, max_workers=num_threads,
                             desc="Downloading PDBs", total=len(pdbs))

print("DONE.")

Code to check and match pdb sequences

#%%
import json, os
import pandas as pd
from src.utils.residue import Chain
from tqdm import tqdm
from prody import parsePDB

root_dir = '/cluster/home/t122995uhn/projects/data/kiba_tmp'
save_dir = f'{root_dir}/structures'
pdb_fp = lambda x: f'{save_dir}/{x}.pdb'

# Contains protein sequences mapped to uniprotIDs
unique_prots = json.load(open(f'{root_dir}/proteins.txt', 'r'))
# [...] send to https://www.uniprot.org/id-mapping to get associated pdb files
# this returns a tsv containing all matching pdbs for each unique uniprotID
df = pd.read_csv(f'{root_dir}/kiba_mapping_pdb.tsv', sep='\t')

#%%
matches = {} # tracks matching sequences to pdb structures
fails = []
for i, row in tqdm(df.iterrows(), desc='Matching pdbs', total=len(df)):
    uniprot = row['From']
    pdb = row['To']
    if uniprot in matches: continue # already matched
    try:
        seq = parsePDB(pdb_fp(pdb), subset='ca').getSequence()
    except Exception as e:
        fails.append((pdb, e))
        # raise Exception(f'Error on {i}, ({uniprot}, {pdb}).') from e
        continue # seq is undefined (or stale from a previous row), so skip
    if seq == unique_prots[uniprot]:
        matches[uniprot] = pdb

#%% Fails are due to empty pdb files
for c, _ in fails:
    os.remove(pdb_fp(c))

Only 1 match :'(

jyaacoub commented 1 year ago

Matching on each chain gives only slightly more matches (61), but still a lot are missing...

Code

for i, row in tqdm(df.iterrows(), desc='Matching pdbs', total=len(df)):
    uniprot = row['From']
    pdb = row['To']
    if uniprot in matches: continue # already matched
    try:
        pdb_s = parsePDB(pdb_fp(pdb), subset='ca')
        seq = pdb_s.getSequence() # includes all chains
    except Exception as e:
        fails.append((pdb, e))
        # raise Exception(f'Error on {i}, ({uniprot}, {pdb}).') from e
        continue # seq is undefined (or stale from a previous row), so skip

    curr_seq = unique_prots[uniprot]

    # match if the sequences are identical or one is contained in the other
    if curr_seq in seq or seq in curr_seq:
        matches[uniprot] = pdb
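
The snippet above compares against the sequence of all chains concatenated. For comparison, a variant that builds one sequence per chain and tests each chain separately might look like the following sketch. It is a simplified stand-in that reads ATOM records by their fixed PDB columns using only the standard library (it does not use ProDy or the repository's `Chain` class); `chain_sequences`, `matching_chain`, and `_ca_line` are hypothetical helpers, and non-standard residues are mapped to `X`.

```python
from typing import Dict, Iterable, List, Optional

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_lines: Iterable[str]) -> Dict[str, str]:
    """Build one sequence per chain from the CA atoms of ATOM records."""
    chains: Dict[str, List[str]] = {}
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # only read the first model of multi-model files
        if not line.startswith("ATOM") or len(line) < 22:
            continue
        if line[12:16].strip() != "CA":
            continue
        if line[16] not in " A":
            continue  # keep only the first alternate location
        residue = THREE_TO_ONE.get(line[17:20], "X")
        chains.setdefault(line[21], []).append(residue)
    return {cid: "".join(res) for cid, res in chains.items()}


def matching_chain(target_seq: str, chains: Dict[str, str]) -> Optional[str]:
    """Return the first chain ID that equals, contains, or is contained in target_seq."""
    for cid, seq in chains.items():
        if seq and (seq in target_seq or target_seq in seq):
            return cid
    return None


def _ca_line(serial: int, res: str, chain: str, num: int) -> str:
    # Builds a minimal CA ATOM record (coordinates omitted) for the demo only.
    return f"ATOM  {serial:>5}  CA  {res} {chain}{num:>4}"


demo = [_ca_line(1, "MET", "A", 1), _ca_line(2, "LYS", "A", 2),
        _ca_line(3, "THR", "A", 3), _ca_line(4, "GLY", "B", 1),
        _ca_line(5, "SER", "B", 2)]
seqs = chain_sequences(demo)
print(seqs, matching_chain("AMKTA", seqs))
```

Note that a substring match only means the chain covers part of the dataset sequence, so a contact map built from it would still not have the full sequence length.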

jyaacoub commented 11 months ago

The solution was to just use high-quality structures generated with AlphaFold!