timkartar / DeepPBS

Geometric deep learning of protein–DNA binding specificity
BSD 3-Clause "New" or "Revised" License

can not process custom pdb file #1

Closed AY-LIANG closed 7 months ago

AY-LIANG commented 7 months ago

I ran the code on Code Ocean. When I replace the input file with my own PDB file, the run fails and no npz file is generated. For example, I used 3hos.pdb (Molecular architecture of the Mos1 paired-end complex: the structural basis of DNA transposition in a eukaryote) as input. The output is as follows.

Processing file 'dna.tmp.pdb'
  G.DT.28             0.122
    total number of nucleotides: 159
    total number of base pairs: 76
    total number of helices: 3
    total number of stems: 4
    total number of non-pairing interactions: 168
    total number of splayed-apart dinucleotides: 1
    total number of internal loops: 1
    total number of non-loop single-stranded segments: 4

Time used: 00:00:00:00
done with cleaning up files.

Time used: 00:00:00:00

......Processing structure #1: <dna_entity_0.inp>......
[i] missing ' P  ' atom : residue name ' DA', chain F, number [  29 ]
[i] missing ' OP1' atom : residue name ' DA', chain F, number [  29 ]
[i] missing ' OP2' atom : residue name ' DA', chain F, number [  29 ]
[i] missing ' P  ' atom : residue name ' DA', chain F, number [  29 ]

Time used: 00:00:00:00

......Processing structure #1: <dna_entity_1.inp>......
This structure has broken O3'[i] to P[i+1] linkages
[i] missing ' P  ' atom : residue name ' DA', chain H, number [  29 ]
[i] missing ' OP1' atom : residue name ' DA', chain H, number [  29 ]
[i] missing ' OP2' atom : residue name ' DA', chain H, number [  29 ]
[i] missing ' P  ' atom : residue name ' DA', chain H, number [  29 ]

Time used: 00:00:00:00

......Processing structure #1: <dna_entity_2.inp>......
[i] missing ' P  ' atom : residue name ' DA', chain D, number [  29 ]
[i] missing ' OP1' atom : residue name ' DA', chain D, number [  29 ]
[i] missing ' OP2' atom : residue name ' DA', chain D, number [  29 ]
[i] missing ' P  ' atom : residue name ' DA', chain D, number [  29 ]

Time used: 00:00:00:00
Helix score: 0.8933333333333335
0 2
3 6
7 10
11 14
15 18
19 22
22 25
27 30
31 34
34 37
38 41
42 45
46 49
50 53
54 57
58 61
62 65
66 69
70 73
74 77
78 81
82 85
86 89
89 92
94 96
Helix score: 0.7597222222222223
0 2
3 6
7 10
11 14
15 18
19 22
22 25
26 29
30 33
33 36
36 39
41 44
48 51
52 55
56 59
60 63
63 66
68 71
71 74
75 78
79 82
83 86
88 91
92 94
Helix score: 1.0
0 2
3 6
7 10
11 14
16 19
20 23
23 26
27 30
30 33
34 37
38 41
42 45
46 49
50 53
54 57
58 61
62 65
66 69
70 73
74 77
79 82
82 85
86 89
90 93
95 97
ERROR: helix count problem 3 3hos.pdb

timkartar commented 7 months ago

Hi! Thanks for bringing this up. As of now, DeepPBS input must contain a single DNA helix. As the output suggests, yours has three. The simplest solution is to create three separate files (using a tool like PyMOL or Biopython), each containing only one helix, and run them separately. Please let me know how that goes.

Update: I created the files for you as an example, see here: https://drive.google.com/drive/folders/1rSg6YV35cfBrQK_aF1Vl-2EPJsxqKVSM?usp=sharing

Example output is here: https://rohslab.usc.edu/deeppbs/link/171180933088

PS: You can also use this webserver (https://rohslab.usc.edu/deeppbs/) now instead of Code Ocean!
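The splitting suggested above can also be done without extra dependencies. What follows is a minimal sketch, not the maintainer's method: it relies only on the fixed-width PDB layout, where the chain ID sits in column 22 of ATOM/HETATM records. The function names and the `groups` argument are hypothetical, and the caller has to decide which protein and DNA chains belong to each helix by inspecting the structure.

```python
from typing import Dict, Iterable, List


def split_pdb_by_chains(pdb_lines: Iterable[str],
                        groups: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return one list of PDB lines per group, keeping only the listed chains."""
    wanted = {name: set(chains) for name, chains in groups.items()}
    out: Dict[str, List[str]] = {name: [] for name in groups}
    for line in pdb_lines:
        # Only coordinate records are copied; headers, TER, CONECT are dropped.
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 22:
            continue
        chain_id = line[21]  # column 22 in the fixed-width PDB format
        for name, chains in wanted.items():
            if chain_id in chains:
                out[name].append(line.rstrip("\n"))
    for lines in out.values():
        lines.append("END")
    return out


def write_split_files(pdb_path: str, groups: Dict[str, Iterable[str]]) -> None:
    """Write <group name>.pdb for every group (hypothetical helper)."""
    with open(pdb_path) as handle:
        split = split_pdb_by_chains(handle, groups)
    for name, lines in split.items():
        with open(f"{name}.pdb", "w") as out_handle:
            out_handle.write("\n".join(lines) + "\n")
```

A call would look like `write_split_files("3hos.pdb", {"helix_1": "...", "helix_2": "...", "helix_3": "..."})`, with each string listing the chain IDs to keep for that helix. The chain groupings are deliberately left open here because the thread does not state them.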

AY-LIANG commented 7 months ago

Thanks for your reply. I tried the separate files and they worked well. I have another PDB file, a protein-DNA docking model generated by HDOCK (http://hdock.phys.hust.edu.cn/). The file marks the 5'/3' ends of the DNA with residue names like DT5/DC3, and I get this error message:

Processing file 'dna.tmp.pdb'
    total number of nucleotides: 176
    total number of base pairs: 88
    total number of helices: 1
    total number of stems: 1
    total number of non-pairing interactions: 178

boundary for lvector(): [1 to 0]

Time used: 00:00:00:00
done with cleaning up files.

Time used: 00:00:00:00
Traceback (most recent call last):
  File "../process_co_crystal.py", line 71, in <module>
    dna_data = processDNA(dna, quiet=False)
  File "/opt/conda/lib/python3.8/site-packages/deeppbs/process_dna.py", line 1547, in processDNA
    n = getNucleotideData(nt, model, D.chem_components)
  File "/opt/conda/lib/python3.8/site-packages/deeppbs/process_dna.py", line 319, in getNucleotideData
    "chemical_name": COMPONENTS[nt["nt_name"].strip()]['_chem_comp.name']
KeyError: 'DC5'

So I removed the terminal information, but I still encounter an error:

Processing file 'dna.tmp.pdb'
    total number of nucleotides: 176
    total number of base pairs: 88
    total number of helices: 1
    total number of stems: 1
    total number of isolated WC/wobble pairs: 2
    total number of non-pairing interactions: 178

boundary for lvector(): [1 to 0]

Time used: 00:00:00:00
done with cleaning up files.

Time used: 00:00:00:00
Traceback (most recent call last):
  File "../process_co_crystal.py", line 71, in <module>
    dna_data = processDNA(dna, quiet=False)
  File "/opt/conda/lib/python3.8/site-packages/deeppbs/process_dna.py", line 1547, in processDNA
    n = getNucleotideData(nt, model, D.chem_components)
  File "/opt/conda/lib/python3.8/site-packages/deeppbs/process_dna.py", line 324, in getNucleotideData
    nucleotide = getNucleotideById(model, nid)
  File "/opt/conda/lib/python3.8/site-packages/deeppbs/process_dna.py", line 156, in getNucleotideById
    return model[ch][rid]
  File "/opt/conda/lib/python3.8/site-packages/Bio/PDB/Entity.py", line 45, in __getitem__
    return self.child_dict[id]
KeyError: ''

Here are my input files: https://drive.google.com/file/d/1pWrLywvxXdc_Auik3GKELmkr4XQajmqb/view?usp=drive_link

https://drive.google.com/file/d/11DwEFWRlGcWXpobQ7o94gNMI3r0YZOgW/view?usp=drive_link
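The `KeyError: 'DC5'` in the first traceback suggests that a residue name carrying a terminal suffix was looked up in a dictionary that only knows the plain names. A quick pre-check can list such residues before a file is submitted. This is a sketch under assumptions: it reads residue name, chain ID, and residue number from the standard fixed PDB columns, and it treats only DA/DC/DG/DT followed by `5` or `3` as terminal-tagged. The function name is hypothetical and is not part of DeepPBS.

```python
from typing import Iterable, List, Tuple

STANDARD_DNA = {"DA", "DC", "DG", "DT"}


def find_terminal_tagged_dna(pdb_lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return sorted unique (chain, residue number, residue name) for names like DC5 or DT3."""
    hits = set()
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 26:
            continue
        resname = line[17:20].strip()   # columns 18-20
        chain_id = line[21]             # column 22
        resseq = line[22:26].strip()    # columns 23-26
        # Flag a standard DNA name followed by a 5' or 3' end marker.
        if resname[:2] in STANDARD_DNA and resname[2:] in ("5", "3"):
            hits.add((chain_id, resseq, resname))
    return sorted(hits)
```

An empty result means no residue names of this specific form were found; it does not guarantee that the file is otherwise well formed.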

timkartar commented 7 months ago

Hi, glad the first one worked out! The next file may not follow the PDB format properly. I am happy to take a look for you, but the Drive links are inaccessible to me. Please make them visible to anyone with the link.

AY-LIANG commented 7 months ago

Sorry for the mistake. Links are available now. https://drive.google.com/drive/folders/1t0lm0iamodCFOLWfX67fj4axKrUkHq-x

timkartar commented 7 months ago

Hello there! Thanks for the update. I went ahead and took a look. There was something weird about the way you did the removal. I wrote a simple Biopython script for you that does the same thing, and it works. Please run this and use the output PDB file.

from Bio.PDB import PDBParser, PDBIO

# Load the first model of the docking result.
parser = PDBParser()
model = parser.get_structure("model_1", "./model_1.pdb")[0]

# Truncate residue names in chain B to two characters (e.g. DC5 -> DC).
for res in model['B'].child_list:
    res.resname = res.resname.strip()[:2]
    print(res, res.resname)

# Write the corrected structure.
io = PDBIO()
io.set_structure(model)
io.save("./fixed.pdb")

Output link: https://rohslab.usc.edu/deeppbs/link/171202305522

You can open both your "model_1_removed_terminal.pdb" and this "fixed.pdb" in PyMOL and compare them in the sequence viewer to see the differences.

PS: The docking of the homeodomains in the structure does not look very good. You may want to refine them.

Let me know if you have any further questions.
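The comparison of the two files suggested above can also be done in text form. The following is a rough sketch, assuming both files are meant to follow the fixed-column PDB layout. It reads the chain ID, residue number field, and residue name from their standard columns and reports every residue whose name differs or which appears in only one file. A file with shifted columns will show up as blank or odd chain IDs and residue numbers in the report. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

ResidueKey = Tuple[str, str]  # (chain ID, residue number field)


def residue_table(pdb_lines: Iterable[str]) -> Dict[ResidueKey, str]:
    """Map each (chain, residue number) to the first residue name seen for it."""
    table: Dict[ResidueKey, str] = {}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 26:
            key = (line[21], line[22:27].strip())  # columns 22 and 23-27
            table.setdefault(key, line[17:20].strip())
    return table


def diff_residues(lines_a: Iterable[str], lines_b: Iterable[str]) -> List[str]:
    """List residues whose name differs between two files ('-' means absent)."""
    a, b = residue_table(lines_a), residue_table(lines_b)
    report = []
    for key in sorted(set(a) | set(b)):
        name_a, name_b = a.get(key, "-"), b.get(key, "-")
        if name_a != name_b:
            report.append(f"chain={key[0]!r} res={key[1]!r}: {name_a} -> {name_b}")
    return report
```

Passing the open file handles of "model_1_removed_terminal.pdb" and "fixed.pdb" to `diff_residues` returns one line per mismatch, which can be easier to scan than a visual comparison when the DNA is long.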

AY-LIANG commented 7 months ago

The script is useful, and now my run completes successfully. The webserver is quite convenient. Thank you very much for your help!

timkartar commented 7 months ago

Great! Thanks for reaching out. Just a note, though: the webserver is still under development, and more news and updates will follow. Closing the issue now.