lightdock / lightdock-rust

A Rust implementation of the LightDock macromolecular docking software
https://lightdock.org/
GNU General Public License v3.0
26 stars 4 forks source link

DDNA function #10

Closed stianale closed 10 months ago

stianale commented 1 year ago

Hi,

Are you considering implementing the DDNA scoring function in the Rust version? I want to try different scoring schemes; I have way too many analyses to do for running the non-Rust version of Lightdock.

brianjimenez commented 1 year ago

Unfortunately, I don't have time at the moment to re-implement it on the Rust code. Have you tried using dna scoring function? You will need to add protonation using pdbfixer as we do in the LightDock web server, but that will work better than ddna scoring for sure.

stianale commented 1 year ago

I understand.

I have used the dna function, yes (following the protein-DNA tutorial for your software), and it works, apart from having to remove all "OXT", "HO5'" and "HO3'" atoms, since I could not find any AMBER94 equivalents for these. But I just don't get the expected results, so I want to look into the ddna function. My situation is that I have orthologous proteins from 11 bacterial species, predicted using Colabfold, and dsDNA for 11 different 16-mer strings (so 32 nt in the helix) predicted using both 3D-Dart and 3d-DNA. I am implementing restraints both on receptor and ligand.

The thing is that we know, experimentally, that each of the proteins bind stronger to one of the 11 DNA strings and we know which each species prefer. Thus, I want to dock each of the 11 DNAs to each of the orthologous proteins. What I thus expected, is to get better score for the docked complexes where the protein binds its preferred DNA. While I do get high scores and percentage/fraction satisfied restraints for the structures ranked by filtering in the end (often above 0.8), my results so far do not align with the expected bias toward native protein and preferred DNA. This means that there is uncertainty either in the predicted protein structures, DNAs or the docking itself. I suspect that the problem might lie within the predicted proteins, as there is research indicating that docking on AlphaFold structures often yield inaccurate results.

Maybe I have misunderstood something though, and that my expectation of higher scores for the native protein/DNAs docked is unfounded.

brianjimenez commented 1 year ago

Thanks for explaining the context of your system, it is a very interesting project. DDNA is a statistical potential scoring function, designed to re-score poses, but not to drive a docking simulation as LightDock performs/requires, that is the reason behind I did not re-implement it on the Rust version. Maybe it would be better to offload this conversation to email and share some LightDock simulations using dna scoring function to see if I could be of some help.

JorgeRoel commented 1 year ago

Hi @stianale. This is a tricky issue indeed. First, it is important to keep in mind that all docking algorithms have been designed to predict a putative binding pose and therefore, you always get solutions even if the molecules do not interact in reality. I think that your approach is theoretically correct in the sense that if you make the "intra-protein" comparisons (i.e. same protein agains the 11 DNAs). If you start mixing proteins and DNAs the scoring looses sense completely. However, when you specify restraints, the scoring is also biased according to the information (besides sampling), depending on the percentage of satisfied restraints (new_score = old_score + percentage_satisfied_restraints*old_score) so I believe that the new_score is no longer representative of the "binding" for your specific purposes. You can try several approaches here: 1 - Use your actual simulations and rescore your best complexes with bin/lgd_calculate_scoring.py in order to get rid out of the restraint bias 2 - You can perform blind docking (which was shown for DNA to work quite well) and use your exp. info to filter out the complexes satisfying that info. After filtering, you can compare the scores. By doing this you check whether (1) there is an unbiased preference towards the right complex if the algorithm is able to recapitulate the correct interface and (2) the scores are more favorable.

stianale commented 1 year ago

Thanks for both of the quick responses.

I found possible clues. It is believed that two disulfide bonds are crucial to the binding for these proteins to their DNA. I saw that cysteines involved in such bonds must be specified in the pdb as "CYX", not CYS. I did not do that. Also, I tried your approach 1., JorgeRoel, but not sure if I do it correctly, but when (which for me makes no sense) run as lgd_calculate_scoring.py dna lightdock_protein.pdb lightdock_dna.pdb: "[NotSupportedInScoringError] Residue A.HIS.117 or atom HE2 not supported. DNA scoring only supports AMBER94 types."

I did see that HIS must be specified as HIP in amber force fields. I am not sure why I did not get this error when running the simulations, however.

I also tried running lgd_calculate_scoring.py ddna on one of the best models (lgd_calculate_scoring.py ddna swarm_242_146.pdb), but then the error is "usage: calculate_scoring [-h] scoring_function receptor ligand calculate_scoring: error: the following arguments are required: ligand"

I have no idea how to do approach 2., JorgeRoel, I will have to look into that.

JorgeRoel commented 1 year ago

For lgd_calculate_scoring.py, you need to have receptor and ligand in separate files: lgd_calculate_scoring.py dna swarm_242_146_protein.pdb swarm_242_146_dna.pdb

For approach 2. you'd need to rerun all your lightdock simulations but in blind mode (i.e. without specifiying restraints during setup). Once the simulations are done, you can perform your analyses as if you used restraints (generation, clustering and filtering using a restraints.list with your info). After that, you'll get your "filtered" folder with your models satisfying your restraints (depending on the cutoff) and you can then see if the correct pair actually has a better score and recapitulates the interface like you wish.

stianale commented 1 year ago

I am sorry for taking up your time on this.

But I see, before using lgd_calculate_scoring.py I just split the swarm_242_146.pdb into those two separate files files.

I just realized that the hydrogen atoms for the DNA in the predicted docked complex are not renumbered. I wonder if this makes the simulation ignore the atoms dubbed "0":

image

brianjimenez commented 1 year ago

No worries on the numeration for hydrogens, it is ignored by LightDock. For using lgd_calculate_scoring.py you will need to rename HIS to HID, CYS to CYX and probably some other hydrogens. This is like this since when LightDock generates structures predicted, it tries to use the original PDB residues despite internally makes these changes.

stianale commented 1 year ago

Changing to CYX and HID did not work for lgd_calculate_scoring.py, nor when changing "ATOM" to "HETATM". Also tried HIP and HIE for histidine.

"[pdb] WARNING: Possible problem: [ResidueNonStandardError] Can not check non-standard residue HID.46 [pdb] WARNING: Possible problem: [ResidueNonStandardError] Can not check non-standard residue CYX.48 [pdb] WARNING: Possible problem: [ResidueNonStandardError] Can not check non-standard residue CYX.90 [pdb] WARNING: Possible problem: [ResidueNonStandardError] Can not check non-standard residue CYX.97 [pdb] WARNING: Possible problem: [ResidueNonStandardError] Can not check non-standard residue CYX.113 [pdb] WARNING: Possible problem: [ResidueNonStandardError] Can not check non-standard residue HID.117"

brianjimenez commented 1 year ago

These are actual warnings, not errors. Did you get any error at the end of the execution?

stianale commented 1 year ago

For HID, yes:

_Traceback (most recent call last): File "/home/stian/miniconda3/lib/python3.10/site-packages/lightdock/scoring/dna/driver.py", line 91, in _get_docking_model raise e File "/home/stian/miniconda3/lib/python3.10/site-packages/lightdock/scoring/dna/driver.py", line 83, in _get_docking_model atom.amber_type = amber.amber_types[atom_id] KeyError: 'HID-HE2'

But for HIP it actually worked, so I will score and see.

stianale commented 10 months ago

Hi again, and thanks for the help so far.

The protein I simulate is part of a type IV pilin complex, which is a transmembrane protein. While my protein is located at the and not near the phospholipid layer, should I still do membrane docking? If so, can membrane docking be combined with DNA docking in Lightdock?

brianjimenez commented 10 months ago

Hi @stianale . If you think the extra information coming from the phospholipid layer may help you to block some sampling regions on the receptor surface, I would use the LightDock membrane docking protocol. Otherwise, you can always filter out predictions depending on regions of the surface which are not compatible with the membrane. At the moment, the only scoring function compatible with the membrane protocol supported is fastdfire.