Samplepaths command never concludes

wlawler45 commented 2 years ago

I am trying to run the samplePaths command as described in the examples, but it does not seem to complete no matter how low I set the number of paths. I ran the following command:

./samplePaths --targetPDB ../example/input_files/5U1C_dimer.pdb --seedBin ../example/1_generateSeeds/output/extendedfragments.bin --seedGraph /home/williamubuntu/peptide_design/example/3_buildSeedGraph/5U1C_seedgraph.adj --numPaths 500 --config ../example/input_files/singlechainDB.configfile --base 5U1C --writeTopology

It ran for a few days with no results. After changing numPaths to 5 it still did not complete; the console output is attached as sample_paths.zip. Any advice on what might be going wrong would be appreciated. Thank you.

swanss commented 2 years ago

Hi,

It looks like paths are being sampled, but they do not meet the acceptance criteria.

"Sampled path is the same as initially selected seed" means that the first seed that was selected had no overlaps with any other seed.
It also looks like you're getting paths that are shorter than 15 residues.

I would think a little bit about your goal and then adjust the parameters accordingly:

Do you have a binding site of interest? If so I would focus on generating seeds only around residues close to that site.
Are you getting sufficient overlaps between seeds to sample diverse paths? If not, generating more seeds or relaxing the overlap criteria can help you find more overlaps at the cost of time/quality
Do you need a 15-residue peptide? In the paper we chose that cutoff to show that it's possible to sample peptide binder backbone structures of that length, but it may be harder to find paths of that length for other target proteins.
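
To make the two rejection messages above concrete, here is a minimal toy sketch of path sampling over a seed-overlap graph. It is not the samplePaths implementation: the graph layout, the seed length of 7, the overlap of 3, and all function names are assumptions chosen only for illustration. Only the 15-residue cutoff comes from the discussion above.

```python
import random

MIN_LENGTH = 15  # residue cutoff mentioned in the thread


def sample_path(overlaps, start, rng, max_seeds=10):
    """Random walk: repeatedly hop to an unvisited seed overlapping the current one."""
    path = [start]
    while len(path) < max_seeds:
        candidates = [s for s in overlaps.get(path[-1], []) if s not in path]
        if not candidates:
            break
        path.append(rng.choice(candidates))
    return path


def fused_length(path, seed_length, overlap_length):
    """Residues in the fused path: each extra seed adds only its non-overlapping part."""
    return seed_length + (len(path) - 1) * (seed_length - overlap_length)


def classify(path, seed_length=7, overlap_length=3):
    """Apply the two toy acceptance checks (seed/overlap lengths are assumed values)."""
    if len(path) == 1:
        return "same as initial seed"
    if fused_length(path, seed_length, overlap_length) < MIN_LENGTH:
        return "too short"
    return "accepted"


# Hypothetical overlap graph: seed name -> seeds it overlaps with.
overlaps = {
    "A": ["B"], "B": ["A", "C"], "C": ["B"],  # chain of three seeds
    "D": [],                                  # isolated seed, no overlaps
    "E": ["F"], "F": ["E"],                   # a single overlapping pair
}
rng = random.Random(0)
results = {s: classify(sample_path(overlaps, s, rng)) for s in overlaps}
for seed, outcome in results.items():
    print(seed, outcome)
```

In this toy graph only walks that start at an end of the three-seed chain reach 15 residues, which mirrors the point above: with too few overlaps, almost every sampled path is rejected no matter how small numPaths is.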

swanss commented 2 years ago

By the way, I just peeked back at the structure you sent in the other issue thread (5U1C_dimer) and noticed that it's not the same as the structure in the PDB and that its B-factor values are consistent with being pLDDT scores from an AlphaFold model. If possible, I would highly suggest focusing your designs on the region of the structure that is indicated to have high confidence (pLDDT > 90). A design like the fused path you sent will have a lot going against it: limited hydrophobic contact surface area and a highly dynamic binding site. All of that is even assuming this dimeric state occurs in solution. Is there experimental evidence for it?
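
As a rough way to find the high-confidence region, the sketch below lists residues whose C-alpha B-factor exceeds 90, assuming the B-factor column holds pLDDT as suggested above. It relies only on the standard fixed-column PDB ATOM record layout; the helper names and the example records are made up for illustration.

```python
def atom_line(serial, name, resname, chain, resnum, x, y, z, b_factor):
    """Build a fixed-column PDB ATOM record (used here only to make test data)."""
    return ("ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
            % (serial, name, resname, chain, resnum, x, y, z, 1.00, b_factor))


def high_confidence_residues(pdb_lines, cutoff=90.0):
    """Return sorted (chain, residue number) pairs whose CA B-factor exceeds cutoff."""
    selected = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":   # atom name, columns 13-16
            continue
        chain = line[21]                  # chain ID, column 22
        resnum = int(line[22:26])         # residue number, columns 23-26
        b_factor = float(line[60:66])     # B-factor (pLDDT here), columns 61-66
        if b_factor > cutoff:
            selected.add((chain, resnum))
    return sorted(selected)


# Made-up records: residues 10 and 12 are confident, residue 11 is not.
example = [
    atom_line(1, "CA", "ALA", "A", 10, 0.0, 0.0, 0.0, 95.2),
    atom_line(2, "N", "GLY", "A", 11, 1.0, 0.0, 0.0, 99.0),
    atom_line(3, "CA", "GLY", "A", 11, 3.8, 0.0, 0.0, 62.0),
    atom_line(4, "CA", "SER", "A", 12, 7.6, 0.0, 0.0, 91.0),
]
confident = high_confidence_residues(example)
print(confident)
```

For a real file, pass the lines of the PDB file (for example `open(path).read().splitlines()`) instead of the made-up records.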

wlawler45 commented 2 years ago

I did target certain residues in that protein. I used this initial command: ./generateSeeds --targetPDB ../example/input_files/5U1C_dimer.pdb --paramsFile ../example/1_generateSeeds/genSeeds-HIV-IN.params --targetSel "resid 115-153" --peptideChainID 'A'. Would this not do that?

swanss commented 2 years ago

Yes, that should only generate seeds around residues 115-153. Could you please check the output file to verify that's what happened? How many did you generate per residue?

wlawler45 commented 2 years ago

Do you mean check extendedfragments.bin?

wlawler45 commented 2 years ago

It seems to have generated a great many seeds, ~90 for each residue? (Attached: extendedfragmentsinfo.zip)

swanss commented 2 years ago

I meant the standard output from when you ran the job, but this file works too! It looks like some residues have as many as ~500 seeds. I know this seems like a lot, but the space of possible > 5 residue binding structures is very large, so this is a relatively small sample of it. If I recall correctly, we searched for up to 5,000 matches for each binding site fragment when designing binders of TRAF6. If I were focusing on a specific binding site, I would probably bump that up to at least ~10,000. This will make the subsequent steps slower, but will increase the odds that you find a good backbone. Do you have access to a cluster, or do you need to run this locally?

wlawler45 commented 2 years ago

I'm running locally. Unfortunately, my cluster has a 6-hour time limit that this will exceed.

wlawler45 commented 2 years ago

Would it help if I narrowed the selection of residues further? There are really only 3 residues in the whole protein that I'm interested in, but they are spread out across the range I gave in that command.

swanss commented 2 years ago

Ah okay, so I would do (resid i or resid j or resid k) around d, where d is the maximum distance in angstroms at which you consider neighboring residues (I would start with d = 10.0 or so and increase if necessary). It's important to include more than just the 3 residues since you will likely want seeds that can interact with nearby residues to get enough binding energy.
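
To preview which residues a selection like this would pull in for a given d, here is a minimal residue-level sketch. It approximates each residue by a single point (for example its C-alpha position) and keeps residues within d angstroms of any residue of interest. The real selection language may evaluate distances per atom, so treat this only as an approximation; the function name and coordinates are hypothetical.

```python
import math


def residues_around(coords, centers, d):
    """Return residue numbers within d angstroms of any center residue.

    coords maps residue number -> (x, y, z), e.g. CA positions (an assumption);
    the centers themselves are included because their distance is zero.
    """
    selected = set()
    for resnum, xyz in coords.items():
        if any(math.dist(xyz, coords[c]) <= d for c in centers):
            selected.add(resnum)
    return sorted(selected)


# Made-up coordinates: residues 60-70 on a straight line, 3.8 angstroms apart.
coords = {i: (3.8 * (i - 60), 0.0, 0.0) for i in range(60, 71)}

near_10 = residues_around(coords, centers=[64], d=10.0)
near_12 = residues_around(coords, centers=[64], d=12.0)
print(near_10)
print(near_12)
```

Increasing d from 10.0 to 12.0 adds one more residue on each side in this toy chain, which is the kind of trade-off to check before regenerating seeds.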

swanss commented 2 years ago

For most steps (with the exception of dTERMen) you should be able to run the jobs in less than 6 hours. Even the slower jobs, like findOverlaps, can be broken up into array jobs that are individually short enough to run on your cluster.
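
The idea of splitting a long job into array tasks can be sketched as follows. The helper below only computes which slice of the work each task would handle; how a batch is actually passed to findOverlaps is not shown in this thread, so that part is left out. `SLURM_ARRAY_TASK_ID` is the environment variable SLURM sets for array jobs; the item and task counts are placeholder values.

```python
import os


def task_slice(n_items, n_tasks, task_id):
    """Return the [start, stop) range of items handled by one array task.

    Items are split as evenly as possible; the first (n_items % n_tasks)
    tasks each take one extra item.
    """
    base, extra = divmod(n_items, n_tasks)
    start = task_id * base + min(task_id, extra)
    stop = start + base + (1 if task_id < extra else 0)
    return start, stop


# Placeholder sizes; outside an array job the task ID defaults to 0.
N_ITEMS = 10
N_TASKS = 3
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
if 0 <= task_id < N_TASKS:
    start, stop = task_slice(N_ITEMS, N_TASKS, task_id)
    print(f"task {task_id} handles items {start} to {stop - 1}")
```

Each task then runs well under the time limit as long as the number of tasks is chosen so that one slice finishes in time.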

wlawler45 commented 2 years ago

Okay, let me try to do this on the cluster then. I appreciate your help again.

swanss commented 2 years ago

Of course! I'm available here if you have more questions about the code and would also be happy to set up a zoom call to chat about your specific design problem if you'd like.

wlawler45 commented 2 years ago

I appreciate the offer of a Zoom call. If I have more significant trouble I will let you know and we can set something up. Just to clarify, should the generateSeeds command be as follows? I'm getting an error saying fragment_type not recognized.

srun -N 1 -n 10 -t 360 ./generateSeeds --targetPDB ../example/input_files/5U1C_dimer.pdb --paramsFile ../example/1_generateSeeds/genSeeds.params --targetSel "(resid 64 or resid 116 or resid 152) around 10" --peptideChainID 'A'

wlawler45 commented 2 years ago

Never mind, that issue was caused by EOL conversion: I had to download the library to Windows and then move it onto the UNIX cluster.
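
For anyone hitting the same problem, a minimal sketch of converting Windows (CRLF) line endings to UNIX (LF) in a text file such as a params file is shown below. The file contents in the demo are placeholders, not real parameter names or values.

```python
import os
import tempfile


def to_unix_newlines(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite the file in place if it contains CRLF; return True if it changed."""
    with open(path, "rb") as fh:
        data = fh.read()
    fixed = to_unix_newlines(data)
    if fixed != data:
        with open(path, "wb") as fh:
            fh.write(fixed)
    return fixed != data


# Demo on a temporary file with placeholder contents.
with tempfile.NamedTemporaryFile(delete=False, suffix=".params") as tmp:
    tmp.write(b"key_one value_one\r\nkey_two value_two\r\n")
    demo_path = tmp.name

changed = convert_file(demo_path)
with open(demo_path, "rb") as fh:
    converted = fh.read()
os.remove(demo_path)
print(changed, converted)
```

Reading and writing in binary mode matters here, because text mode may translate line endings on its own and hide the problem.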

wlawler45 commented 2 years ago

Okay, so I ran the commands up to run_samplepaths.sh on the cluster. The other commands took on the order of minutes to complete, but run_samplepaths.sh took the whole 6 hours, failed to finish, and kept reporting that the path was too short. Is there any information I could send you that would help identify the issue? I would also like to schedule a Zoom call with you sometime this week if possible. Thank you for any assistance you can give me with this.

swanss commented 2 years ago

It's very likely that there are not enough overlaps between the seeds in the graph. The easiest way to confirm this would be for you to send me the standard output file from buildSeedGraph, which reports how many edges in the graph are between residues in the same seed vs. residues in distinct seeds.
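
The distinction between the two kinds of edges can be illustrated with a small sketch. The actual seed graph file format and the buildSeedGraph output are not shown in this thread, so the node naming below ("seed:residue_index") and the edge list are purely hypothetical.

```python
def count_edges(edges):
    """Count edges within one seed versus edges linking two distinct seeds.

    Each node is a string "seed:residue_index" (a hypothetical naming scheme).
    """
    same_seed = 0
    distinct_seeds = 0
    for node_a, node_b in edges:
        if node_a.split(":")[0] == node_b.split(":")[0]:
            same_seed += 1
        else:
            distinct_seeds += 1
    return same_seed, distinct_seeds


# Made-up graph: two seeds joined by a single overlap edge.
edges = [
    ("s1:0", "s1:1"),
    ("s1:1", "s1:2"),
    ("s2:0", "s2:1"),
    ("s1:2", "s2:0"),
]
same_seed, distinct_seeds = count_edges(edges)
print(same_seed, distinct_seeds)
```

If the count of edges between distinct seeds is small relative to the number of seeds, most random walks cannot leave their starting seed, which would explain the repeated "too short" messages.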

swanss commented 2 years ago

Shoot me an email at swans@mit.edu and we can set up a call!

swanss / peptide_design

Samplepaths command never concludes #13