Adding pdb_addter - Githubissues

andrewsb8 commented 3 years ago

Hi,

Wrote a script to add TER lines in user defined ways similar to pdb_delres.py. This would be a good functionality for multiple reasons:

PDB files with multiple proteins which are aggregates typically need termini changed before simulating and adding TER lines at specific intervals will help remove manual editing of PDB files.
It would work as a good preparation step for large PDB files before pdb_chainbows.py.
Large system pdb files (which I uploaded as a test file for this script, 692 residues) which require multiple TER entries (potentially at different intervals) prior to simulation can be made much more easily with a script.

I include pdbtools/pdb_addter.py, tests/test_pdb_addter.py, and tests/data/addter_test.pdb where all 13 unit tests in test_pdb_addter.py pass.

Let me know what you think,

Brian

joaomcteixeira commented 3 years ago

Another example, adds TER before the start of the PDB.

python pdbtools/pdb_addter.py -4::  tests/data/dummy.pdb

REMARK    CHAIN C DOES NOT HAVE ELEMENTS FOR LAST RESIDUE
REMARK    CHAIN C CONTAINS RESIDUES OUT OF ORDER
REMARK    CHAIN D IS NUCLEIC ACID AND LACKS A TER STATEMENT
TER
ATOM      1  N   ARG B   4      36.898  42.175  -2.688  1.00  0.00           N  
ATOM      2  H   ARG B   4      37.673  41.570  -2.916  1.00  0.00           H  
ATOM      3  H2  ARG B   4      35.929  41.470  -2.758  1.00  0.00           H  
ATOM      3  H3  ARG B   4      37.099  42.392  -1.524  1.00  0.00           H  
ATOM      3  CA  ARG B   4      37.080  43.455  -3.421  1.00  0.00           C  
ATOM      3  HA  ARG B   4      38.102  43.960  -3.065  1.00  0.00           H

joaomcteixeira commented 3 years ago

Another comment, The PDB you propose as a test case has about 47k lines. That is too big for the scope of this project. Please look for a minimum case scenario needed to test the new tool. If this PR were accepted, the 47k lines from your PDB would collapse the whole statistics of this repository and that wouldn't be fair. For example, dummy PDB has about 200 lines. Cheers!

joaomcteixeira commented 3 years ago

What about pdb_add_manual_ter? A bit longer, but at least more specific.

That could be a proper name. As @JoaoRodrigues said, addter might be confusing because of pdb_tidy which already does correct TER addition.

andrewsb8 commented 3 years ago

Thanks for the feedback!

The name change makes sense. Having ability to manually place TER records will be helpful ultimately given the other scripts' purposes (based on my understanding). Ex: chainbows requires TER records, tidy requires Chain identifiers. Some tools don't always give Chain IDs (CHARMM-GUI, Gromacs if Chain IDs aren't provided and no TER records already in place, VMD in some cases, etc). So this tool could be complicated but would help complete the suite as some of the tools need one or the other to already exist.

I have some questions and comments as well:

Should I rename my branch? Or is just renaming the script and test sufficient?
I get an error using chainbows as well (on a normal pdb where I did not use pdb_addter.py first). I don't have the error message, but the last commit message for that script is "half way done". So I assumed that was not complete yet and didn't submit an issue.
For tests with nonsequential residue numbers, it might be best to throw an error when that's the case here. Redirect the user to pdb_reres to avoid undesired results and potential confusion.

Otherwise, I will get to those changes, write tests for a smaller pdb file and dummy.pdb (based on your feedback to the above suggestion), and make the names appropriate. Thanks for your time.

Brian

joaomcteixeira commented 3 years ago

Hi,

you can stay on the same branch, no need to do a new branch,

I get an error using chainbows as well (on a normal pdb where I did not use pdb_addter.py first). I don't have the error message, but the last commit message for that script is "half way done". So I assumed that was not complete yet and didn't submit an issue.

i don't understand, can you rephrase? Can it be that chainbow has a bug?

For tests with nonsequential residue numbers, it might be best to throw an error when that's the case here. Redirect the user to pdb_reres to avoid undesired results and potential confusion.

pdb-tools should not raise errors (I think). these are very simple tools that should do the job when the PDB follows (more or less) the right formatting rules. if the PDB is "crazy" or has "errors incompatible with the tool", I am not sure if it should be the responsibility of pdbtools to warn about that. I think we don't do that in any other tool. @JoaoRodrigues ?

Yes please, use the smallest possible reproducible case for PDB tests.

Cheers!

andrewsb8 commented 3 years ago

I'll have to get back to you about pdb_chainbows error as I cannot check today.

If not throwing an error, the function could mimic that of pdb_delres which asserts that the residues are deleted regardless of sequence number. pdb_add_manual_ter is written similarly so having it behave this way would be consistent with previous practices in the repository. Based on the test with dummy.pdb, there would still have to be changes made which shouldn't be too difficult (i.e. not considering the value of the residue number in the pdb file when determining if in the residue range to add a TER).

joaomcteixeira commented 3 years ago

Sounds good so far. Give it a try on the review comments and your considerations. We will review it again after that. Best, and we are very happy to have you interacting with pdb-tools :relaxed:

andrewsb8 commented 3 years ago

Hi all,

I've made some fairly large changes to the branch pdb_addter. I'll outline some of the largest here:

is_integer is changed to return_integer, which now also handles the error message for checking that inputs are correct as mentioned in a comment above. Also saves some lines in check_input.
The script now counts residues without the context of the residue number in the pdb file.
Starting residue range is now required to be 1 or greater because of previous point.
Added make_TER from pdb_tidy to add full TER entries.
Added a block of code that stops adding TER entries when records such as END and CONECT are reached.
Existing TER lines or entries are ignored and TER entries are no longer duplicated.
The test case provided by me is now much shorter lol.
I wrote tests for dummy.pdb to confirm that TER are not added after END/CONECT records and out of order residue numbers do not affect the placement of TER entries.

Let me know what you think or if you run into any issues!

Brian

andrewsb8 commented 2 years ago

Hey all,

I know that it's end of term and holiday season. Just wanted to follow up because I had actually forgotten about this until a minute ago :sweat_smile:. Hope the end of the term goes well for you both and that you enjoy your holiday break!

Brian

joaomcteixeira commented 2 years ago

Hi, @andrewsb8 sorry for the late reply. These have been busy days on all fronts. My apologies. I will review this PR today/tomorrow and let you know :relaxed: stay tuned. :+1: Thanks very much for all your efforts so far :muscle:

andrewsb8 commented 2 years ago

Happy new year! Thanks for the feedback. Have a couple of questions based on it.

The main reason I switched to just counting residues, instead of considering the residue numbering in the file, was largely because of the dummy.pdb file. Say, for example, I wanted to run add_manual_ter -4:7:1 dummy.pdb, I would run into a potential issue because the residue numbering is 4,6,7,1,2,3,5,2. So, I would get a TER entry after the fifth residue but that feels like weird behavior sinces it's not next to 4, 6, and 7 and out of order. Similarly, the option -1:3 would produce TER entries after both residues identified as 2 in that pdb. The latter case could be particularly relevant if you are "combining" pdbs of two or more structures for whatever purpose as it would be likely that residue numbers would repeat (and would be a great use case in tandem with pdb_reres). Both methods definitely have their downsides. What are your thoughts?
Makes perfect sense. I'm working on that right now but that actually lead to noticing something else worth mentioning.
The 'TER' entries added include an entry number but ends up duplicating:

ATOM 51 HB3 ALA X 5 ...... TER 52 ALA X 5 ...... ATOM 52 N ALA X 6

So the 52 is duplicated. I can either leave that number out (so only ATOM, HETATM, etc get numbers) or work to make sure the entry numbers are incremented for the number of added TER entries preceding it. What do you all think? Edited to add: This is actually an interesting issue for dummy.pdb. As the method generating the TER entry increments the atom number and there are many duplicate atom numbers already. An example from dummy.pdb is:

ATOM 3 O ARG B 4 ..... TER 4 ARG B 4 ...... ATOM 3 N GLU B 6

There the second entry is changed but everything after the TER entry does not follow it. Let me know what you think! Thanks again for looking over my work and hope you enjoyed the holiday break!

Brian

joaomcteixeira commented 2 years ago

Hi Brian, thanks for your cheerful words. We had great holidays, we hope you too :-)

All the points you are rising are definitively very tricky and I feel they are getting out of the PDB standard format because of the added complexity. The point you raise for 1) is a good one. Maybe it is best to leave it as "per residue" instead of the actual residue numbers of what you want. Sometimes I think what we if the tool should be add ters at specific residue/chains, for example -A:100. But the question raises: what is the result of passing pdb_tidy after pdb_add_manual_ter? I am starting to have concerns if tools do things outside the PDB format. Maybe what you were looking for is some addition to pdb_tidy? Or some pdb_split* combined with pdb_chain and pdb_tidy that maybe we don't have?

Regarding 3), pdb_tidy corrects for atom numbers. I think this tool should also.

I was rereading again the whole conversation here, definitively @JoaoRodrigues raise a good point. Could you point us to a real-world example (PDBID) and would be your perfect outcome so we can investigate further?

Large system pdb files (which I uploaded as a test file for this script, 692 residues) which require multiple TER entries (potentially at different intervals) prior to simulation can be made much more easily with a script.

Could so send us again these examples? (transfer.sh)

Cheers,

andrewsb8 commented 2 years ago

I will make it so the atom numbers are fixed, no problem!

As for examples of what I'm looking for, I'm rarely working with a single pdb id and sometimes not one at all. I have three examples of some current projects with large systems:

Villin HP36 (1VII) and Actin Complex (1ATN): https://transfer.sh/XeL13h/1VII_1ATN.pdb
GLP-1 (1D0R) and 160 glycyl-alanyl-glycine peptides (structure produced by VMD): https://transfer.sh/wbrDyH/GLP-1_200mMGAG.pdb
The system you asked about. Amyloid Beta 1-42 hexamer next to a bilayer complex (generated by CHARMM-GUI) of DMPC, CHL1 (Cholesterol), and GM1. The bilayer isn't derived from a pdb from the actual Protein Data Bank: https://transfer.sh/Ngi4zA/ABeta_bilayer_sort.pdb

Before finding this project, I've been modifying these files largely manually which has been very time intensive. Being able to organize these pdb files via some combination of pdb_reres, pdb_merge and/or pdb_tidy has been greatly useful. TER entries are particularly important in many simulation use cases for simulation of these systems. Each use case above requires a different number of TER entries at different intervals (3 in first case, 161 in second, and 7 in the third) and chainbows won't help without preexisting TER entries in most cases listed above. This seemed to be the only step I had to still performed manually when creating systems for simulation after beginning to use this package and motivated me to suggest the addition.

andrewsb8 commented 2 years ago

I have updated pdb_add_manual_ter according to the suggestions, namely:

Atom numbers are corrected
Existing TER entries are renumbered, or replaced with correctly formatted TER entries
The tests now have thorough checks for where the TER entries are (whether they existed in the file or were added by pdb_add_manual_ter)

Let me know what you think of my changes or of my prior comment.

Best,

Brian

joaomcteixeira commented 2 years ago

Hi @andrewsb8 , Thanks so much for your clear explanatory messages. I will check the examples you gave us and your code, and let you know. Cheers,

haddocking / pdb-tools

Adding pdb_addter #112