New tools for protein, nucleic, carbo, ligands

haddocking / pdb-tools

A dependency-free cross-platform swiss army knife for PDB files.

https://haddocking.github.io/pdb-tools/

Apache License 2.0

New tools for protein, nucleic, carbo, ligands #105

Closed joaomcteixeira closed 3 years ago

joaomcteixeira commented 3 years ago

Talking with @brianjimenez

We thought there could be four new tools to directly select protein, nucleic acid, carbohydrate, and ligand atoms, regardless of the chain where they sit. We could use the residue names defined for HADDOCK to perform the selection. For ligands, we could use the information in the Protein Data Bank; I already have files for that, and we could also select them by negation (whatever is not protein, nucleic acid, or carbohydrate).

These new tools could be named:

pdb_selprotein
pdb_selnucleic
pdb_selcarbo
pdb_selligands

Likewise, we could have the del version.
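
To make the proposal concrete, here is a minimal sketch of how a residue-name-based `sel`/`del` pair could share one filter. The names `filter_by_resname` and `PROTEIN_RESNAMES` are hypothetical and not part of pdb-tools, and the residue list is only the 20 standard amino acids, not the HADDOCK library.

```python
# Illustrative subset only: the 20 standard amino acids. A real tool would
# need a longer, curated list of residue names.
PROTEIN_RESNAMES = frozenset({
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
})

# Records that carry a residue name in columns 18-20.
COORD_RECORDS = ("ATOM", "HETATM", "ANISOU")


def filter_by_resname(lines, resnames, invert=False):
    """Yield PDB lines, keeping (or, if invert is True, dropping) residues
    whose name is in `resnames`, regardless of chain.

    Only coordinate records are tested; every other record (including TER)
    is passed through unchanged.
    """
    for line in lines:
        if line.startswith(COORD_RECORDS):
            matches = line[17:20].strip() in resnames
            keep = matches != invert
            if not keep:
                continue
        yield line
```

With this shape, a `sel` script would call `filter_by_resname(fh, PROTEIN_RESNAMES)` and the matching `del` script would pass `invert=True`, so each pair of tools differs by a single argument.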

I am very mindful of the one-script-one-job philosophy, yet I think these scripts could enhance user experience without breaking the original philosophy.

What are your thoughts?

amjjbonvin commented 3 years ago

We thought there could be four new tools to directly select for protein, nucleic, and carbohydrates, regardless of the chain where they sit. We could use the residue name identified for HADDOCK https://wenmr.science.uu.nl/haddock2.4/library to perform the selection. For ligands, we could use the information in the PDB data bank; I already have files for that, and we could also do by negation.

This is a bit tricky as both ligands and glycans are labelled as HETATM. So you would really have to select based on residue names (i.e. a pre-defined list).

joaomcteixeira commented 3 years ago

Yes, that is exactly what we thought. Because piping this logic isn't straightforward with the current tools, we thought of implementing these dedicated scripts based exactly on residue names for protein, nucleic, carbo, and ligands. Carbo and ligands may be tricky, as a carbo can be a ligand, but for the rest it could be a good solution.

amjjbonvin commented 3 years ago

Also tricky are modified amino acids, e.g. MSE (selenomethionine), which should not be filtered out as a ligand. Those scripts might get very HADDOCK-specific …
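
The HETATM ambiguity raised in this thread can be sketched as an ordered lookup: modified amino acids and glycans are claimed first from pre-defined lists, and only what is left counts as a ligand. Everything below is an assumption for illustration: `classify_hetatm` is a hypothetical helper, and the three tiny sets are examples, not the HADDOCK library.

```python
# Small, illustrative sets only; a real tool would need curated lists.
MODIFIED_AMINO_ACIDS = frozenset({"MSE"})  # selenomethionine
CARBOHYDRATES = frozenset({"NAG", "MAN", "BMA", "FUC", "GAL"})
SOLVENT = frozenset({"HOH"})


def classify_hetatm(line):
    """Return a category for a HETATM line, or None for any other record."""
    if not line.startswith("HETATM"):
        return None
    resname = line[17:20].strip()
    # Order matters: claim known residue names before falling through.
    if resname in MODIFIED_AMINO_ACIDS:
        return "protein"
    if resname in CARBOHYDRATES:
        return "carbo"
    if resname in SOLVENT:
        return "solvent"
    # Selection by negation: anything not claimed above is treated as a ligand.
    return "ligand"
```

The fall-through is where the approach is fragile: any modified residue or sugar missing from the lists is silently reported as a ligand.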

JoaoRodrigues commented 3 years ago

I have the same concern as Alex. Residue names are very much defined by the force field/software. For instance, you can have CYS or CYH, depending on whether they are bonded or not. Ligands are even worse. We could use the Ligand Expo table to screen for ligands, but that's a fairly large table (easily a few times larger than the entire pdb-tools codebase) and it's also not foolproof. The simplest would be a selprotein tool, but even that will have a lot of corner cases where it won't work. We could have one that satisfies, say, 99.99% of the use cases, but that's pretty much the only one where we can have such a good success rate without a lot of work.
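
One way to gauge how many corner cases a fixed residue-name list would hit on a given file is to count the residues it does not recognise. This is only a sketch under stated assumptions: `unknown_resnames` is a hypothetical helper, not an existing pdb-tools script, and the `known` set is whatever list the caller chooses.

```python
from collections import Counter


def unknown_resnames(lines, known):
    """Count distinct residues whose name is not in `known`.

    A residue is identified by (name, chain, residue number + insertion
    code), so multi-atom residues are counted once.
    """
    seen = set()
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            resname = line[17:20].strip()
            if resname not in known:
                # Column 22 is the chain ID; columns 23-27 hold the residue
                # number and insertion code.
                seen.add((resname, line[21], line[22:27]))
    return Counter(resname for resname, _, _ in seen)
```

Run against a file that uses force-field-specific names, a list containing only CYS would report every CYH residue, which makes the naming mismatch visible before any atoms are silently dropped.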

amjjbonvin commented 3 years ago

Such scripts could eventually be made HADDOCK-specific and put in the haddock-tools repo.

joaomcteixeira commented 3 years ago

Perfect. I will look forward to it.