rdkit / rdkit

The official sources for the RDKit library
BSD 3-Clause "New" or "Revised" License

mmCIF parser #2054

Open tdudgeon opened 6 years ago

tdudgeon commented 6 years ago

Hackathon idea, as discussed in the UGM talks: add an mmCIF parser to RDKit.

See also #1584

osmart commented 6 years ago

The PDBeCIF parser (used by the PDBe ccd_utils that Lukas Pravda spoke about at the UGM) is a pure Python parser.

Including mmCIF parsing in the core RDKit codebase would involve using a C++ CIF parser. A very good C++ CIF parser I have used in the past is the RCSB CIFPARSE-OBJ library.

I think it would be useful to look into using CIFPARSE-OBJ to provide the same functionality as the existing old-style PDB format reading/writing in RDKit, and I would be happy to look into this.

What do other people think?

osmart commented 6 years ago

A related but separate issue is providing RDKit users with an easy-to-use procedure for loading ligands from released PDB entries (and/or ideal coordinates from the PDB Chemical Components Definition).

The procedure currently described in the RDKit Cookbook section https://www.rdkit.org/docs/Cookbook.html#d-functionality-in-the-rdkit, which uses PDB-format files and a SMILES string, is not the easiest to use in practice.

Would it be good to get procedures using RESTful API calls working?

@lpravda What do you think?
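
A minimal sketch of what such a RESTful lookup could look like, using only the Python standard library. The URL pattern for RCSB ligand downloads is an assumption here and should be checked against the RCSB file-download documentation; the function names `ccd_sdf_url` and `fetch_ccd_sdf` are hypothetical helpers, not part of RDKit.

```python
from urllib.request import urlopen

# Assumed URL pattern for RCSB ligand downloads (not confirmed in this thread).
RCSB_LIGAND_URL = "https://files.rcsb.org/ligands/download/{comp_id}_{kind}.sdf"


def ccd_sdf_url(comp_id, ideal=True):
    """Build the download URL for a Chemical Component Dictionary entry."""
    comp_id = comp_id.strip().upper()
    if not comp_id.isalnum():
        raise ValueError(f"not a valid component ID: {comp_id!r}")
    kind = "ideal" if ideal else "model"
    return RCSB_LIGAND_URL.format(comp_id=comp_id, kind=kind)


def fetch_ccd_sdf(comp_id, ideal=True, timeout=30.0):
    """Download the SDF text for a component (requires network access)."""
    with urlopen(ccd_sdf_url(comp_id, ideal), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

The text returned by `fetch_ccd_sdf("ATP")` could then be handed to one of RDKit's existing SDF or mol-block readers.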

tdudgeon commented 6 years ago

One of my needs is to be able to:

  1. read a PDB structure (e.g. from pdb or mmCIF format)
  2. strip out the ligand(s) and waters (having some control over the process)
  3. modify the protein e.g. fix problems, adjust protonation state
  4. write out in formats suitable for downstream programs (e.g. in pdb, pdbqt, mol2 formats)

It would be lovely if all of this was possible in RDKit :-)
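
Steps 1, 2 and 4 of the list above can be sketched on plain PDB-format text with the standard library alone. This is an illustration under stated assumptions, not an RDKit feature: `strip_hetero` is a hypothetical helper, it relies on the fixed-column PDB layout (record name in columns 1-6, residue name in columns 18-20), and it leaves any CONECT or ANISOU records untouched.

```python
WATER_NAMES = {"HOH", "WAT", "DOD"}


def strip_hetero(pdb_text, keep_residues=(), keep_waters=False):
    """Return PDB text without HETATM records, except those explicitly kept.

    keep_residues: residue names (e.g. {"ZN"}) whose HETATM records survive.
    keep_waters:   keep water molecules when True.
    """
    keep = {name.upper() for name in keep_residues}
    kept_lines = []
    for line in pdb_text.splitlines():
        if line[:6].strip() == "HETATM":
            res_name = line[17:20].strip().upper()
            is_water = res_name in WATER_NAMES
            if is_water and not keep_waters:
                continue  # drop waters unless asked to keep them
            if not is_water and res_name not in keep:
                continue  # drop ligands, ions, etc. unless whitelisted
        kept_lines.append(line)
    return "\n".join(kept_lines) + "\n"
```

Note that modified residues stored as HETATM (for example MSE) would also be removed unless they are listed in `keep_residues`, which is one reason the "control over the process" in step 2 matters.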

lpravda commented 6 years ago

Today I learnt about the Gemmi project (https://github.com/project-gemmi/gemmi). It is written in C++, so perhaps there is no need to come up with another parser? It is used for parsing mmCIF files in CCP4. Definitely worth looking at, @osmart. What do you think?

kemaeleon commented 6 years ago

The Gemmi project does indeed look very promising. Maybe it would be good at this stage of Gemmi development to try and connect with the project and make sure that the interface between, e.g., a residue in Gemmi and an rdmol object can work well. Trying to re-use small-molecule code to look after protein structures is difficult, and the last thing you want to do is to re-sanitize all residues in, e.g., a ribosome just because the protonation state of one small molecule sitting on it has changed. @osmart @lpravda @tdudgeon what do you think?

lpravda commented 6 years ago

Well, I'm not sure whether RDKit has an internal representation of proteins. I don't think that parsing of macromolecules should be in any way dependent on an rdkit.Chem.Mol object, so probably a new representation is needed.

tdudgeon commented 6 years ago

I'm not sure of the exact details here but I believe there is basic support for bio-polymers in RDKit in that you can label the atoms with chain and residue information, that way making it possible to handle specified residues or chains etc. It would be preferable to stick with the existing RDKit constructs rather than create something new.
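
A minimal sketch of the atom-level labels mentioned above, assuming RDKit is installed. It builds a two-residue peptide with `Chem.MolFromSequence` and reads the chain and residue annotations through `GetPDBResidueInfo()`; the grouping dictionary is only for illustration.

```python
from collections import defaultdict

from rdkit import Chem

# Build a small peptide (Gly-Ala); its atoms carry PDB-style residue info.
mol = Chem.MolFromSequence("GA")

atoms_by_residue = defaultdict(list)
for atom in mol.GetAtoms():
    info = atom.GetPDBResidueInfo()  # None for atoms without annotations
    if info is None:
        continue
    key = (info.GetChainId(), info.GetResidueNumber(), info.GetResidueName().strip())
    atoms_by_residue[key].append(atom.GetIdx())

# Residue names in chain/residue-number order.
residue_names = [name for (_chain, _number, name) in sorted(atoms_by_residue)]
```

Molecules read from PDB files carry the same kind of annotation, and `Chem.SplitMolByPDBResidues` can split such a molecule into one fragment per residue name.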

lpravda commented 6 years ago

By all means, if there is some kind of basic support for macromolecules then it should be used and possibly extended. But I am not aware of any (which obviously does not mean a thing :)). There is a method MolFromFASTA, which returns an rdkit.Chem.Mol object. But I don't think storing small molecules and proteins in the same object type is the way to go.

kemaeleon commented 6 years ago

An rdkit.Chem.Mol object does not make sense for macromolecules: you do not want to express them as SMILES or generate conformers. But it could make sense for residues and ligands to be processed as small molecules by RDKit, if you could make sure that the libraries from, e.g., Gemmi and RDKit are compatible.

Luthaf commented 6 years ago

Hi there! I am the author of another C++ CIF file parser, which you might be interested in: https://github.com/chemfiles/pacif.

It only does the file => std::vector<std::map<std::string, cif::value>> transformation; any interpretation of the data is left to the user of the code. I plan to use it for CIF and mmCIF support in chemfiles, but I have not yet had the time to do so. I am open to changes to the API if needed!

The main advantages of pacif are that it is a header-only library depending only on the standard C++11 library, and that it is BSD-licensed. The main drawbacks are that you still have to extract the data you need from the parsed structures, and that it is relatively recent and might still have some bugs.
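
To make the "file => list of maps" shape concrete, here is a small Python sketch of the same idea. It is not pacif's API and not a complete CIF parser: `tokenize` and `parse_cif` are hypothetical names, multi-line semicolon text fields and save frames are not handled, all values stay strings, and the block name is stored under the made-up key `"data_"`.

```python
import re

# Quoted values end at a quote followed by whitespace or end of line.
_TOKEN = re.compile(r"""'(.*?)'(?=\s|$)|"(.*?)"(?=\s|$)|(#.*)|(\S+)""")


def tokenize(text):
    """Yield (value, is_bare) pairs; comments are dropped."""
    for line in text.splitlines():
        for match in _TOKEN.finditer(line):
            single, double, comment, bare = match.groups()
            if comment is not None:
                break  # rest of the line is a comment
            if bare is not None:
                yield bare, True
            else:
                yield (single if single is not None else double), False


def _is_keyword(token):
    value, bare = token
    lowered = value.lower()
    return bare and (value.startswith("_") or lowered == "loop_"
                     or lowered.startswith("data_"))


def parse_cif(text):
    """Return one dict per data block: tag -> str, or tag -> list for loops."""
    tokens = list(tokenize(text))
    blocks = []
    i = 0
    while i < len(tokens):
        value, bare = tokens[i]
        if bare and value.lower().startswith("data_"):
            blocks.append({"data_": value[5:]})
            i += 1
            continue
        if not blocks or not _is_keyword(tokens[i]):
            raise ValueError(f"unexpected token {value!r}")
        if value.lower() == "loop_":
            i += 1
            tags = []
            while i < len(tokens) and tokens[i][1] and tokens[i][0].startswith("_"):
                tags.append(tokens[i][0])
                i += 1
            values = []
            while i < len(tokens) and not _is_keyword(tokens[i]):
                values.append(tokens[i][0])
                i += 1
            if not tags or len(values) % len(tags) != 0:
                raise ValueError("loop values do not fill whole rows")
            for column, tag in enumerate(tags):
                blocks[-1][tag] = values[column::len(tags)]
        else:  # single tag/value pair
            if i + 1 >= len(tokens):
                raise ValueError(f"missing value for {value!r}")
            blocks[-1][value] = tokens[i + 1][0]
            i += 2
    return blocks
```

As with pacif, interpreting tags such as `_atom_site.Cartn_x` (converting to floats, building atoms) is left to the caller.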

bp-kelley commented 6 years ago

I disagree that one doesn’t want to hold conformations. NMR and X-ray would certainly disagree with that.

This is not to mention non-standard monomers and novel chemical RNA linkers, which can be difficult to express as monomers.

If we take the view that everything is chemistry, then we resolve to atoms anyway. A particularly useful approach is a hierarchical view over a molecule (this is the OpenEye approach). Another is a representation that resolves to a molecule for comparison purposes; this is roughly the HELM and Sugar and Splice approach.

But one fundamental question is what do you want to do with the representation? If you want tautomers/charges/linkages/substructure searches you may as well have a Mol backend.

If you want to mutate base pairs or amino acids, there are other suitable approaches. Note that I’m biased by X-ray/NMR, where a mol with a monomer hierarchy view is quite useful.
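
As a rough illustration of a monomer hierarchy view over a flat atom list, here is a pure-Python sketch. The `AtomRecord` tuple and `build_hierarchy` function are hypothetical stand-ins, not RDKit or OpenEye API; the point is only that the view stores atom indices, so the flat atom list (the "Mol") remains the single source of truth.

```python
from collections import namedtuple

# Stand-in for an atom in a flat molecule; field names are illustrative only.
AtomRecord = namedtuple("AtomRecord", "index name residue_name residue_number chain_id")


def build_hierarchy(atoms):
    """Group atoms as {chain_id: {(residue_number, residue_name): [atom indices]}}."""
    hierarchy = {}
    for atom in atoms:
        residues = hierarchy.setdefault(atom.chain_id, {})
        key = (atom.residue_number, atom.residue_name)
        residues.setdefault(key, []).append(atom.index)
    return hierarchy


atoms = [
    AtomRecord(0, "N", "GLY", 1, "A"),
    AtomRecord(1, "CA", "GLY", 1, "A"),
    AtomRecord(2, "N", "ALA", 2, "A"),
    AtomRecord(3, "C1", "LIG", 301, "B"),
]
view = build_hierarchy(atoms)
```

Here `view["A"][(1, "GLY")]` gives the atom indices of the first residue, so an operation on one monomer can be limited to those atoms without touching the rest of the molecule.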

Brian Kelley

wojdyr commented 6 years ago

Hi, Gemmi developer here. I just came across this discussion. I often hear about RDKit because it's used in various CCP4 projects. Actually, when we started Gemmi, Paul Emsley was advocating a BSD license to be compatible with RDKit, but in the end the Mozilla Public License was chosen (as for Eigen3).

If you'd like to use Gemmi I'd be happy to help.

@tdudgeon: Gemmi could do points 1 and 2 (and 4 for the PDB format) out of the 4 points that you listed. If it would be useful to export a macromolecule to some of the existing RDKit data structures for further processing, maybe we can sort it out. I haven't used RDKit myself, but a couple of days ago I did a perhaps similar exercise: I spent many hours figuring out how to use Gemmi together with the Boost Graph Library to calculate a maximum common subgraph.

pschmidtke commented 3 years ago

What was the outcome of this discussion, @greglandrum? Were any of these approaches ever integrated in the end?

greglandrum commented 3 years ago

@pschmidtke: I'm not aware of any work that has been done on this one yet.

pschmidtke commented 3 years ago

OK, thanks. Open Babel it is for now, then ;)