Usage on non-protein structures.

andrewtarzia commented 3 years ago

This is a naive question - but does Caviar rely on any of the chemical properties of the protein/biomolecule? Or is it just element and coordinates that are important to the calculation? I would be interested in using something like this on structures I have generated that have PDB files.

jr-marchand commented 3 years ago

Hi Andrew,

Short answer: current implementation yes, but it should be straightforward to hack it.

Longer answer:

The parser by default works only on standard protein residues. It relies on a slightly modified ProDy PDB/CIF file parser, mostly in order to read information from the header for filtering.

However, the cavity detection routine itself just takes as input coordinates and element types, cf this function (file with atom sizes here). The calculation of descriptors happens after the detection of cavities and relies on (standard) protein atom and residue to assign pharmacophore types. This could also be disabled quite easily (here).

It depends on what you want to do, but you would need to play a bit with the code in order to adapt it to your needs.

Cheers, JR

andrewtarzia commented 3 years ago

Hey JR,

Thank you for the response. Ideally, I want to find cavities for a molecule (nothing like a protein) and would be interested in defining "pharmacophores" in the future but this would have to be done from scratch.

Focussing on the cavity detection and "output":

Is it appropriate to access find_protein_points_variablevolume by itself?
Is the grid generation algorithm within caviar and generalisable?

I am very interested in forking and having a go at this - I believe all I need is the input for that cavity detection and I would use Caviar more as a library.

Thank you,

Andrew

jr-marchand commented 3 years ago

It depends what you mean with molecule and cavity, but you may have to optimize a bit the cavity detection parameters if the cavities are very small. The current default set is aimed at finding drug-binding like binding sites in proteins. There's no reason it wouldn't work.

Is it appropriate to access find_protein_points_variablevolume by itself?

It may be, you just have to be careful with the input. It takes 4 arguments: gridpoints (simple grid object, defined here), protselection (cf hereafter), file_sizes (if you have additional elements to CNOS, you will need to add the vdw radii there) and size_probe (to define the solvent accessible surface).

"protselection" is what you want to hack. It is a selection object of ProDy type, but it does not matter. The loop defined in line 79 loops over each atomic element with radii indicated in the file file_sizes (essentially CNOS for proteins). For each element, it finds the coordinate of protein atoms corresponding to element (line 83 new_sel_coord = new_sel.getCoords()). And then it works on 3D coordinates. You can hack it by looping over elements of your molecule and inputing 3D coordinates sets in line 83. A coordinate set is a np.array of shape (n, 3), where n is your number of coordinates, eg:

[[ 23.052 7.401 -13.903] [ 23.052 7.401 -12.903] [ 23.052 7.401 -11.903] ... [ 79.052 46.401 23.097] [ 79.052 46.401 24.097] [ 79.052 46.401 25.097]]

2.Is the grid generation algorithm within caviar and generalisable?

... Yes ? I am not sure I understand the question. It takes min and max x,y,z coodinates, adds n Angstroms (2A by default) and generates a regular grid of defined spacing between grid points (1A by default). For performance (only), I implemented a function to filter out grid points further away than n Angstroms (5A by default). Therefore, the grid roughly follows the shape of the protein/molecule instead of being a large cubic box with many grid points very far of the molecule.

Good luck! It should be reasonably easy to use CAVIAR as a library (at least I hope, that's one thing I kept in mind while creating this platform). Don't hesitate to ask if some pieces of code aren't clear!

andrewtarzia commented 3 years ago

Good news is that I have hacked it into working shape (cavity is points, coloured by their cavity ID, molecule in cyan):

cavity_eg

Do you have a guide for parameter optimisation to be able to separate the internal and external 'cavity'?

jr-marchand commented 3 years ago

Awesome!

What parameters are you right now using? It's a bit curious that you get cavity points on the outer surface.

Considering the shape of your molecule, I would first deactivate all trimming routines (commenting this line should be enough). They exist mostly to prevent linkage of adjacent cavities, I don't think that's a risk here. The most important parameter to play with to tune the detection of cavities is radius_cube (default = 4). For each "solvent" grid point, the algorithm scans in the 14 cubic directions what are the types of the adjacent grid points. If it finds a protein/molecule grid point, it stops scanning that direction and increments a counter by 1. This counter can go from 0 to 14. 0 means the grid point is not in contact with any protein/molecule grid point in any of the 14 directions, 14 means that it is fully surrounded by protein/molecule atoms. The parameter radius_cube defines how far does the scan in the 14 directions goes. It counts in grid spacings, so the default of 4 scans up to 4 grid points in each direction. The other key parameter there is min_burial, which sets the minimum number of protein/molecule grid points to cross in order to define a cavity grid point (default = 8). (By the way defaults in function are not necessarily consistent with the optimized defaults for proteins, which are parsed from here)

Now if you make these two parameters more stringent (eg min_burial = 9 or even 10) you'll get only cavity grid points that are more buried. They should be only inside the molecule. However, this picture makes me a bit skeptical, I don't really understand what is possibly going on, which would push the detection as cavity grid points points on the outer surface with the default parameters (with min_burial = 8 a cavity grid point has to be surrounded in more than 50% of its volume by protein/molecule grid points; this does not seem to happen on the convex side of your example).

For reference, I also have routines working with the convex hull of your object, but they are not used in the code as they performed less good in my test sets. They could be relevant for your molecules, though.

jr-marchand / caviar

Usage on non-protein structures. #6