See #4 for comment thread regarding structural backpointers. Parts relevant to implementation are included below:
@mjke wrote:
As for structural back-pointers, I was thinking of an example script for the on-disk dictionary/data files that store which part of the substructure a particular bit (or count) index refers to. E.g., say we find that index 2563 is almost always on for compounds binding to a target of interest: what substructural pattern did that correspond to, and could it be written out in a standard PyMOL-readable format?
@sdaxen wrote:
Each instance of each bit for each conformer for each molecule would be stored in the database (there could be many instances). Probably the most space-efficient way is to store enough info to recreate the substructure from the conformer, but not, e.g., the coordinates themselves. Substructures are stored as a center atom id, a set of neighbor atom ids, a radius, and a 4x4 transformation matrix (or 2 quaternions) for aligning shells according to the axes determined in stereo mode. The database should probably have fast look-up by bit or by mol/conf. Need to look into options that don't require any user setup.
Today, @mjke and I discussed ultimately implementing an SQLite database with three tables. The first table maps a mol name and conformer id to an index (unsigned, unfolded, but could easily be folded) in that fingerprint. A second table contains multiple rows mapping that index in that conformer to sub-indices (the indices that were hashed to make that index). A third table maps sub-indices to specific atom ids. This should enable fast querying for all substructures that match bit indices and should enable differentiation between colliding bits. A series of helper functions will exist for querying the database so the user never has to deal with SQL.
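To make the three-table layout concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and the helpers `create_db`, `add_bit`, and `substructures_for_bit` are all hypothetical, and linking the tables through surrogate row ids is just one assumed reading of the description above, not a settled design.

```python
import sqlite3

# Hypothetical schema: names and keys are assumptions for illustration only.
SCHEMA = """
CREATE TABLE fingerprint_bits (        -- table 1: mol/conformer -> bit index
    bit_id    INTEGER PRIMARY KEY,
    mol_name  TEXT    NOT NULL,
    conf_id   INTEGER NOT NULL,
    bit_index INTEGER NOT NULL         -- unfolded index; SQLite ints are 64-bit
);
CREATE TABLE sub_indices (             -- table 2: bit -> hashed sub-indices
    sub_id    INTEGER PRIMARY KEY,
    bit_id    INTEGER NOT NULL REFERENCES fingerprint_bits(bit_id),
    sub_index INTEGER NOT NULL
);
CREATE TABLE sub_index_atoms (         -- table 3: sub-index -> atom ids
    sub_id  INTEGER NOT NULL REFERENCES sub_indices(sub_id),
    atom_id INTEGER NOT NULL
);
CREATE INDEX idx_bits_by_index ON fingerprint_bits(bit_index);
CREATE INDEX idx_bits_by_conf  ON fingerprint_bits(mol_name, conf_id);
"""


def create_db(path=":memory:"):
    """Open (or create) the database and set up the three tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_bit(conn, mol_name, conf_id, bit_index, sub_to_atoms):
    """Store one bit instance; sub_to_atoms maps sub-index -> atom ids."""
    cur = conn.execute(
        "INSERT INTO fingerprint_bits (mol_name, conf_id, bit_index) "
        "VALUES (?, ?, ?)",
        (mol_name, conf_id, bit_index),
    )
    bit_id = cur.lastrowid
    for sub_index, atom_ids in sub_to_atoms.items():
        cur = conn.execute(
            "INSERT INTO sub_indices (bit_id, sub_index) VALUES (?, ?)",
            (bit_id, sub_index),
        )
        sub_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO sub_index_atoms (sub_id, atom_id) VALUES (?, ?)",
            [(sub_id, atom_id) for atom_id in atom_ids],
        )
    conn.commit()


def substructures_for_bit(conn, bit_index):
    """Return {(mol_name, conf_id, sub_index): [atom ids]} for one bit index."""
    rows = conn.execute(
        "SELECT b.mol_name, b.conf_id, s.sub_index, a.atom_id "
        "FROM fingerprint_bits AS b "
        "JOIN sub_indices AS s ON s.bit_id = b.bit_id "
        "JOIN sub_index_atoms AS a ON a.sub_id = s.sub_id "
        "WHERE b.bit_index = ? "
        "ORDER BY b.mol_name, b.conf_id, s.sub_index, a.atom_id",
        (bit_index,),
    ).fetchall()
    result = {}
    for mol_name, conf_id, sub_index, atom_id in rows:
        result.setdefault((mol_name, conf_id, sub_index), []).append(atom_id)
    return result
```

Because the result is keyed by sub-index as well as by mol and conformer, two substructures that collide on the same bit index stay distinguishable, and a wrapper like `substructures_for_bit` keeps the SQL out of the user's hands.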
For now, to give @elcaceres something to work with, a quick-and-dirty implementation will be a pandas DataFrame that maps mol name, conf_id, and index to: a sorted tuple of child indices, a tuple of the corresponding atom ids in the same order, a sorted tuple of the other atom ids that are in the substructure but not explicitly within the radius, a radius, and a quaternion for the transformation. For convenience, a helper function will be provided for writing one of these rows to a transformed PDB (or mol2 or sdf) file given the mol and the row, but no other helper functions will be provided until the SQL implementation.
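As a rough illustration of what the PDB-writing helper might do, the sketch below rotates the atoms of one substructure by a quaternion and formats them as PDB `HETATM` records, using only the standard library. Everything here is an assumption rather than the planned implementation: the function names, the `(w, x, y, z)` quaternion order, translating the center atom to the origin before rotating, the placeholder residue name `LIG`, and passing coordinates and elements as plain dicts instead of a mol object.

```python
import math


def quaternion_to_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return (
        (1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)),
        (2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)),
        (2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)),
    )


def transform(coord, center, rot):
    """Shift coord so center is at the origin, then apply the rotation."""
    v = [c - o for c, o in zip(coord, center)]
    return tuple(sum(rot[i][j] * v[j] for j in range(3)) for i in range(3))


def substructure_to_pdb(coords, elements, atom_ids, quaternion, center_id):
    """Return PDB text for the given atoms, transformed by the quaternion.

    coords and elements are dicts keyed by atom id (hypothetical inputs).
    Atom names are simplified to the element symbol.
    """
    rot = quaternion_to_matrix(quaternion)
    center = coords[center_id]
    lines = []
    for serial, atom_id in enumerate(atom_ids, start=1):
        x, y, z = transform(coords[atom_id], center, rot)
        element = elements[atom_id]
        # Fixed-width PDB columns: coordinates occupy columns 31-54,
        # the element symbol columns 77-78.
        lines.append(
            f"HETATM{serial:5d} {element:<4s} LIG A   1    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}"
            f"          {element:>2s}"
        )
    lines.append("END")
    return "\n".join(lines) + "\n"
```

Given a row, the atom ids would presumably come from the row's atom-id tuples and the quaternion from its transformation column; the resulting text can be written to a `.pdb` file and opened in PyMOL.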
To-do:
[ ] Implement Pandas dataframe-based backpointers
[ ] Add convenience function for writing to PDB
[ ] Implement SQLite-based backpointers
[ ] Add convenience functions for querying database