Closed sethaxen closed 7 years ago
Perhaps a good route would be to rename this repo to e3fp-utils or e3fp-sea, which we'll keep private, then separate out a standalone e3fp module that only covers core fingerprint-generation concepts like:
Then we can move the core module into its own new to-be-public repo called e3fp, and have e3fp-utils depend on it? Things like crossvalidation are currently more specific to our own internal testing & paper prep, and could always be moved over later.
*I'd suggest keeping the conformer generation code with the core e3fp, so that anyone using the final e3fp public repo can immediately use it to go all the way from SMILES/SDF/etc to usable fingerprints.
I think these are good suggestions. I've been keeping analysis scripts in a repo for the paper (currently a "rotation" repo), but crossvalidation probably belongs there. I agree, conformer generation is too useful to be separated. If I understand correctly, everything necessary to put E3FP to immediate use should be in the E3FP repo, and everything necessary only for replicating the analysis in the paper should be in a separate repo.
What do you mean by structural back-pointers?
Also, we should probably discuss at some point soon what should be public at time of publication and what shouldn't be. I envisioned all code used for analysis would be available at time of publication, but it sounds like key pieces like crossvalidation would not be?
Agreed on the separation of e3fp vs paper-analysis repos. I see no reason why paper-analysis couldn't also be public. If crossvalidation lived there, we'd just want to scrub the SEA parts. So maybe we're really talking the following repo structure:
As for structural back-pointers, I was just thinking of an example script for the on-disk dictionary/datafiles that store which part of the substructure a particular bit (or count) index refers to. E.g., say we find that index 2563 is almost always on for compounds binding to a target of interest: what substructural pattern did that index correspond to, and could it be written out in a standard PyMOL-readable format?
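A minimal sketch of what such a back-pointer table could look like. It assumes that sparse identifiers are folded into a fixed-length fingerprint by taking the identifier modulo the number of bits; the function name `build_backpointers` and the input layout are hypothetical and not part of the E3FP code.

```python
from collections import defaultdict


def build_backpointers(identifier_to_atoms, num_bits=4096):
    """Map each folded bit index to the (identifier, atom ids) pairs that set it.

    identifier_to_atoms: dict of int identifier -> set of atom ids
    (a hypothetical layout chosen for this sketch).
    Folding by modulo is an assumption, not a confirmed E3FP detail.
    """
    backpointers = defaultdict(list)
    for identifier, atom_ids in identifier_to_atoms.items():
        bit = identifier % num_bits
        backpointers[bit].append((identifier, sorted(atom_ids)))
    return dict(backpointers)


# Two identifiers collide on bit 2563 after folding; both are kept.
example = {2563: {0, 1, 2}, 2563 + 4096: {4, 5}, 17: {3}}
table = build_backpointers(example)
```

Keeping a list per bit matters because folding can send several distinct identifiers, and therefore several distinct substructures, to the same index.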
I think this repo structure makes sense. As I'm writing analysis scripts, I'll work on splitting them out into a better organized analysis repo.
For back-pointers, because this information is already generated and discarded by the `Fingerprinter`, that's probably the most sensible place to put it. This can (sort of) already be done with the `store_identifiers_map` options in `Fingerprinter`, but this is neither extensively tested nor that useful, as it just produces atom indices.
One way to do this would be to create a `Substructure` class that is instantiated with other `Substructure`s and is aware of their bond orders, etc. (substructures are currently `set`s of internal atom ids). During stereoscopic identifier assignment, the coordinates can be rotated to correspond to the axes generated in that assignment. This class could then have an option for writing out to a pdb, mol, whatever's most useful. I don't think this would be too complicated, but it would require some restructuring. This is worth its own feature request.
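A rough sketch of the kind of class described above, assuming a substructure only needs to track a center atom, its member atom ids, and bonds with their orders. The names `Substructure.from_parts` and `new_bonds` are hypothetical, and the coordinate rotation and file output are left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Substructure:
    """Hypothetical container: atom ids plus bonds as (atom_a, atom_b, order)."""

    center_atom: int
    atoms: frozenset = field(default_factory=frozenset)
    bonds: frozenset = field(default_factory=frozenset)

    @classmethod
    def from_parts(cls, center_atom, parts, new_bonds=()):
        """Merge existing substructures and add the bonds that join them."""
        atoms = {center_atom}
        bonds = set()
        for part in parts:
            atoms |= part.atoms
            bonds |= part.bonds
        for a, b, order in new_bonds:
            atoms.update((a, b))
            # Store each bond with the smaller atom id first to avoid duplicates.
            bonds.add((min(a, b), max(a, b), order))
        return cls(center_atom, frozenset(atoms), frozenset(bonds))


# Single-atom substructures, then a shell around atom 0 built from them.
center = Substructure.from_parts(0, [])
n1 = Substructure.from_parts(1, [])
n2 = Substructure.from_parts(2, [])
shell = Substructure.from_parts(0, [center, n1, n2], new_bonds=[(1, 0, 1), (0, 2, 2)])
```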
Hi @mjke, regarding the structural back-pointers, perhaps you could elaborate a bit more on how you'd envision them being used.
E.g., one way of doing it would be to have a PDB file for each identifier, with multiple models, each corresponding to a specific substructure from a specific conformer of a specific molecule. This has the advantage of easy reference, but is very inefficient for storage.
Another more space efficient way of doing it would be to have some sort of database mapping identifiers to molecule, conformer, center atom id, and 4 neighbor atom ids. This has the disadvantage of requiring more steps before being able to actually visualize the substructure.
Or perhaps the specific way you're thinking it will be used will reveal a better approach?
@sdaxen perhaps the second approach would make the most sense, paired with a ready-to-run script that could convert any single or inclusive set of indices into a single PDB/MOL file as needed? (Here I am using 'index' to mean the particular bit or count index in the final fingerprint that can be traced back to its matching substructural pattern--i.e., center atom id and 4 neighbors--within the appropriate conformer. Is that what you meant by identifier?)
@elcaceres what do you think?
Our initial use case would be in neural nets (e.g., keiserlab/neural-nets#2), where we'd like to train on E3FP fingerprints then trace back from any given unit in the network to the index or set of indices, at the input-feature level, that are originally causing that unit to fire. It'd be somewhat analogous to the approach in Riniker & Landrum, J Cheminform, 2013.
@mjke @sdaxen I am a fan of having some sort of database mapping IDs to atom IDs and a fan of this PDB/MOL file backtrace. At this point, I don't know how time consuming the generation of features will be, but I suspect (perhaps incorrectly) that grabbing a feature from a bit will not be the rate limiting factor.
As a curiosity, would it be possible to also keep the issues tracked before release private?
@elcaceres good question; per earlier discussion in this issue thread, my understanding is that we're planning to rename this repo to something like e3fp-lab-tools, and keep it private. Then we'll create a new public e3fp repo, which will not contain this repo's code revision or issue history.
Generating PDB files is pretty simple. I already have code for this, just not committed yet. Unfortunately, PDB files don't have great support for bond order, at least not as far as visualization is concerned. I'll check if mol or mol2 will work better.
For my own reference, I'm putting this here: Each instance of each bit for each conformer for each molecule would be stored in the database (could be many instances). Probably the most space-efficient way is to store enough info to recreate the substructure from the conformer, but not e.g. the coordinates themselves. Substructures are stored as center atom id, set of neighbor atom ids, radius, and 4x4 transformation matrix (or 2 quaternions; for aligning shells according to the axes determined in stereo mode). Database should probably have fast look-up by bit or by mol/conf. Need to look into options that don't require any user setup.
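One option that needs no user setup is SQLite, which ships with Python. The following is only a sketch of the layout described above; the table name, column names, and helper functions are hypothetical, and the 4x4 transform is assumed to be stored as 16 row-major doubles.

```python
import sqlite3
import struct

IDENTITY = (1.0, 0.0, 0.0, 0.0,
            0.0, 1.0, 0.0, 0.0,
            0.0, 0.0, 1.0, 0.0,
            0.0, 0.0, 0.0, 1.0)


def create_backpointer_db(path=":memory:"):
    """Create the table plus indices for look-up by bit or by mol/conf."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS substructure (
            bit INTEGER NOT NULL,
            mol TEXT NOT NULL,
            conf INTEGER NOT NULL,
            center_atom INTEGER NOT NULL,
            neighbor_atoms TEXT NOT NULL,  -- comma-separated atom ids
            radius REAL NOT NULL,
            transform BLOB NOT NULL        -- 16 float64 values, row-major 4x4
        );
        CREATE INDEX IF NOT EXISTS idx_bit ON substructure (bit);
        CREATE INDEX IF NOT EXISTS idx_mol_conf ON substructure (mol, conf);
    """)
    return conn


def add_instance(conn, bit, mol, conf, center_atom, neighbor_atoms, radius,
                 transform=IDENTITY):
    """Store one occurrence of a bit in one conformer of one molecule."""
    neighbors = ",".join(str(a) for a in sorted(neighbor_atoms))
    conn.execute(
        "INSERT INTO substructure VALUES (?, ?, ?, ?, ?, ?, ?)",
        (bit, mol, conf, center_atom, neighbors, radius,
         struct.pack("<16d", *transform)),
    )


def instances_for_bit(conn, bit):
    """Return every stored occurrence of a bit, ordered by molecule and conformer."""
    rows = conn.execute(
        "SELECT mol, conf, center_atom, neighbor_atoms, radius, transform "
        "FROM substructure WHERE bit = ? ORDER BY mol, conf", (bit,))
    return [
        (mol, conf, center, [int(a) for a in neighbors.split(",") if a],
         radius, list(struct.unpack("<16d", blob)))
        for mol, conf, center, neighbors, radius, blob in rows
    ]
```

Coordinates are deliberately not stored: the atom ids and transform are meant to be enough to rebuild the substructure from the original conformer.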
@sdaxen yes agreed that mol/2 might better encode bond order. SDF is another widely used format (e.g., see rdkit's SDWriter). Also agreed on your implementation notes.
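To illustrate why MOL/SDF is a better fit than PDB here, this is a minimal pure-Python sketch that writes a V2000 MOL block with explicit bond orders. The function name `to_mol_block` and its input layout are hypothetical; in practice a toolkit writer such as rdkit's SDWriter would likely be used instead.

```python
def to_mol_block(atoms, bonds, name="substructure"):
    """Build a V2000 MOL block.

    atoms: list of (element, x, y, z)
    bonds: list of (i, j, order) using 0-based atom indices
    """
    # Header block: molecule name, program/info line, comment line.
    lines = [name, "  sketch", ""]
    # Counts line: atom count, bond count, unused fields, version tag.
    lines.append("%3d%3d" % (len(atoms), len(bonds)) + "  0" * 8 + "999 V2000")
    for element, x, y, z in atoms:
        lines.append("%10.4f%10.4f%10.4f %-3s 0" % (x, y, z, element) + "  0" * 11)
    for i, j, order in bonds:
        # MOL files index atoms from 1; the third field is the bond order.
        lines.append("%3d%3d%3d  0" % (i + 1, j + 1, order))
    lines.append("M  END")
    return "\n".join(lines) + "\n"


# A carbonyl fragment: carbon double-bonded to oxygen.
block = to_mol_block([("C", 0.0, 0.0, 0.0), ("O", 1.2, 0.0, 0.0)], [(0, 1, 2)])
```

Appending a `$$$$` line after each block would turn a series of these into an SDF file.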
Partially addressed with b7301b2. Only `config` and `examples` still have any SEA dependency: the former because it includes some defaults for the `crossvalidation` submodule, which is now in the `e3fp-paper` repo, and the latter because the examples write to SEA molecules files. #11 will provide an alternative space-efficient way to store these fingerprints, and then the examples can be updated accordingly.
Fixed with 9c1077b.
Examples and data using SEA searching and output file formats are still present.
Fixed with 48aae5d
Before release, no SEA code should be present in this repo. I'm not certain of the best place to put it, but it shouldn't be in the version history here either. A few ways to handle this off the top of my head:
- `sea_utils` in other modules and remove all imports
- `sea_utils` can be kept in a different repo (or private version of the repo? Not certain what features GitHub provides for this).