Closed: adalke closed this 3 years ago
I've finalized the pull request, addressing the issues brought up in this discussion.
Feedback is to go ahead with this and drop all support for the old "fragments" JSON-Lines format.
That's now part of my working branch for the upcoming 3.0 release.
Resolved issues:
Currently the fragment file stores the fragmentation in JSON-Lines format. After an initial header (with version and option lines) comes a sequence of "RECORD" or "IGNORE" lines, each structured as a JSON list. RECORD lines have 10 fields, with the last being a list of fragmentations.
I propose switching the fragment command to save to a SQLite database (proposed extension, ".fragdb"). Analysis in SQL is so much easier than writing one-off Python programs.
I have a pull request implementing this proposal.
Pros and Cons
I see several advantages of using SQLite instead of a flat(ish) file:
To clarify the last point, consider a tool to examine the most common fragments, or to select only a few constants. This sort of tool could be written as a SQL command, rather than reading the data into memory and processing it in Python.
The disadvantages I can think of are:
Maybe there are other downsides?
Default fragment filename
One issue is how to handle the default fragment file. Currently `mmpdb fragment` writes to stdout if no `-o` option is given. That does not work with a SQLite output file. I could require the `-o` option, or write to a default file (like `input.fragdb`).

I have decided that if you do not specify `-o` then the default output is the structure filename, with any extension (and `.gz`) removed and replaced with `.fragdb`. If the structure file is `AB.smi.gz` then the default output fragment database name is `AB.fragdb`. If the structure file is `CD.smi` then the default output name is `CD.fragdb`.
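The rule itself is a few lines of Python. This is just an illustration of the behavior, not the pull request's actual code:

```python
from pathlib import Path

# Illustration of the default-name rule (not the pull request's code):
# strip an optional ".gz", then replace the remaining extension with
# ".fragdb".
def default_fragdb_name(structure_filename):
    path = Path(structure_filename)
    if path.suffix == ".gz":
        path = path.with_suffix("")           # "AB.smi.gz" -> "AB.smi"
    return str(path.with_suffix(".fragdb"))   # "AB.smi"    -> "AB.fragdb"

assert default_fragdb_name("AB.smi.gz") == "AB.fragdb"
assert default_fragdb_name("CD.smi") == "CD.fragdb"
```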
This decision was guided by the need to distribute fragmentations across a cluster (rather than the simple Python-based multiprocess fragmentation used now). In that case, your process will be something like:
- Split the input SMILES file into N parts (perhaps via an `mmpdb smi_split input.smi` command?)
- Fragment each SMILES file (using a fake cluster queue submission)
- Merge the fragments back into a single file:
If the `mmpdb fragment` step used the input SMILES filename to influence how the default output name is determined (in this case, from `input.part001.smi` to `input.part001.fragdb`) then there wouldn't need to be a filename manipulation step here.

Alternative text output
Currently there is the option to display the fragments in "fraginfo" format. This was an undocumented option to display the text in a more human-readable format. It does not appear to be used, as the code still says "Fix this for Python 3". I suspect it can be removed without a problem.
Still, perhaps there is a reason to have a way to view the fragmentations in a more human-readable format? For example:
However, it just doesn't seem useful. It's so much easier to do the query in SQL.
Normalize the fragment SMILES?
Should the SMILES strings in the fragdb database be normalized? (That is, all 23,256 occurrences of `*C.*C` would be normalized to an integer id in a new `smiles` table, and the fragmentation SMILES stored by id reference, rather than storing the actual string.)

I used the ChEMBL_CYP3A4_hERG.smi test set from the original mmpdb development, with 20,267 SMILES strings. Using a denormalized data set (constant and fragment SMILES are stored as-is), the resulting sizes are:
This shows that "fragdb" is slightly more compact than "fragments".
On the other hand, `gzip -9` produces a `ChEMBL_CYP3A4_hERG.fragments.gz` which is 1/13th the size, at 11 MB (153151552/11656549 = 13.1).

A SQL query suggests I can save about 50 MB by normalizing the duplicate fragment SMILES, which is about 40% of the file size.
On the other hand, that estimate doesn't fully include the normalization table, nor does it include the indices which may be needed for possible analyses.
(The `constant_with_H_smiles` and the record's `input_smiles` and `normalized_smiles` have few duplicate values, so should not be normalized.)

I changed the code to normalize the fragments in a `smiles` table and regenerated the data set. The new one is second, with the "hERG2" name:
The resulting size is larger because it contains the SMILES normalization table, and the indexing needed for the UNIQUE constraint. It is still roughly the same size as the uncompressed fragments file, though much larger than the gzip-compressed fragments.
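For concreteness, the normalized layout would look something like the following sketch. The table and column names here are my assumptions, not necessarily what the pull request actually uses:

```python
import sqlite3

# Sketch of the normalized layout, under assumed table and column names.
# Each distinct SMILES string is stored once, and fragmentations refer
# to it by integer id.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE smiles (
    id INTEGER PRIMARY KEY,
    smiles TEXT UNIQUE NOT NULL   -- the UNIQUE constraint needs its own index
);
CREATE TABLE fragmentation (
    id INTEGER PRIMARY KEY,
    record_id INTEGER NOT NULL,
    constant_smiles_id INTEGER REFERENCES smiles (id),
    variable_smiles_id INTEGER REFERENCES smiles (id)
);
""")

# "*C.*C" is stored once; its 23,256 uses become integer references.
con.execute("INSERT OR IGNORE INTO smiles (smiles) VALUES (?)", ("*C.*C",))
(smiles_id,) = con.execute(
    "SELECT id FROM smiles WHERE smiles = ?", ("*C.*C",)).fetchone()
```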
What indices are needed?
There must be an index mapping each fragmentation to its record. I tried a version without that index and `mmpdb index` was obviously slower, even for a test set of only 1,000 SMILES.

The database should be indexed to support the types of analyses which might be done on the fragment data. At present I don't know what these are likely to be. Some likely ones are:
I modified the two datasets to index them by the constant and variable parts. For the denormalized hERG.fragdb I indexed the constant and variable SMILES strings. For the normalized hERG2.fragdb I indexed the constant and variable SMILES ids.
This nearly doubles the database size. This also shows that normalization doesn't affect the database size if both the constants and variables need to be indexed.
It's hard to judge if this increase in size is useful without testing the types of analyses that might be done, so I used the "likely ones" above.
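For reference, the indices discussed above would look something like this, again with assumed table and column names:

```python
import sqlite3

# Indexes discussed above, on an assumed denormalized schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fragmentation (
    id INTEGER PRIMARY KEY,
    record_id INTEGER NOT NULL,
    constant_smiles TEXT,
    variable_smiles TEXT
);

-- map each fragmentation back to its record; without this index,
-- 'mmpdb index' was noticeably slower even on 1,000 SMILES
CREATE INDEX fragmentation_record_idx ON fragmentation (record_id);

-- support analyses that look up by the constant or variable part
-- (in the normalized layout these would index the smiles ids instead)
CREATE INDEX fragmentation_constant_idx ON fragmentation (constant_smiles);
CREATE INDEX fragmentation_variable_idx ON fragmentation (variable_smiles);
""")
```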
Merge multiple data sets into one
This is trivial with the unnormalized version: merge the two sets of tables, and update the ids so they don't overlap. The normalized version is a bit more complex, as the normalization tables must be merged.
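A hedged sketch of the unnormalized merge, reusing the denormalized schema assumed above and two hypothetical files `a.fragdb` and `b.fragdb`:

```python
import sqlite3

# Merge the fragmentations of 'b.fragdb' into 'a.fragdb' (hypothetical
# filenames), assuming the denormalized schema sketched above.  Incoming
# record ids are shifted past the largest id already present so they
# cannot collide.  (A real merge would shift the ids in the record
# table the same way.)
con = sqlite3.connect("a.fragdb")
con.execute("ATTACH DATABASE 'b.fragdb' AS other")

(offset,) = con.execute(
    "SELECT IFNULL(MAX(record_id), 0) FROM fragmentation").fetchone()

con.execute("""
    INSERT INTO fragmentation (record_id, constant_smiles, variable_smiles)
    SELECT record_id + ?, constant_smiles, variable_smiles
      FROM other.fragmentation""", (offset,))
con.commit()
```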
Determine the distribution of constants
The following (in the unnormalized data set) prints the distribution of constant SMILES, ordered by count:
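A query along the following lines does the job (the table and column names are illustrative assumptions about the fragdb schema):

```python
import sqlite3

# Count how often each constant SMILES appears, most common first.
con = sqlite3.connect("ChEMBL_CYP3A4_hERG.fragdb")
for constant_smiles, n in con.execute("""
        SELECT constant_smiles, COUNT(*) AS n
          FROM fragmentation
         GROUP BY constant_smiles
         ORDER BY n DESC"""):
    print(n, constant_smiles)
```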
Given the fragmentations from 20K molecules, this takes a bit over 2 seconds on the unindexed, unnormalized data set.
If the SMILES strings are indexed, but still unnormalized, it takes a bit over 1 second.
If the SMILES strings are normalized and indexed it's 0.66 seconds. That's about 3x faster.
Given that this will likely not be common, I suggest staying with unnormalized strings, and no index. Perhaps there could be an `mmpdb fragdb_index` command to add indices if people want a 2x speedup.

Bear in mind that currently any analysis must do a linear search of the fragments file, decoding the JSON-Lines and building the search structure in memory. This probably takes much longer, though I haven't tested it.
Generate a subset of fragments given the constants
There are a couple of ways I could think of to select a set of fragments: 1) specify the SMILES strings, and 2) request only those in the top-P%, or top N, or those with at least M occurrences.
In both cases, I think the right solution is to make a temporary table containing the selected fragments, then use that to select the subset of fragments and of records containing those fragments. For example, the following selects all records having a fragmentation whose constant SMILES is one of the 10,000 most common.
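A sketch of that approach, with assumed names (the temporary table can live in memory, depending on SQLite's temp store setting):

```python
import sqlite3

# Put the 10,000 most common constants in a temporary table, index it,
# run ANALYZE, then use it to find the matching records.
con = sqlite3.connect("ChEMBL_CYP3A4_hERG.fragdb")
con.executescript("""
CREATE TEMPORARY TABLE selected_constants AS
    SELECT constant_smiles
      FROM fragmentation
     GROUP BY constant_smiles
     ORDER BY COUNT(*) DESC
     LIMIT 10000;
CREATE INDEX temp.selected_constants_idx
    ON selected_constants (constant_smiles);
ANALYZE;
""")
record_ids = [record_id for (record_id,) in con.execute("""
    SELECT DISTINCT record_id
      FROM fragmentation
     WHERE constant_smiles IN
           (SELECT constant_smiles FROM selected_constants)""")]
```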
The `:memory:` + index + analyze approach is slightly faster than doing a straight search on the unindexed database. Note that the above only filters the records, while a full export will also need to export the fragments, which requires another search. That's why I think the temporary index is worthwhile.

Partition based on constants
It looks like this can be done with a similar mechanism: create a table, randomize the order, split into M parts, and save each part to a distinct output database.
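A sketch of the shuffle-and-deal step (names assumed; a real version would then copy each group's records into its own output database):

```python
import sqlite3

# Shuffle the distinct constants, then deal them round-robin into M
# groups.  Table and column names are assumptions.
M = 4
con = sqlite3.connect("ChEMBL_CYP3A4_hERG.fragdb")
con.executescript("""
CREATE TEMPORARY TABLE shuffled AS
    SELECT constant_smiles
      FROM (SELECT DISTINCT constant_smiles FROM fragmentation)
     ORDER BY RANDOM();
""")
for i, (constant_smiles,) in enumerate(
        con.execute("SELECT constant_smiles FROM shuffled")):
    print(i % M, constant_smiles)  # group number, then the constant
```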
Drop support for the old 'fragments' format
The current pull request drops support for the old 'fragments' format, used when the filename ends with `.fragments` or `.fragments.gz`. This simplifies the relevant code by quite a bit, compared to a version which must support both I/O formats. The feedback from Mahendra is that this isn't an issue.
A middle solution would be to only support the old 'fragments' format in the `--cache` option, which would let people upgrade to the new system without having to re-fragment everything. I don't think that's needed, as re-fragmentation, while slow, is doable.