Closed: adalke closed this 3 years ago
I've finalized the pull request, addressing the issues brought up in this discussion.
Feedback is to go ahead with this and drop all support for the old "fragments" JSON-Lines format.
That's now part of my working branch for the upcoming 3.0 release.
Resolved issues:
Currently the fragment file stores the fragmentation in JSON-Lines format. After an initial header (with version and option lines) comes a sequence of "RECORD" or "IGNORE" lines, each structured as a JSON list. RECORD lines have 10 fields, with the last being a list of fragmentations.
I propose switching the fragment command to save to a SQLite database (proposed extension, ".fragdb"). Analysis in SQL is so much easier than writing one-off Python programs.
I have a pull request implementing this proposal.
Pros and Cons
I see several advantages of using SQLite instead of a flat(ish) file:
To clarify the last point, consider a tool to examine the most common fragments, or to select only a few constants. This sort of tool could be written as a SQL command, rather than reading the data into memory and processing it in Python.
The disadvantages I can think of are:
Maybe there are other downsides?
Default fragment filename
One issue is how to handle the default fragment file. Currently `mmpdb fragment` writes to stdout if no `-o` option is given. That does not work with a SQLite output file. I could require the `-o` option, or write to a default file (like `input.fragdb`).

I have decided that if you do not specify `-o` then the default output is the structure filename, with any extension (and `.gz`) removed and replaced with `.fragdb`. If the structure file is `AB.smi.gz` then the default output fragment database name is `AB.fragdb`. If the structure file is `CD.smi` then the default output name is `CD.fragdb`.
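The rule itself is a few lines of Python. This is just an illustration of the behavior, not the pull request's actual code:

```python
from pathlib import Path

# Illustration of the default-name rule (not the pull request's code):
# strip an optional ".gz", then replace the remaining extension with
# ".fragdb".
def default_fragdb_name(structure_filename):
    path = Path(structure_filename)
    if path.suffix == ".gz":
        path = path.with_suffix("")           # "AB.smi.gz" -> "AB.smi"
    return str(path.with_suffix(".fragdb"))   # "AB.smi"    -> "AB.fragdb"

assert default_fragdb_name("AB.smi.gz") == "AB.fragdb"
assert default_fragdb_name("CD.smi") == "CD.fragdb"
```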
This decision was guided by the need to distribute fragmentations across a cluster (rather than the simple Python-based multiprocess fragmentation used now). In that case, your process will be something like:
- Split the input SMILES file into N parts (perhaps via an `mmpdb smi_split input.smi` command?)
- Fragment each SMILES file (using a fake cluster queue submission)
- Merge the fragments back into a single file:
If the `mmpdb fragment` step used the input SMILES filename to influence how the default output name is determined (in this case, from `input.part001.smi` to `input.part001.fragdb`) then there wouldn't need to be a filename manipulation step here.

Alternative text output
Currently there is the option to display the fragments in "fraginfo" format. This was an undocumented option to display the text in a more human-readable format. It does not appear to be used, as the code still says "Fix this for Python 3". I suspect it can be removed without a problem.
Still, perhaps there is a reason to have a way to view the fragmentations in a more human-readable format? For example:
However, it just doesn't seem useful. It's so much easier to do the query in SQL.
Normalize the fragment SMILES?
Should the SMILES strings in the fragdb database be normalized? (That is, all 23,256 occurrences of `*C.*C` would be normalized to an integer id in a new `smiles` table, and the fragmentation SMILES stored by id reference, rather than storing the actual string.)

I used the ChEMBL_CYP3A4_hERG.smi test set from the original mmpdb development, with 20,267 SMILES strings. Using a denormalized data set (constant and fragment SMILES are stored as-is), the resulting sizes are:
This shows that "fragdb" is slightly more compact than "fragments".
On the other hand, `gzip -9` produces a `ChEMBL_CYP3A4_hERG.fragments.gz` which is 1/13th the size, at 11 MB (153151552/11656549 = 13.1).

A SQL query suggests I can save about 50 MB by normalizing the duplicate fragment SMILES, which is about 40% of the file size.
On the other hand, that estimate doesn't fully include the normalization table, nor does it include the indices which may be needed for possible analyses.
(The `constant_with_H_smiles` and the record's `input_smiles` and `normalized_smiles` have few duplicate values, so should not be normalized.)

I changed the code to normalize the fragments in a `smiles` table and regenerated the data set. The new one is second, with the "hERG2" name:
The resulting size is larger because it contains the SMILES normalization table, and the indexing needed for the UNIQUE constraint. It is still roughly the same size as the uncompressed fragments file, though much larger than the gzip-compressed fragments.
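For concreteness, the normalized layout would look something like the following sketch. The table and column names here are my assumptions, not necessarily what the pull request actually uses:

```python
import sqlite3

# Sketch of the normalized layout, under assumed table and column names.
# Each distinct SMILES string is stored once, and fragmentations refer
# to it by integer id.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE smiles (
    id INTEGER PRIMARY KEY,
    smiles TEXT UNIQUE NOT NULL   -- the UNIQUE constraint needs its own index
);
CREATE TABLE fragmentation (
    id INTEGER PRIMARY KEY,
    record_id INTEGER NOT NULL,
    constant_smiles_id INTEGER REFERENCES smiles (id),
    variable_smiles_id INTEGER REFERENCES smiles (id)
);
""")

# "*C.*C" is stored once; its 23,256 uses become integer references.
con.execute("INSERT OR IGNORE INTO smiles (smiles) VALUES (?)", ("*C.*C",))
(smiles_id,) = con.execute(
    "SELECT id FROM smiles WHERE smiles = ?", ("*C.*C",)).fetchone()
```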
What indices are needed?
There must be an index mapping each fragmentation to its record. I tried a version without that index and `mmpdb index` was obviously slower, even for a test set of only 1,000 SMILES.

The database should be indexed to support the types of analyses which might be done on the fragment data. At present I don't know what these are likely to be. Some likely ones are:
I modified the two datasets to index them by the constant and variable parts. For the denormalized hERG.fragdb I indexed the constant and variable SMILES strings. For the normalized hERG2.fragdb I indexed the constant and variable SMILES ids.
This nearly doubles the database size. This also shows that normalization doesn't affect the database size if both the constants and variables need to be indexed.
It's hard to judge if this increase in size is useful without testing the types of analyses that might be done, so I used the "likely ones" above.
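For reference, the indices discussed above would look something like this, again with assumed table and column names:

```python
import sqlite3

# Indexes discussed above, on an assumed denormalized schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fragmentation (
    id INTEGER PRIMARY KEY,
    record_id INTEGER NOT NULL,
    constant_smiles TEXT,
    variable_smiles TEXT
);

-- map each fragmentation back to its record; without this index,
-- 'mmpdb index' was noticeably slower even on 1,000 SMILES
CREATE INDEX fragmentation_record_idx ON fragmentation (record_id);

-- support analyses that look up by the constant or variable part
-- (in the normalized layout these would index the smiles ids instead)
CREATE INDEX fragmentation_constant_idx ON fragmentation (constant_smiles);
CREATE INDEX fragmentation_variable_idx ON fragmentation (variable_smiles);
""")
```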
Merge multiple data sets into one
This is trivial with the unnormalized version: merge the two sets of tables, and update the ids so they don't overlap. The normalized version is a bit more complex, as the normalization tables must be merged.
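A hedged sketch of the unnormalized merge, reusing the denormalized schema assumed above and two hypothetical files `a.fragdb` and `b.fragdb`:

```python
import sqlite3

# Merge the fragmentations of 'b.fragdb' into 'a.fragdb' (hypothetical
# filenames), assuming the denormalized schema sketched above.  Incoming
# record ids are shifted past the largest id already present so they
# cannot collide.  (A real merge would shift the ids in the record
# table the same way.)
con = sqlite3.connect("a.fragdb")
con.execute("ATTACH DATABASE 'b.fragdb' AS other")

(offset,) = con.execute(
    "SELECT IFNULL(MAX(record_id), 0) FROM fragmentation").fetchone()

con.execute("""
    INSERT INTO fragmentation (record_id, constant_smiles, variable_smiles)
    SELECT record_id + ?, constant_smiles, variable_smiles
      FROM other.fragmentation""", (offset,))
con.commit()
```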
Determine the distribution of constants
The following (in the unnormalized data set) prints the distribution of constant SMILES, ordered by count:
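A query along the following lines does the job (the table and column names are illustrative assumptions about the fragdb schema):

```python
import sqlite3

# Count how often each constant SMILES appears, most common first.
con = sqlite3.connect("ChEMBL_CYP3A4_hERG.fragdb")
for constant_smiles, n in con.execute("""
        SELECT constant_smiles, COUNT(*) AS n
          FROM fragmentation
         GROUP BY constant_smiles
         ORDER BY n DESC"""):
    print(n, constant_smiles)
```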
Given the fragmentations from 20K molecules, this takes a bit over 2 seconds on the unindexed, unnormalized data set.
If the SMILES strings are indexed, but still unnormalized, it takes a bit over 1 second.
If the SMILES strings are normalized and indexed it's 0.66 seconds. That's about 3x faster.
Given that this will likely not be common, I suggest staying with unnormalized strings, and no index. Perhaps there could be an `mmpdb fragdb_index` command to add indices if people want a 2x speedup.

Bear in mind that currently any analysis must do a linear search of the fragments file, decoding the JSON-Lines and building the search structure in memory. This probably takes much longer, though I haven't tested it.
Generate a subset of fragments given the constants
There are a couple of ways I could think of to select a set of fragments: 1) specify the SMILES strings, and 2) request only those in the top-P%, or top N, or those with at least M occurrences.
In both cases, I think the right solution is to make a temporary table containing the selected fragments, then use that to select the subset of fragments and of records containing those fragments. For example, the following selects all records having a fragmentation whose constant SMILES is one of the 10,000 most common.
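A sketch of that approach, with assumed names (the temporary table can live in memory, depending on SQLite's temp store setting):

```python
import sqlite3

# Put the 10,000 most common constants in a temporary table, index it,
# run ANALYZE, then use it to find the matching records.
con = sqlite3.connect("ChEMBL_CYP3A4_hERG.fragdb")
con.executescript("""
CREATE TEMPORARY TABLE selected_constants AS
    SELECT constant_smiles
      FROM fragmentation
     GROUP BY constant_smiles
     ORDER BY COUNT(*) DESC
     LIMIT 10000;
CREATE INDEX temp.selected_constants_idx
    ON selected_constants (constant_smiles);
ANALYZE;
""")
record_ids = [record_id for (record_id,) in con.execute("""
    SELECT DISTINCT record_id
      FROM fragmentation
     WHERE constant_smiles IN
           (SELECT constant_smiles FROM selected_constants)""")]
```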
The `:memory:` + index + analyze approach is slightly faster than doing a straight search on the unindexed database. Note that the above only filters the records, while a full export will also need to export the fragments, which requires another search. That's why I think the temporary index is worthwhile.

Partition based on constants
It looks like this can be done with a similar mechanism: create a table, randomize the order, split into M parts, and save each part to a distinct output database.
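A sketch of the shuffle-and-deal step (names assumed; a real version would then copy each group's records into its own output database):

```python
import sqlite3

# Shuffle the distinct constants, then deal them round-robin into M
# groups.  Table and column names are assumptions.
M = 4
con = sqlite3.connect("ChEMBL_CYP3A4_hERG.fragdb")
con.executescript("""
CREATE TEMPORARY TABLE shuffled AS
    SELECT constant_smiles
      FROM (SELECT DISTINCT constant_smiles FROM fragmentation)
     ORDER BY RANDOM();
""")
for i, (constant_smiles,) in enumerate(
        con.execute("SELECT constant_smiles FROM shuffled")):
    print(i % M, constant_smiles)  # group number, then the constant
```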
Drop support for the old 'fragments' format
The current pull request drops support for the old 'fragments' format, used when the filename ends with `.fragments` or `.fragments.gz`. This simplifies the relevant code by quite a bit, compared to a version which must support both I/O formats. The feedback from Mahendra is that this isn't an issue.
A middle solution would be to only support the old 'fragments' format in the `--cache` option, which would let people upgrade to the new system without having to re-fragment everything. I don't think that's needed, as re-fragmentation, while slow, is doable.