Database I/O and conformer/fingerprint storage

aparente-nurix commented 5 years ago

I'd like to use e3fp fingerprints on a very large database of molecules (~millions, possibly billions).

I was wondering if you had any benchmarks on speed and conformer/fingerprint storage sizes. What's the largest dataset you've applied this to?

Thanks!

sethaxen commented 5 years ago

The most comprehensive benchmarks we've run with E3FP are in Table S3 and Figure S10 of the supplement of the paper.

The code that ran these benchmarks is here: https://github.com/keiserlab/e3fp-paper/tree/master/project/benchmark.

As you can see, we haven't rigorously benchmarked on more than 308,315 molecules (ChEMBL20). The runtime should scale linearly with database size: when we scaled from 10,000 to 308,315 molecules, E3FP still took on average ~83s and ~0.7s per molecule for conformer generation and fingerprinting, respectively. While the runtime of fingerprinting scales sub-linearly with the number of heavy atoms, conformer generation scales super-linearly with it, so if your database contains very large, flexible molecules (e.g. peptides), conformer generation for these will tend to take a long time and could tie up all of your processors.
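For a rough sense of what linear scaling implies, here is a back-of-the-envelope sketch in Python. The per-molecule times default to the averages quoted above. The function name, the `n_procs` parameter, and the assumption of perfect parallel efficiency are illustrative and not part of E3FP.

```python
def estimate_wall_hours(n_mols, confgen_s=83.0, fprint_s=0.7, n_procs=1):
    """Estimate wall-clock hours to process n_mols molecules.

    Assumes runtime is linear in the number of molecules and that the
    work splits evenly across n_procs processors (an idealization).
    """
    cpu_seconds = n_mols * (confgen_s + fprint_s)
    return cpu_seconds / n_procs / 3600.0


if __name__ == "__main__":
    # Hypothetical scenario: one million molecules on 1000 cores.
    hours = estimate_wall_hours(1_000_000, n_procs=1000)
    print(f"{hours:.1f} wall-clock hours")
```

Because conformer generation dominates the per-molecule cost and is super-linear in heavy-atom count, a database skewed toward large, flexible molecules would likely take longer than this average-based estimate suggests.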

sethaxen commented 5 years ago

Regarding storage sizes, I haven't run any benchmarks in this area. E3FP's default storage approach is described here. Since it's just a light wrapper around a scipy.sparse.csr_matrix, its performance will be limited by that format. On the databases we've used, we've been able to hold the whole database in memory until fingerprinting is completed, at which point we write it to a file. I suspect a database with fingerprints of billions of molecules will exceed the memory of most machines, so a different storage option will probably be necessary, perhaps something like HDF5. I'm happy to take suggestions and pull requests in this area.
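To get a feel for when an in-memory CSR matrix stops fitting, the footprint can be approximated from the three arrays the CSR format stores (values, column indices, row pointers). This is a rough sketch, not a measurement: the function name is made up for illustration, the number of "on" bits per fingerprint is a placeholder you would need to measure on your own data, and it assumes one row per fingerprint.

```python
def csr_bytes(n_fprints, on_bits_per_fp, data_bytes=1, index_bytes=4):
    """Approximate bytes held by a CSR matrix with one row per fingerprint.

    CSR stores one value and one column index per nonzero entry, plus
    one row pointer per row (and one extra pointer at the end).
    """
    nnz = n_fprints * on_bits_per_fp
    return nnz * (data_bytes + index_bytes) + (n_fprints + 1) * index_bytes


if __name__ == "__main__":
    # on_bits_per_fp=100 is a placeholder, not a measured E3FP value.
    # With billions of rows the nonzero count exceeds the int32 range,
    # so 8-byte indices are assumed here.
    total = csr_bytes(10**9, on_bits_per_fp=100, index_bytes=8)
    print(f"{total / 1e9:.0f} GB")
```

If each molecule contributes several conformer fingerprints, `n_fprints` grows accordingly.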

mjke commented 5 years ago

great points. a couple thoughts:

for conformer generation, if speed is a concern, you might consider commercial packages like omega; e3fp doesn't fundamentally rely on our particular choice of confgen tool.
for more flexible storage formats, perhaps n5 or zarr
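One way to avoid holding everything in memory, whatever the final format, is to write fingerprints out in fixed-size chunks. The sketch below is a stand-in for that idea using plain `scipy.sparse` `.npz` shards rather than HDF5, n5, or zarr. It is not E3FP's own I/O code: `write_shards`, the file naming, and the input layout (a list of "on" bit indices per molecule) are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def write_shards(on_bits_per_mol, n_bits, shard_size, out_dir):
    """Write fingerprints to disk as CSR shards of at most shard_size rows.

    on_bits_per_mol: list of lists of "on" bit indices, one per molecule.
    Returns the list of shard file paths, in order.
    """
    paths = []
    for start in range(0, len(on_bits_per_mol), shard_size):
        chunk = on_bits_per_mol[start:start + shard_size]
        # Row pointers: cumulative count of on bits up to each row.
        indptr = np.cumsum([0] + [len(bits) for bits in chunk])
        indices = np.concatenate(
            [np.asarray(bits, dtype=np.int64) for bits in chunk]
        )
        data = np.ones(len(indices), dtype=bool)
        mat = sparse.csr_matrix(
            (data, indices, indptr), shape=(len(chunk), n_bits)
        )
        path = os.path.join(out_dir, f"fprints_{start:012d}.npz")
        sparse.save_npz(path, mat)
        paths.append(path)
    return paths


if __name__ == "__main__":
    toy = [[1, 5], [2], [0, 3, 7], [4], [6]]  # toy data, 8-bit space
    with tempfile.TemporaryDirectory() as tmp:
        shards = write_shards(toy, n_bits=8, shard_size=2, out_dir=tmp)
        second = sparse.load_npz(shards[1])
        print(len(shards), second.shape, second.nnz)
```

In a real pipeline the input would more likely be a generator fed by the fingerprinting step, so that only one shard's worth of fingerprints is in memory at a time. A chunked array store would serve the same purpose with better support for random access.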
