schrodinger / coordgenlibs

Schrodinger-developed 2D Coordinate Generation
BSD 3-Clause "New" or "Revised" License

Coordgen is slower than RDKit's native 2d coordinate generation #39

Open d-b-w opened 5 years ago

d-b-w commented 5 years ago

Coordgen is slower than RDKit's native 2D coordinate generation. Average speeds are about 100x slower, and in the worst cases, coordgen can take multiple seconds.

The two tools don't do the same things, and I think that coordgen results are much better, so the comparison is not totally fair. I do think that coordgen should target consistently producing coordinates in less than 0.1s, with averages closer to 0.001s. This will allow us to discuss making coordgen the default in RDKit, which would be cool.

I'm going to link to the internal Schrödinger bug tracker, and our internal display for performance testing below, sorry...

At the time I post this, our automated performance testing says that:

2D coordinate generator   Average speed (s)   Slowest (s)   Count > 0.1s   Count > 1s
RDKit native              0.00035             0.04          0              0
coordgen                  0.028               3.9           17             235
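
For anyone wanting to reproduce the columns of this table outside our internal tooling, here is a minimal sketch using only the Python standard library. The helper names `time_each` and `summarize` are hypothetical, not part of coordgen or RDKit, and this is not the code behind the internal performance display.

```python
import time
from typing import Callable, Dict, Iterable, List, Sequence


def time_each(items: Iterable, fn: Callable) -> List[float]:
    """Return the wall-clock seconds that fn(item) takes for each item."""
    durations = []
    for item in items:
        start = time.perf_counter()
        fn(item)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: Sequence[float],
              thresholds: Sequence[float] = (0.1, 1.0)) -> Dict[str, float]:
    """Compute average, slowest, and over-threshold counts from timings."""
    if not durations:
        raise ValueError("no timings to summarize")
    summary = {
        "average_s": sum(durations) / len(durations),
        "slowest_s": max(durations),
    }
    # One counter per threshold, e.g. "count_gt_0.1s" and "count_gt_1.0s".
    for limit in thresholds:
        summary[f"count_gt_{limit}s"] = sum(1 for d in durations if d > limit)
    return summary
```

With RDKit installed, something like `summarize(time_each(mols, rdDepictor.Compute2DCoords))` after calling `rdDepictor.SetPreferCoordGen(True)` should give comparable numbers, assuming `mols` is a list of molecules you have already loaded. Keeping the per-molecule timings around also makes it easy to pick out the slowest structures.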

ricrogz commented 5 years ago

@d-b-w, it might be a good idea to add some of the molecules (especially the slow ones) from these benchmarks as tests in this repository.

ptosco commented 3 years ago

Sorry for reviving this 2-year-old ticket. I have just stumbled on the same problem on an internal dataset using the latest RDKit 2021.03.1 release, so I decided to reproduce it on public data and fetched 2000 indoles with 50 to 60 heavy atoms from ChEMBL (CSV file attached: chembl27_2000_indoles_50-60_ha.csv.gz).

Native RDKit depiction of these 2000 molecules takes ~3 s:

%%time
rdDepictor.SetPreferCoordGen(False)
for m in mols:
    rdDepictor.Compute2DCoords(m)
CPU times: user 3.02 s, sys: 23 ms, total: 3.05 s
Wall time: 3.04 s

CoordGen takes ~360x longer:

%%time
rdDepictor.SetPreferCoordGen(True)
for m in mols:
    rdDepictor.Compute2DCoords(m)
CPU times: user 18min 10s, sys: 868 ms, total: 18min 11s
Wall time: 18min 10s

At the moment, this means that CoordGen cannot be used to depict large-ish molecules in a table. Do you have plans to address this in the near future? Thanks a lot in advance.

d-b-w commented 3 years ago

Ugh, we just accidentally blew up coordgen time by at least 10x, which should be addressed in #90.

Sorry about that. When #90 is merged, I'll immediately issue a patch release of coordgen and post a PR to RDKit.

We're definitely hoping to do further work on this before the fall RDKit release. The bug in #90 actually provides some clues to next steps.

ptosco commented 3 years ago

Thank you for the super-fast reply, Dan! Looking forward to the PR.

ptosco commented 3 years ago

Thanks Dan! It looks much better now :-)

%%time
rdDepictor.SetPreferCoordGen(True)
for m in chembl_mols_2000:
    rdDepictor.Compute2DCoords(m)
CPU times: user 2min 5s, sys: 53 ms, total: 2min 5s
Wall time: 2min 5s

d-b-w commented 3 years ago

Great! This issue should remain open; I feel like the current rate is still too slow. But it's acceptable for many use cases.
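
To put the wall times reported in this thread side by side, here is a rough back-of-the-envelope sketch. The labels and helper names are mine, and the comparison against the 0.001s average target is only indicative, since that target was stated for a different benchmark set than the 2000 large indoles timed here.

```python
N_MOLS = 2000  # molecules in the ChEMBL indole set timed in this thread

# Wall times reported above, converted to seconds.
WALL_S = {
    "RDKit native": 3.04,
    "CoordGen (first run)": 18 * 60 + 10,
    "CoordGen (after the patch)": 2 * 60 + 5,
}

TARGET_AVG_S = 0.001  # average-speed goal from the opening comment


def per_molecule(total_s: float, n: int = N_MOLS) -> float:
    """Average seconds per molecule for a run over n molecules."""
    return total_s / n


def slowdown(total_s: float, baseline_s: float = WALL_S["RDKit native"]) -> float:
    """How many times slower a run is than the native RDKit run."""
    return total_s / baseline_s


for label, total in WALL_S.items():
    avg = per_molecule(total)
    print(f"{label}: {avg:.4f} s/mol, "
          f"{slowdown(total):.0f}x native, "
          f"{avg / TARGET_AVG_S:.1f}x the 0.001s target")
```

On these numbers the patched run averages 0.0625s per molecule, roughly 41x the native time on the same set.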