Fingerprints in the CLM package

There are multiple places where chemical fingerprints are calculated in the ‘clm’ package, and each one implies a decision about which fingerprinting algorithm to use. Currently, the code switches back and forth between ECFP6 (AllChem.GetMorganFingerprintAsBitVect) and RDKit (Chem.RDKFingerprint) fingerprints. The former yield higher accuracy when used as features in ML applications; the latter yield Tanimoto coefficients (Tc’s) that are more human-interpretable. Ideally, we would (1) set sensible defaults depending on the specific use case but also (2) allow the user to select a different fingerprint if justified.
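One way to support both goals would be a single helper that takes the fingerprint type as an argument. The following is a minimal sketch, not code that exists in the package: the name `compute_fingerprint`, the `fp_type` option values, and the 1024-bit length are assumptions made for illustration. A Morgan radius of 3 corresponds to ECFP6.

```python
SUPPORTED_FP_TYPES = ("ecfp6", "rdkit")  # hypothetical option names


def compute_fingerprint(smiles, fp_type="ecfp6", n_bits=1024):
    """Return a bit-vector fingerprint for a SMILES string.

    Hypothetical helper: `fp_type` and `n_bits` are illustrative and are
    not parameters that currently exist in the clm package.
    """
    if fp_type not in SUPPORTED_FP_TYPES:
        raise ValueError(f"unknown fingerprint type: {fp_type!r}")

    # Imported here so that option validation works without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    if fp_type == "ecfp6":
        # A Morgan fingerprint with radius 3 (diameter 6) is ECFP6.
        return AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits)
    # Path-based RDKit fingerprint.
    return Chem.RDKFingerprint(mol, fpSize=n_bits)
```

Each script could then call this helper with its own default while still accepting a user-supplied value.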

The specific places where fingerprints are calculated, and suggested defaults, are:

train_discriminator.py: fingerprints are used as input features for a classifier.

Here, it would be appropriate to use ECFP6 by default.

write_nn_Tc.py: fingerprints are used to calculate Tc (e.g. between generated molecules and the training set, between PubChem and the training set).

Here, it probably makes more sense to use RDKit by default.

write_structural_prior_CV.py: fingerprints are used to compare molecules suggested by the model to the ground truth.

Here, RDKit should definitely be used by default.
We will also need to apply this to PubChem in order to save RDKit fingerprints in base64.
Given that we are pre-calculating fingerprints for PubChem, is there a way to ensure we avoid accidentally comparing one set of fingerprints calculated on the fly with another saved in base64? (e.g. comparing RDKit fingerprints with saved ECFP6 fingerprints)

calculate_outcomes.py: fingerprints are used to calculate internal/external diversity and internal/external nearest-neighbor Tc.

This one could really go either way, but ECFP6 is probably slightly preferable.
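The suggested defaults above could be collected in one place, with a user-supplied value taking precedence. This is only a sketch: the dictionary, the function name `resolve_fp_type`, and the `--fp_type` command-line flag are hypothetical and are not part of the current code.

```python
import argparse

SUPPORTED_FP_TYPES = ("ecfp6", "rdkit")  # hypothetical option names

# Suggested per-script defaults, taken from the list above.
DEFAULT_FP_TYPE = {
    "train_discriminator.py": "ecfp6",
    "write_nn_Tc.py": "rdkit",
    "write_structural_prior_CV.py": "rdkit",
    "calculate_outcomes.py": "ecfp6",
}


def resolve_fp_type(script, override=None):
    """Return the user's choice if given, else the script's default."""
    fp_type = override if override is not None else DEFAULT_FP_TYPE[script]
    if fp_type not in SUPPORTED_FP_TYPES:
        raise ValueError(f"unknown fingerprint type: {fp_type!r}")
    return fp_type


def build_parser():
    """Parser with a hypothetical --fp_type flag; None means 'use default'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--fp_type", choices=SUPPORTED_FP_TYPES, default=None)
    return parser
```

For example, `resolve_fp_type("write_nn_Tc.py", build_parser().parse_args([]).fp_type)` falls back to the RDKit default, while passing `--fp_type ecfp6` overrides it.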

There are also a couple of uses of fingerprints in the slides that haven’t been implemented yet, or at least are not found on the main branch:

Calculating top-k accuracy by minimum Tc (slide 9): RDKit would be the appropriate default
Nearest-neighbor Tc, ever vs. never generated (slide 10): RDKit would be the appropriate default
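On the question raised above about accidentally comparing on-the-fly fingerprints with pre-calculated ones saved in base64, one possible safeguard is to store the fingerprint type alongside each saved fingerprint and check it when loading. The sketch below is an assumption about how this might look, not an existing format: the `type:base64` record layout and both function names are hypothetical.

```python
def tag_fingerprint(fp_type, fp_base64):
    """Prefix a base64-encoded fingerprint with its type, e.g. 'rdkit:...'."""
    if ":" in fp_type:
        raise ValueError("fingerprint type must not contain ':'")
    return f"{fp_type}:{fp_base64}"


def untag_fingerprint(record, expected_type):
    """Return the base64 payload, refusing records of a different type."""
    # The base64 alphabet has no ':', so the first ':' always ends the tag.
    fp_type, sep, fp_base64 = record.partition(":")
    if not sep:
        raise ValueError("record has no fingerprint-type prefix")
    if fp_type != expected_type:
        raise ValueError(
            f"saved fingerprint is {fp_type!r}, but {expected_type!r} was requested"
        )
    return fp_base64
```

With RDKit bit vectors, the payload could presumably come from the vector's `ToBase64()` method and be restored with `FromBase64()`. A single type recorded once in a file header would serve the same purpose with less storage overhead.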
