skinniderlab / CLM

MIT License

Fingerprints in the CLM package #129

Open skinnider opened 4 months ago

skinnider commented 4 months ago

There are multiple places where chemical fingerprints are calculated in the ‘clm’ package, and each one implies a decision about which fingerprinting algorithm to use. Currently, the code switches back and forth between ECFP6 (AllChem.GetMorganFingerprintAsBitVect) and RDKit (Chem.RDKFingerprint) fingerprints. The former yield higher accuracy when used as features in ML applications; the latter yield Tanimoto coefficients (Tc’s) that are more human-interpretable. Ideally, we would (1) set sensible defaults depending on the specific use case but also (2) allow the user to select a different fingerprint if justified.
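As a rough illustration of points (1) and (2), the selection logic could be centralized in one helper. This is only a sketch: `FINGERPRINT_TYPES`, `ASSUMED_DEFAULTS`, `resolve_fp_type`, and `compute_fingerprint` are hypothetical names that do not exist in clm, the use-case keys and their defaults are assumptions that simply follow the trade-off described above, and the 2048-bit size is an arbitrary choice.

```python
# Illustrative sketch only: every name in this block is hypothetical
# and not part of the clm package.
FINGERPRINT_TYPES = ("ecfp6", "rdkit")

# Assumed per-use-case defaults, following the trade-off described above.
ASSUMED_DEFAULTS = {
    "ml_features": "ecfp6",  # fingerprints used as classifier inputs
    "similarity": "rdkit",   # fingerprints used for human-interpretable Tc
}


def resolve_fp_type(use_case, override=None):
    """Return the user's override if given, else the assumed default."""
    fp_type = override if override is not None else ASSUMED_DEFAULTS[use_case]
    if fp_type not in FINGERPRINT_TYPES:
        raise ValueError(f"unknown fingerprint type: {fp_type!r}")
    return fp_type


def compute_fingerprint(smiles, fp_type, n_bits=2048):
    """Compute a bit-vector fingerprint for a SMILES string with RDKit."""
    if fp_type not in FINGERPRINT_TYPES:
        raise ValueError(f"unknown fingerprint type: {fp_type!r}")

    # Imported lazily so the selection logic above works without RDKit.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    if fp_type == "ecfp6":
        # ECFP6 corresponds to a Morgan fingerprint with radius 3.
        return AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=n_bits)
    return Chem.RDKFingerprint(mol, fpSize=n_bits)
```

Each script could then call something like `resolve_fp_type("similarity", override=args.fp_type)`, where `args.fp_type` would be a new, optional command-line argument (also hypothetical).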

The specific places where fingerprints are calculated, and suggested defaults, are:

train_discriminator.py: fingerprints are used as input features for a classifier.

write_nn_Tc.py: fingerprints are used to calculate Tc (e.g. between generated molecules and the training set, between PubChem and the training set).

write_structural_prior_CV.py: fingerprints are used to compare molecules suggested by the model to the ground truth.

calculate_outcomes.py: fingerprints are used to calculate internal/external diversity and internal/external nearest-neighbor Tc.
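For reference, the quantities mentioned in this list can be sketched in plain Python, treating each fingerprint as the set of its on-bit indices. This is not the clm implementation: the function names are hypothetical, defining internal diversity as the mean pairwise Tanimoto distance (1 − Tc) is an assumption (it is one common definition), and returning 0.0 for two empty fingerprints is just a convention chosen here.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as on-bit sets."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Convention assumed here: two empty fingerprints have Tc = 0.
    return len(a & b) / union if union else 0.0


def nearest_neighbor_tc(query, references):
    """Highest Tc between a query fingerprint and any reference fingerprint."""
    return max(tanimoto(query, ref) for ref in references)


def internal_diversity(fingerprints):
    """Mean pairwise Tanimoto distance (1 - Tc) within a set of molecules.

    Assumes at least two fingerprints are supplied.
    """
    pairs = list(combinations(fingerprints, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    generated = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
    training = [{1, 2, 3, 4}, {8, 9}]
    for fp in generated:
        print(sorted(fp), nearest_neighbor_tc(fp, training))
    print("internal diversity:", internal_diversity(generated))
```

With RDKit bit vectors, the same comparisons would normally go through `DataStructs.TanimotoSimilarity` or `DataStructs.BulkTanimotoSimilarity` rather than Python sets. Either way, the resulting values depend on which fingerprint was used, so Tc's computed from ECFP6 and from RDKit fingerprints are not directly comparable.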

There are also a couple of uses of fingerprints in the slides that haven’t been implemented yet, or at least are not found on the main branch: