Reproducibility and reliability of ECFP descriptors

subercui commented 1 year ago

Hi, when I run the same model on the same molecule multiple times, I get different results. Please see the following:

Code snippets:

import exmol
import skunk

# config and model_pred are defined earlier (not shown)
smiles_ = config["highlight_smiles"]
space = exmol.sample_space(smiles_, model_pred, batched=True, num_samples=1000)
exmol.lime_explain(space, descriptor_type="ECFP")
svg = exmol.plot_descriptors(space, return_svg=True)
skunk.display(svg)

Results of three runs:

I think this may be caused by the randomness in sampling the space. Would setting a random seed somewhere improve reproducibility? My larger concern, though, is how to interpret the results. Is there a way to make them more reliable?
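
To check the random-seed idea, a minimal sketch follows. It assumes the sampling draws from Python's global `random` generator, which this thread does not confirm. `fixed_seed` and `toy_sample` are hypothetical names, and `toy_sample` is only a stand-in for a stochastic call such as `exmol.sample_space`. If the sampler also uses NumPy's global generator, that would need seeding as well.

```python
import random
from contextlib import contextmanager


@contextmanager
def fixed_seed(seed):
    """Temporarily seed Python's global RNG, restoring its state afterwards."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)


def toy_sample(n):
    # Hypothetical stand-in for a stochastic sampler such as exmol.sample_space
    return [random.random() for _ in range(n)]


# Two runs under the same seed should produce identical draws
with fixed_seed(0):
    run_a = toy_sample(5)
with fixed_seed(0):
    run_b = toy_sample(5)
```

If two seeded runs still differ, the randomness is coming from somewhere this seed does not reach.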

whitead commented 1 year ago

Great question @subercui! This is something on our list of things to explore. @geemi725 - this is an important point. Can we explore this a bit?

subercui commented 1 year ago

Thanks. I wonder what causes the randomness and whether there is a way to reduce it to some degree. I tried increasing num_samples up to 10000, but it doesn't seem to help.
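
One way to put a number on the run-to-run variation is to compare which descriptors each run ranks highest. The sketch below is an illustration under stated assumptions, not part of exmol: `jaccard` and `mean_pairwise_jaccard` are hypothetical helpers, and the descriptor labels are made-up placeholders rather than results from the runs above.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections of descriptor labels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard overlap over every pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder top-3 descriptor labels from three hypothetical runs
runs = [
    ["amide", "phenyl", "hydroxyl"],
    ["amide", "ether", "hydroxyl"],
    ["amide", "phenyl", "ether"],
]
stability = mean_pairwise_jaccard(runs)
```

A value near 1 would mean the runs agree on the top descriptors, while a value near 0 would mean they barely overlap.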

whitead commented 1 year ago

@subercui ECFP often gives poor correlations on local vs. global explanations (it depends on the system), and those poor correlations make the p-values non-robust. MACCS is often better, as are custom descriptors for your application. We're working on improving this, but it typically cannot be addressed by changing the sample space. It's more a function of the fragments not fitting well.
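
Switching descriptor types, as suggested above, could look like the following sketch. It reuses only the calls from the original snippet, with `descriptor_type="MACCS"` in place of `"ECFP"`. The wrapper name `explain_with_maccs` is hypothetical, and `space` is assumed to come from `exmol.sample_space` as before.

```python
def explain_with_maccs(space):
    """Re-run the explanation from the original snippet with MACCS descriptors.

    `space` is assumed to be the output of exmol.sample_space.
    """
    # Imported lazily so this sketch can be defined without exmol installed
    import exmol

    # Same call as in the original post, with only the descriptor type changed
    exmol.lime_explain(space, descriptor_type="MACCS")
    return exmol.plot_descriptors(space, return_svg=True)
```

The returned SVG can then be shown with `skunk.display(svg)`, as in the original snippet.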

ur-whitelab / exmol

Reproducibility and reliability of ECFP descriptors #136