microsoft / molskill

Extracting medicinal chemistry intuition via preference machine learning
MIT License
103 stars 9 forks source link

MolSkillScorer.score returns nan for SMILES containing multiple fragments #2

Closed PatWalters closed 1 year ago

PatWalters commented 1 year ago

MolSkillScorer.score returns nan for SMILES containing multiple fragments

e.g. s = MolSkillScorer() s.score(["C1(=O)NC(=O)N(Cl)C(=O)N1Cl.[Na]"])

array([nan], dtype=float32)

Perhaps this is the right thing to do. I noticed I also get nan with molecules containing non-organic elements. Either way, it would be good to document this. Great tool, btw, thanks!

josejimenezluna commented 1 year ago

Many thanks for the kind words! :)

I can confirm this is reproduced on the main branch of the repo due to some RDKit descriptors producing nan values before they are fed to the default model. We sadly can do little about this without changing the default input representation.

(Pdb) desc =  next(iter(loader))
(Pdb) desc
tensor([[0., 0., 0.,  ..., 0., 0., 0.]])
(Pdb) torch.isnan(desc)
tensor([[False, False, False,  ..., False, False, False]])
(Pdb) torch.isnan(desc).sum()
tensor(8)

That being said, the default model, which uses a mix of fingerprints and 2d descriptors has never seen SMILES with multiple fragments or containing non-organic elements. If you are still interested in getting (likely wrong) predictions for these, you can train an alternative model that only uses the fingerprint part - you can check how to do so by having a look at the provided train.py script and the --featurizer_name flag.

In any case I will add a cautionary note on the README.md file.

josejimenezluna commented 1 year ago

Addressed in 0616ba0