Closed PatWalters closed 1 year ago
Many thanks for the kind words! :)
I can confirm this is reproduced on the main
branch of the repo due to some RDKit descriptors producing nan
values before they are fed to the default model. We sadly can do little about this without changing the default input representation.
(Pdb) desc = next(iter(loader))
(Pdb) desc
tensor([[0., 0., 0., ..., 0., 0., 0.]])
(Pdb) torch.isnan(desc)
tensor([[False, False, False, ..., False, False, False]])
(Pdb) torch.isnan(desc).sum()
tensor(8)
That being said, the default model, which uses a mix of fingerprints and 2d descriptors has never seen SMILES with multiple fragments or containing non-organic elements. If you are still interested in getting (likely wrong) predictions for these, you can train an alternative model that only uses the fingerprint part - you can check how to do so by having a look at the provided train.py
script and the --featurizer_name
flag.
In any case I will add a cautionary note on the README.md
file.
Addressed in 0616ba0
MolSkillScorer.score returns nan for SMILES containing multiple fragments
e.g. s = MolSkillScorer() s.score(["C1(=O)NC(=O)N(Cl)C(=O)N1Cl.[Na]"])
array([nan], dtype=float32)
Perhaps this is the right thing to do. I noticed I also get nan with molecules containing non-organic elements. Either way, it would be good to document this. Great tool, btw, thanks!