microsoft / molskill

Extracting medicinal chemistry intuition via preference machine learning
MIT License

MolSkillScorer.score returns nan for "unusual" atom types #3

Open PatWalters opened 1 year ago

PatWalters commented 1 year ago

This is a follow-up to the previous issue I reported. It turns out that the problem isn't with molecules that have multiple fragments. The problem is that a few of the RDKit descriptors return nan when they encounter atom types that are not parameterized. The impacted descriptors are listed below. I'm willing to bet you could remove these from the descriptors you're currently using without impacting performance. Then again, these molecules are probably outside your applicability domain, and MolSkillScorer.score should return nan.

from rdkit import Chem
from rdkit.Chem.Descriptors import BCUT2D_MWHI, MaxPartialCharge

# Selenium is one of the atom types these descriptors are not parameterized for.
mol = Chem.MolFromSmiles("CCC[Se]CCC")
a = BCUT2D_MWHI(mol)
b = MaxPartialCharge(mol)
a, b
(nan, nan)

Here are the problematic descriptors:

BCUT2D_MWHI BCUT2D_MWLOW BCUT2D_CHGHI BCUT2D_CHGLO BCUT2D_LOGPHI BCUT2D_LOGPLOW BCUT2D_MRHI BCUT2D_MRLOW MaxPartialCharge MinPartialCharge MaxAbsPartialCharge MinAbsPartialCharge
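
A list like the one above can be reproduced for any molecule by scanning a set of descriptor functions and collecting those that return nan. The following is only a minimal sketch: `nan_descriptors` is a hypothetical helper (it is not part of RDKit or MolSkill), and it assumes the descriptors are supplied as `(name, function)` pairs, which is the shape of RDKit's `Descriptors.descList`.

```python
import math
from typing import Any, Callable, Iterable, List, Tuple

# A descriptor is a (name, function) pair, the same shape as the
# entries of rdkit.Chem.Descriptors.descList.
Descriptor = Tuple[str, Callable[[Any], float]]


def nan_descriptors(mol: Any, descriptors: Iterable[Descriptor]) -> List[str]:
    """Hypothetical helper: names of descriptors that return nan for `mol`."""
    bad = []
    for name, fn in descriptors:
        value = fn(mol)
        # Integer-valued descriptors (counts) can never be nan, so only
        # float results are checked.
        if isinstance(value, float) and math.isnan(value):
            bad.append(name)
    return bad


if __name__ == "__main__":
    try:
        from rdkit import Chem
        from rdkit.Chem import Descriptors
    except ImportError:
        print("RDKit is not installed; skipping the demo.")
    else:
        mol = Chem.MolFromSmiles("CCC[Se]CCC")
        print(nan_descriptors(mol, Descriptors.descList))
```

Which names are reported may depend on the RDKit version, so the printed result is best treated as a diagnostic rather than a fixed list.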

josejimenezluna commented 1 year ago

Hi @PatWalters. Many thanks for identifying the problematic descriptors!

While I believe these molecules are certainly outside the applicability domain, I feel it is up to the user to be conscious of this, rather than up to us to return nan values.

I'll check whether removing these descriptors has any significant impact on performance and, if it does not, remove them from the default featurizer/model.
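
For readers who want to try the removal themselves, one way to drop the reported descriptors from a descriptor list is sketched below. This is an illustration only, not MolSkill's actual featurizer code: `drop_problematic` and `PROBLEMATIC` are hypothetical names, and the sketch assumes the descriptors are available as `(name, function)` pairs such as RDKit's `Descriptors.descList`.

```python
from typing import Any, Callable, FrozenSet, Iterable, List, Tuple

Descriptor = Tuple[str, Callable[[Any], float]]

# The twelve descriptor names reported in this issue.
PROBLEMATIC: FrozenSet[str] = frozenset({
    "BCUT2D_MWHI", "BCUT2D_MWLOW", "BCUT2D_CHGHI", "BCUT2D_CHGLO",
    "BCUT2D_LOGPHI", "BCUT2D_LOGPLOW", "BCUT2D_MRHI", "BCUT2D_MRLOW",
    "MaxPartialCharge", "MinPartialCharge",
    "MaxAbsPartialCharge", "MinAbsPartialCharge",
})


def drop_problematic(
    descriptors: Iterable[Descriptor],
    excluded: FrozenSet[str] = PROBLEMATIC,
) -> List[Descriptor]:
    """Hypothetical helper: keep only descriptors whose name is not excluded."""
    return [(name, fn) for name, fn in descriptors if name not in excluded]


if __name__ == "__main__":
    try:
        from rdkit.Chem import Descriptors
    except ImportError:
        print("RDKit is not installed; skipping the demo.")
    else:
        kept = drop_problematic(Descriptors.descList)
        removed = len(Descriptors.descList) - len(kept)
        print(f"{removed} descriptors removed, {len(kept)} kept")
```

A model trained on the full descriptor set would need to be retrained on the reduced set before the two could be compared, since the input dimensionality changes.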

SejeongPark8354 commented 8 months ago

Hi @josejimenezluna. Could you please provide an update on the effects of removing these specific descriptors from the default featurizer/model? I am interested in knowing if there have been any recent findings or observations regarding the impact this has on the performance of the analysis or model.