mordred-descriptor / mordred

a molecular descriptor calculator
http://mordred-descriptor.github.io/documentation/master/
BSD 3-Clause "New" or "Revised" License
340 stars 91 forks source link

AtomTypeEState indices empty instead zero #63

Open tobigithub opened 5 years ago

tobigithub commented 5 years ago

Some of the AtomTypeEState indices for elements are empty. I believe they should be zero. for example in mordred the index 938 |   | MAXsLi | AtomTypeEState (‘max’, ‘sLi’)

This is an E-state index for the element Lithium. In mordred it is empty for a number of values.

image

When calculating the same data set with Dragon, the E-state indices for missing elements in the compound (Li, Be etc) are correctly zero. I am not sure if they relate exactly to the same E-state index. However if the element such as Li is absent, its zero.

image

Installed via anaconda environment

conda install -c rdkit -c mordred-descriptor mordred

python -m mordred --version
mordred-1.1.1

Example files: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/

philopon commented 5 years ago

I know that max/min of EState of mordred are missing value and dragon's one are zero. And it is library design.

In mathematics, max/min of empty set are undefined. In mordred, missing value are returned instead of default value in such case, because it is easy imputing missing value, but it is hard distinguishing missing value from imputed values.

tobigithub commented 5 years ago

I am actually not a descriptor guy and I agree with the NA vs zero vs empty argument.

I would check what the mathematical equation says, E-state indices were published first in Kier and Hall see http://www.vcclab.org/lab/indexhlp/etstate.html (just an external ref with original references) and then I would check what other descriptor software tools calculate.

For the E-state index software I think the original reference is Molconn-Z http://www.edusoft-lc.com/molconn/manuals/400/chaptwo.html

Then its Dragon, Padel, Molgen-QSPR, MOLD, basically if every tools says zero (0) then its community voted or approved, otherwise mordred is different to all other tools.

I did not do any investigation of empty values vs zero values, but for deep learning it may not matter because its invariant and zero, I (personally) do not like empty values, because the may be replaced by min, mean, max values, but then again for chemical elements that are missing I think a zero value is just fine (no proof given).

philopon commented 5 years ago

I'm sorry for my late reply. And thank you for your detailed survey.

We can decide easily if it is mentioned by original paper or implementation. But, I cannot access Molconn-Z. Thus I cannot decide easily to adopt broken change.

However, for example, we can add like 'descriptor flavor' feature to chose such behavior.

calc = Calculator(descriptors, flator={AtomTypeEState: {"fill": 0}})

of course, this pseudo code is not best, and we should decide API carefully (I opened new issue #66 for discus about it).

plkx commented 3 years ago

This comment is for those suggesting generation of numerical values when none were actually determined. The program authors clearly know what they are doing and are thus far entirely correct in their approach.

Not determined = undefined.

Please do not generate values (0 or any other number) for undefined states!

For example, if there are no lithium atoms, why would they be assigned any value?

Even worse would be to assign zero to a descriptor that may take positive & negative values! Take some hypothetical descriptor that reflects partial charge on atoms. A zero value indicates that an atom has a charge between that of the tertiary carbon (~ +0.3) and the methyl groups (~ -0.5) in 2,2-dimethylpropane, for example. If charge calculation was not accomplished, then its value is undefined. Either omit the compound(s) without that descriptor value or omit that descriptor from further study. Meaningful regression is not possible otherwise.

Blank, NAN, or something comparable is the correct approach.

Regards,

plkx

tobigithub commented 3 years ago

@plkx I think NAN or NA is better than blank. With a blank one always wonders if something is wrong with the algorithm.