mordred-descriptor / mordred

a molecular descriptor calculator
http://mordred-descriptor.github.io/documentation/master/
BSD 3-Clause "New" or "Revised" License

Question: Searching and filtering mordred error/Missing values #83

Open ravichas opened 4 years ago

ravichas commented 4 years ago

Hello All:

I am computing descriptors for a set of compounds using code similar to the following:

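from rdkit import Chem
from mordred import Calculator, descriptors

# create descriptor calculator with ALL descriptors
suppl = Chem.SmilesMolSupplier('Data/bzr.smi')
# the supplier yields None for SMILES that RDKit cannot parse, so skip those
mols = [x for x in suppl if x is not None]
calc = Calculator(descriptors, ignore_3D=True)
test_df = calc.pandas(mols)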

Some of the test_df cells contain errors/missing values with messages such as "max() arg is an empty sequence (MAXsLi)". Other datasets produce similar messages, e.g. "float division by zero (MDEC-14)". After going through the Mordred descriptor page, I understand what these mean, but I am having difficulty searching for them or assigning them a fixed numerical value for later analysis. My question: is there a way to handle missing values differently in Mordred?

Thanks very much for your time and help.

Ravi

--

environment

OS/distribution

Windows 10

conda or pip

conda

python version

3.8.1

library version

Name Version Build Channel

attrs 19.3.0 py_0 conda-forge
backcall 0.1.0 py_0 conda-forge
blas 1.0 mkl anaconda
bleach 3.1.0 py_0 conda-forge
boost 1.70.0 py38h79cbd7a_1 conda-forge
boost-cpp 1.70.0 h6a4c333_2 conda-forge
ca-certificates 2019.11.28 hecc5488_0 conda-forge
cairo 1.16.0 h60892f0_1002 conda-forge
certifi 2019.11.28 py38_0 conda-forge
colorama 0.4.3 py_0 conda-forge
cycler 0.10.0 py_2 conda-forge
decorator 4.4.1 py_0 conda-forge
defusedxml 0.6.0 py_0 conda-forge
entrypoints 0.3 py38_1000 conda-forge
freetype 2.10.0 h563cfd7_1 conda-forge
icc_rt 2019.0.0 h0cc432a_1 anaconda
icu 64.2 he025d50_1 conda-forge
importlib_metadata 1.4.0 py38_0 conda-forge
inflect 4.0.0 py38_1 conda-forge
ipykernel 5.1.4 py38h5ca1d4c_0 conda-forge
ipymol 0.5 pypi_0 pypi
ipython 7.11.1 py38h5ca1d4c_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jaraco.itertools 5.0.0 py_0 conda-forge
jedi 0.16.0 py38_0 conda-forge
jinja2 2.10.3 py_0 conda-forge
joblib 0.14.1 py_0 anaconda
jpeg 9c hfa6e2cd_1001 conda-forge
json5 0.8.5 py_0 conda-forge
jsonschema 3.2.0 py38_0 conda-forge
jupyter_client 5.3.4 py38_1 conda-forge
jupyter_core 4.6.1 py38_0 conda-forge
jupyterlab 1.1.4 py_0 conda-forge
jupyterlab_server 1.0.6 py_0 conda-forge
kiwisolver 1.1.0 py38he980bc4_0 conda-forge
libblas 3.8.0 8_mkl conda-forge
libcblas 3.8.0 8_mkl conda-forge
libclang 9.0.1 default_hf44288c_0 conda-forge
liblapack 3.8.0 8_mkl conda-forge
libpng 1.6.37 h7602738_0 conda-forge
libsodium 1.0.17 h2fa13f4_0 conda-forge
libtiff 4.1.0 h21b02b4_3 conda-forge
llvm-openmp 9.0.1 2 conda-forge
lz4-c 1.8.3 he025d50_1001 conda-forge
m2w64-gcc-libgfortran 5.3.0 6
m2w64-gcc-libs 5.3.0 7
m2w64-gcc-libs-core 5.3.0 7
m2w64-gmp 6.1.0 2
m2w64-libwinpthread-git 5.0.0.4634.697f757 2
markupsafe 1.1.1 py38hfa6e2cd_0 conda-forge
matplotlib 3.1.2 py38_1 conda-forge
matplotlib-base 3.1.2 py38h2981e6d_1 conda-forge
mistune 0.8.4 py38hfa6e2cd_1000 conda-forge
mkl 2019.5 281 conda-forge
mkl-service 2.3.0 py38hfa6e2cd_0 conda-forge
mordred 1.2.0 pyhe5148d4_0 mordred-descriptor
more-itertools 8.1.0 py_0 conda-forge
msys2-conda-epoch 20160418 1
nbconvert 5.6.1 py38_0 conda-forge
nbformat 5.0.4 py_0 conda-forge
networkx 2.4 py_0 conda-forge
notebook 6.0.3 py38_0 conda-forge
numpy 1.17.5 py38hc71023c_0 conda-forge
olefile 0.46 py_0 conda-forge
openssl 1.1.1d hfa6e2cd_0 conda-forge
pandas 0.25.3 py38he350917_0 conda-forge
pandoc 2.9.1.1 0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parso 0.6.0 py_0 conda-forge
pickleshare 0.7.5 py38_1000 conda-forge
pillow 7.0.0 py38h9ea1dd6_0 conda-forge
pip 20.0.2 py38_0 conda-forge
pixman 0.38.0 hfa6e2cd_1003 conda-forge
prometheus_client 0.7.1 py_0 conda-forge
prompt_toolkit 3.0.2 py_0 conda-forge
py3dmol 0.8.0 py_0 rmg
pycairo 1.19.0 py38h905957f_0 conda-forge
pygments 2.5.2 py_0 conda-forge
pyparsing 2.4.6 py_0 conda-forge
pyqt 5.12.3 py38h6538335_1 conda-forge
pyqt5-sip 4.19.18 pypi_0 pypi
pyqtwebengine 5.12.1 pypi_0 pypi
pyrsistent 0.15.7 py38hfa6e2cd_0 conda-forge
python 3.8.1 he1f5543_1 conda-forge
python-dateutil 2.8.1 py_0 conda-forge
pytz 2019.3 py_0 conda-forge
pywin32 225 py38hfa6e2cd_0 conda-forge
pywinpty 0.5.7 py38_0 conda-forge
pyzmq 18.1.1 py38h16f9016_0 conda-forge
qt 5.12.5 h7ef1ec2_0 conda-forge
rdkit 2019.09.3 py38h422b363_0 conda-forge
scikit-learn 0.22.1 py38h6288b17_0 anaconda
scipy 1.3.2 py38h29ff71c_0 anaconda
send2trash 1.5.0 py_0 conda-forge
setuptools 45.1.0 py38_0 conda-forge
six 1.14.0 py38_0 conda-forge
sqlite 3.30.1 hfa6e2cd_0 conda-forge
terminado 0.8.3 py38_0 conda-forge
testpath 0.4.4 py_0 conda-forge
tk 8.6.10 hfa6e2cd_0 conda-forge
tornado 6.0.3 py38hfa6e2cd_0 conda-forge
tqdm 4.42.0 py_0 conda-forge
traitlets 4.3.3 py38_0 conda-forge
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_1
wcwidth 0.1.8 py_0 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.34.1 py38_0 conda-forge
wincertstore 0.2 py38_1003 conda-forge
winpty 0.4.3 4 conda-forge
xz 5.2.4 h2fa13f4_1001 conda-forge
zeromq 4.3.2 h6538335_2 conda-forge
zipp 2.1.0 py_0 conda-forge
zlib 1.2.11 h2fa13f4_1006 conda-forge
zstd 1.4.4 hd8a0e53_1 conda-forge

rgerkin commented 4 years ago

The best way is probably test_df.astype(float).fillna(0), which forces all the values to be floats, turning the string errors into NaNs, and then replaces them with 0.
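For the "searching and filtering" part of the question, here is a minimal sketch in plain pandas. The small `test_df` is a hand-made stand-in for real mordred output, and writing the error cells as plain strings is an assumption made for illustration. `pd.to_numeric(..., errors="coerce")` is swapped in for `astype(float)` because `astype(float)` raises a `ValueError` on cells that hold plain text (for example after the table has been saved to CSV and reloaded).

```python
import pandas as pd

# Hand-made stand-in for a mordred result table. The error cells are
# written as plain strings here, which is an assumption for illustration.
test_df = pd.DataFrame(
    {
        "MAXsLi": [
            "max() arg is an empty sequence (MAXsLi)",
            "max() arg is an empty sequence (MAXsLi)",
        ],
        "MDEC-14": [1.25, "float division by zero (MDEC-14)"],
        "nAtom": [12, 20],
    }
)

# Coerce every column to numbers; anything that cannot be parsed becomes NaN.
numeric_df = test_df.apply(pd.to_numeric, errors="coerce")

# Count the missing cells per descriptor to see where the errors are.
missing_per_column = numeric_df.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

Keeping the NaNs at this stage, rather than filling them immediately, leaves the choice between dropping and imputing open.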

ravichas commented 4 years ago

Thanks

plkx commented 3 years ago

Are you sure you want to assign them numerical values? If their value is undefined, then assigning them a value (0, for example) may create spurious or anomalous effects in further numerical processing.

When descriptors are "missing" (not a number, undefined, infinite, etc.), either leave out the descriptor entirely, or leave out the structure(s) that do not have that descriptor value.
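The two options above can be sketched with pandas, assuming the table has already been converted to numbers with NaN marking undefined values. The column and row names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hand-made numeric table; NaN and inf stand for undefined descriptor values.
df = pd.DataFrame(
    {
        "desc_a": [1.0, 2.0, 3.0],
        "desc_b": [0.5, np.inf, 0.7],
        "desc_c": [4.0, 5.0, 6.0],
    },
    index=["mol_1", "mol_2", "mol_3"],
)

# Treat infinite values as missing too.
df = df.replace([np.inf, -np.inf], np.nan)

# Option 1: leave out every descriptor that is undefined for any structure.
without_descriptor = df.dropna(axis="columns", how="any")

# Option 2: leave out every structure that lacks any descriptor value.
without_structure = df.dropna(axis="index", how="any")

print(list(without_descriptor.columns))
print(list(without_structure.index))
```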

Such determinations are done relatively easily in Excel, or other spreadsheet applications.

Number precision of the computer platform may create some issues with numerical values, but they should not result in blank or text values. For example, I sometimes see a column with a bunch of 0 values, but some with 1E-15. Those values are at the precision limit for floating-point numbers (treat them all as zero).

rgerkin commented 3 years ago

Most downstream ML algorithms are going to require them to have some value, and I think using such algorithms is the goal of many users of this package. Zero may not be best for the reasons you suggest (I would say that zero is "opinionated", i.e. it might be a particularly low or high value for that descriptor), so it may be better to fill them with the column median, or to do some other imputation (e.g. nuclear norm).
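A minimal sketch of column-median imputation with pandas, on a made-up table in which NaN marks the undefined values:

```python
import numpy as np
import pandas as pd

# Made-up descriptor table; NaN marks undefined values.
df = pd.DataFrame(
    {
        "desc_a": [1.0, np.nan, 3.0, 10.0],
        "desc_b": [0.2, 0.4, np.nan, 0.6],
    }
)

# Median of each column, computed over the defined values only.
medians = df.median()

# Fill each missing cell with the median of its own column.
filled = df.fillna(medians)
print(filled)
```

In a train/test setting the medians would normally be computed on the training rows only and then reused for the test rows.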

plkx commented 3 years ago

Mathematically (as in regardless of anyone's opinion), inserting zero values where no value has been determined contradicts essential requirements for numerical treatments of data by regression, genetic algorithms or artificial neural networks, for example. There may be approaches to dealing with undefined values, but I've yet to see or hear of an instance where simply creating and inserting zero values could withstand a level of scrutiny to warrant publication in a relevant venue. A shorter name for such a practice is "making up data," which has been the reason that many publications have been recalled, recanted, or worse when the circumstances were uncovered.

I doubt that those serious about advancing machine learning advocate making up data, even if it does offer essentially instant gratification through substantial simplification of daunting challenges. Simply calculating all of the descriptors a program offers is absurd, so absurd results should be expected.

On the other hand, parsing a chemical formula for elemental composition then eliminating descriptors for elements not present adds meaningful data. Mordred expects the user to input meaningful data, select relevant descriptors to calculate, and finally, have at least some contextual understanding of the descriptors to interpret and apply them meaningfully.
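The idea of parsing formulas and eliminating descriptors for absent elements could be sketched as follows. The descriptor-to-element lookup is a hypothetical, hand-written table for illustration (mordred does not provide it in this form), and the regular expression only handles simple formulas such as C6H5Cl.

```python
import re


def elements_in_formula(formula):
    """Return the set of element symbols in a simple formula such as 'C6H5Cl'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


# Hypothetical lookup of element-specific descriptors, for illustration only.
descriptor_element = {"MAXsLi": "Li", "MAXsssP": "P", "MINssssN": "N"}

# Made-up data set: chlorobenzene and ethylamine.
formulas = ["C6H5Cl", "C2H7N"]
present = set().union(*(elements_in_formula(f) for f in formulas))

all_descriptors = ["MAXsLi", "MAXsssP", "MINssssN", "nAtom"]

# Keep descriptors that are not element-specific, or whose element occurs.
relevant = [
    name
    for name in all_descriptors
    if name not in descriptor_element or descriptor_element[name] in present
]
print(relevant)
```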

To quote Albert Einstein, "It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience." Or, to paraphrase, "Everything should be made as simple as possible, but no simpler."

Conceptually, through ML a model could be "taught" the equations, definitions, evolution, and applicability of far more descriptors than any one person could master. Using that breadth & depth of knowledge, it could develop deeper understanding of molecular descriptors (the logical, algebraic and other mathematical relationships), the spectrum of molecules and their myriad properties and activities. With all of this in hand, ML would essentially take over new molecule design for pharma and beyond. I have no doubt that substantial pharma dollars have been and will continue to be directed toward these ends until they are achieved (I suspect they have achieved them in limited contexts, which will continue to expand, and it is all tightly guarded).

In the meantime, molecular descriptors are created and refined by computational scientists. Many (including me) are working on quantitative structure-activity relationships (QSAR) for molecules in biological circumstances, or quantitative structure-property relationships (QSPR) for molecules and physicochemical properties. The 2018 article introducing mordred describes mordred's capabilities, but is focused primarily on the superior quality and breadth of descriptors calculated by mordred versus PaDEL-Descriptor, Dragon and other software packages. This is because those engaged in QSPR or QSAR (including through ML) search for and rely on quality tools and quality data; ultimately, the product is no better than its weakest component. In other words, "garbage in → garbage out." "Zero insertion" for instances of undefined descriptor values contradicts the spirit of mordred promoted in their article, and would substantially diminish mordred's utility for QSAR/QSPR.

For those bothered by non-numerical values, they can be easily replaced with any values one likes in Excel, using simple find and replace.

Open-access article here: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y#Sec12

In any case, the current status of this mordred incarnation appears to be "abandoned," so this may all be moot unless someone takes up the project.

The most recent mordred activity seems to be in Docker images, e.g. the XenonPy project, where they claim to intend to introduce mordred 1.2.0 soon.

Regards,

plkx

rgerkin commented 3 years ago

Yes, I conceded that zero imputation is not a good idea, much more concisely, in my previous comment. But imputation of some kind (typically not zero imputation) is almost always done on some training data in these contexts. Removing entire descriptors because a very small fraction of molecules do not have values for them reduces predictive power, even though it would be appropriate to remove them for statistical inference. These are two different goals.
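One way to act on the "very small fraction" distinction is to drop only the descriptors that are missing for more than some share of the molecules. This is a sketch on made-up data, and the 25% cutoff is an arbitrary, hypothetical choice:

```python
import numpy as np
import pandas as pd

# Made-up descriptor table; NaN marks undefined values.
df = pd.DataFrame(
    {
        "mostly_defined": [1.0, 2.0, np.nan, 4.0, 5.0],
        "mostly_missing": [np.nan, np.nan, np.nan, 0.3, np.nan],
        "complete": [7.0, 8.0, 9.0, 10.0, 11.0],
    }
)

# Hypothetical cutoff: tolerate up to 25% missing values per descriptor.
max_missing_fraction = 0.25

# Fraction of molecules for which each descriptor is undefined.
missing_fraction = df.isna().mean()

# Keep only the descriptors at or below the cutoff.
kept = df.loc[:, missing_fraction <= max_missing_fraction]
print(list(kept.columns))
```

The remaining gaps in the kept columns can then be imputed or left for a model that tolerates NaN.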

plkx commented 3 years ago

Predictive power from undefined descriptors comes from the boundaries within which they are defined and undefined. Numbers, for example, may be characterized by square roots, but "square roots" of letters in the alphabet are not defined. One might try to impute numerical values, such as rank of alphabetic order, but such an approach lacks the rigorous mathematical logic applied in creation of molecular descriptors. Square roots of letters of the alphabet are not invariant properties, since one may choose different rules for ranking letters, or even different alphabets.

The availability of equations to calculate descriptors and their application to astronomical numbers of compounds which may be generated, for example, as SMILES in spreadsheets through simple concatenation operations, does not confer value on any of the data. There are infinite legitimate SMILES strings that represent molecules that could never exist in our universe because they lack chemical logic and context.

In the current case, attempting to treat undefined molecular descriptors as if they are defined seems likely to cause more damage in ML predictive models through loss of context. Failure to recognize or capitalize on objective and logical contexts degrades ML outcomes.

A few key points:

Mordred calculates descriptors developed/intended/validated for predominantly covalent carbon compounds. The fewer cationic or anionic atoms in a molecule, the more relevant mordred's descriptors.

This still leaves mordred able to generate descriptors for an essentially infinite number of compounds comprising only the elements hydrogen, carbon, nitrogen and oxygen. A relatively small fraction of molecules that actually exist have atoms other than hydrogen, carbon, nitrogen and oxygen. Many of those don't exist at ambient temperatures, in the presence of air, or in the presence of water (including humid air).

Expansion of the set to include halogens, sulfur and phosphorus encompasses the vast majority of the molecules that exist on our planet.

The importance of organic compounds which include metal atoms probably cannot be overstated. They are also far more difficult to model, largely because of the sharp decline in the number of relevant descriptors when metal atoms are included.

Mordred explicitly specifies elements in 327 descriptors, and in 1 specifies halogens, for a total of 328 element-specific descriptors. 200 of those descriptors specify elements other than H, C, N and O. Those elements are Li, Be, B, F, Si, P, S, Cl, Ge, As, Se, Br, Sn, I and Pb. Through routine use of mordred, I found it useful to develop means to readily identify and delete irrelevant descriptors from mordred csv output files. This information is in the spreadsheet I have attached. It was derived from one of the supplemental files to the 2018 Mordred paper in the Journal of Cheminformatics. Besides the names and types of descriptors, it now provides easy identification of irrelevant descriptors, so they may be readily removed by index.

There are many elements missing from that list, elements that form "significant compounds" with carbon (significant as in toxic, bioactive, high-value, etc.). Missing elements of significance include (at least) Mg, Al, Ti, V, Fe, Co, Ni, Cu, Zn, Zr, Pd, Hg and Bi.

Meaningful descriptors for carbon compounds with most elements are few or not available at all. There are many reasons this is so. For one, elements with electrons in d-orbitals are generally not well modeled by current theories (comprising predominantly molecular mechanics, semiempirical quantum, ab initio quantum, and density functional theories). Treatments for chemical bonds involving d-orbitals, especially covalent-type, are generally specialized within very limited parameters. Furthermore, for elements in the periodic table starting around iodine and higher atomic numbers, relativistic effects increase in significance due to the higher charge of the atomic nuclei. These are non-trivial effects: gold metal is yellow and lead provides substantial electric current in batteries due to relativistic quantum effects.

Nonetheless, I encourage people to continue pursuing computational chemistry and information science, and wish them more successes than failures. This is how endeavors that were considered intractable 30 years ago have become almost trivial now, and so may progress continue.

Best Regards,

plkx mord_dscrptrs_addns.xlsx

rgerkin commented 3 years ago

In practice, imputation of missing values is both broadly used and improves prediction in many, many cases. This is broadly accepted in the ML community. Predictive modeling challenges, including QSAR applications, have been won using imputation, producing models that outperform nonimputed variants. Perhaps you are unfamiliar with the literature and track record of the technique?

plkx commented 3 years ago

There is an expectation of informed application of the descriptors.

Applying descriptor calculations for lithium compounds where none exist is anything but informed.

Perhaps what's being "won" is NOT winning the race to the bottom of least valuable predictions, since it is after all, relative.

The ML literature I am familiar with espouses the value of contextual learning, and seeks to not obliterate context in the quest for ever more "data points."

Eventually, better data wins over more data.

Ragingdemo commented 6 months ago

So, what should we do with the missing data in this case? I am also facing the same issue. Thanks for your time.

JacksonBurns commented 6 months ago

@Ragingdemo for descriptors where there are no values at all, drop the column. For columns with some missing values, impute the values with some algorithm, such as replacing them with the column mean or median.

Also, check out my fork of this repo, which is still maintained: https://github.com/JacksonBurns/mordred-community. It has support for modern Python versions and fixes a number of small bugs.

Ragingdemo commented 3 weeks ago

@JacksonBurns Thank you for the response. I checked it out and switched to the community version, which does help. However, I have one question: for some chemicals I get errors such as "min() arg is an empty sequence (MINssssN)", and for some descriptors "max() arg is an empty sequence (MAXsssP)". Should I do what you suggested and impute values using some algorithm? I would also like to know whether I am calculating the descriptors correctly, or whether there are issues with my code, listed below:

import rdkit
from rdkit import Chem
from rdkit.Chem import Draw, AllChem
import mordred
from mordred import Calculator, descriptors
import pandas as pd

mol_list = []
for smiles in l:
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)
    mol_list.append(mol)

calc = Calculator(descriptors, ignore_3D=True)
mol = pd.DataFrame(mol_list)
all_desc = calc.pandas(mol[0])

where l is the list containing the SMILES strings. Thank you for your time.

JacksonBurns commented 3 weeks ago

It's difficult to say without knowing the actual species, but some of these descriptors just aren't defined for some molecules. The exact error you get doesn't mean a whole lot unless you are familiar with how each descriptor is calculated. It is easier to simply do the imputation as you have mentioned.
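As a rough illustration of why a message like "max() arg is an empty sequence" can appear (an assumption about the general pattern, not a copy of mordred's code): a descriptor that takes the maximum over all atoms of one type has nothing to take the maximum of when the molecule contains no atom of that type. The per-atom values below are made up.

```python
# Illustrative only: made-up per-atom values for one atom type in two molecules.
values_by_molecule = {
    "has_the_atom_type": [1.8, 2.4],
    "lacks_the_atom_type": [],
}

results = {}
for name, values in values_by_molecule.items():
    try:
        # Maximum over the atoms of the given type.
        results[name] = max(values)
    except ValueError as err:
        # max() of an empty list raises ValueError; record the value as missing.
        results[name] = None
        print(f"{name}: {err}")

print(results)
```

Read this way, the error is less a sign of a bug in the calling code than a statement that the descriptor has no value for that molecule.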