mordred-descriptor / mordred

a molecular descriptor calculator
http://mordred-descriptor.github.io/documentation/master/
BSD 3-Clause "New" or "Revised" License
340 stars 91 forks source link

Error while calculating 3D descriptors: missing 3D coordinate (RNCS/RNCG/AtomicCharge/Propc/AtomicSurfaceArea) #93

Open cartilage-ftw opened 3 years ago

cartilage-ftw commented 3 years ago

Description

I'm trying to calculate a whole bunch of descriptors, including some 3D descriptors using a set of SMILES. It doesn't give me any values for RASA, TASA, TPSA, etc., just an error in place of the values "missing 3D coordinate".

Code

I am loading a bunch of smiles and making a list of RDKit Mol objects of the corresponding SMILES molecules_list = [] After doing

desc_needed = ['SIC0', 'IC0', 'CIC0', 'nRot', 'nN', 'nH', 'nC', 'nS', 'nO', 'nHBDon', 'nHBAcc',
               'GeomDiameter', 'TopoPSA', 'SLogP', 'RASA', 'TASA', 'TPSA', 'RNCS', 'RPCS', 'RPSA']
calc = Calculator(descriptors, ignore_3D=False)
calc.descriptors = [d for d in calc.descriptors if str(d) in desc_needed]
result = calc.pandas(molecules_list)

I get the following output

https://i.imgur.com/vBwU12N.png

The particular text output for RASA is,

missing 3D coordinate (RNCS/RNCG/AtomicCharge/Propc/AtomicSurfaceArea)

In case you need some of my SMILES for reproducing this

C(C(CC(C(CO)O)O)=O)=O
C(CC(C(C(CO)O)O)=O)=O
C(C(C(CC(CO)O)=O)O)=O
C(CC(C=O)O)(C(CO)O)=O
C(C(C(C=O)O)O)(CCO)=O
C(C(C(CC(CO)=O)O)O)=O
C(C(C(C(C(C)=O)O)O)O)=O
C(CC(C(C(C=O)O)O)O)=O
C(CO)=O
C(C(C(CO)O)O)=O
C(CO)(C(C(C(CO)O)O)O)=O
C(C(C1C(C(C(O)O1)O)O)O)O
C(C1C(C(C(C(O)O1)O)O)O)O
C1C(C(C(C(C(O)O1)O)O)O)O

Environment

OS/distribution

Manjaro KDE Plasma Kernel: 5.8.18-1-MANJARO

conda or pip

Using conda (an environment called my-rdkit-env)

python version

Python 3.7.9

library version

Please execute the command and paste result.

plkx commented 3 years ago

You have to provide the 3D structures if you want to calculate 3D descriptors with Mordred. I've found DataWarrior to be an exceptionally easy to use free software package for going from smiles to 2D and/or 3D structures for LARGE compound sets.

Paul

cartilage-ftw commented 3 years ago

You have to provide the 3D structures if you want to calculate 3D descriptors with Mordred. I've found DataWarrior to be an exceptionally easy to use free software package for going from smiles to 2D and/or 3D structures for LARGE compound sets.

Paul

Hey, thank you for your reply, Paul. But how do you provide 3D molecules to Mordred using RDKit? DataWarrior is great but I'm not sure if it will cover the entire list of descriptors I need to calculate.

plkx commented 3 years ago

Put your smiles in a text document (attachment #1).

Open the text document in DWarrior.

Save it (default save, as DWAR file). This is not essential, but makes future work easier, such as adding a name column. (attachment #2, after deleting the .txt extension because github does not allow much).

Generate 2D atom coordinates ( Chemistry → Generate 2D Atom Coordinates…)

Generate 3D structures (Chemistry → Generate Conformers…; Leave Max. Conformer Count = 1; DO NOT CHECK SAVE TO FILE). In this case, I chose the systematic, low energy bias algorithm; initial torsions from the crystallographic database; and minimized energy using MMFF94s+ forcefield. That took ~3 seconds on my laptop, but can take minutes or hours for large compound sets, e.g. recently took ~20 minutes for an 1800 cmpd set. You can choose not to minimize for fastest results.

Save changes.

Now save it as an SDF (File → Save Special → SD-File). A dialog pops up. Click "save" under "MDL SD-files (.sdf)." In the next dialog: (1) leave Structure Column as is; (2) change SD-file version to version 2 (Mordred may not like version 3 files); (3) change Atom Coordinates form 2D to "3D if available"; (4) choose you compound name column from the dropdown. Keep in mind program limitations (e.g. names with commas will give garbled output from Mordred if you choose csv file output). I used the smiles you provided as the compound name column. (attachment #3, after deleting the .txt extension because github does not allow much).

The new SDF contains 3D structures of your smiles. Note: your smiles did not specify stereoisomers, but a 3D structure requires such specificity. Data Warrior creates one stereoisomer per smiles in this case, as indicated by the usual "R" & "S" atom labels.

The attached files include (1) your smiles in a text file (2) the DWar file ready to save as an SDF and (3) a pruned SDF of 3D structures from your smiles. Pruning involved deleting columns such as "minimization energy" & smiles.

This is barely the beginning of what you can do using DataWarrior - I invite you to read the better-than-average (for free software) documentation with DataWarrior for the good stuff.

Good Luck,

Paul

smiles_cartilage-ftw.txt

Remove the .txt extension on this file to get the DWAR file: smiles_cartilage-ftw.dwar.txt

Remove the .txt extention on this file to get the SDF: smiles_cartilage-ftw_pruned.sdf.txt

plkx commented 3 years ago

The SDF file above is your input file for Mordred, as in

$ python -m mordred smiles_cartilage-ftw_pruned.sdf -o MORD_cartilage-ftw_pruned.csv

You supply the output file name (after -o). In this example, I chose MORD_cartilage-ftw_pruned.csv

plkx commented 3 years ago

An SDF serves as the input file for Mordred.

It is entirely possible to generate 3D structures from smiles using RDKit.

Personally, I find the DataWarrior GUI preferable, as it has extensive manipulation and visualization capabilities.

In the case exemplified above, a combination using the rdkit.Chem.EnumerateStereoisomers module to generate all possible stereoisomer SMILES (saved in a text file) can then be imported (opened) in DataWarrior. Elimination of duplicates is trivial in DataWarrior (Data → Delete Rows → Duplicate Rows…).

On another note, I do further molecular modeling for QM properties. I have found that DWar provides 3D structures that require less curation/refinement during geometry and energy minimization by semiempirical methods. Sometimes, unminimized 3D structures using torsions from the crystallographic database provide best starting structures. This is especially the case for sets of homologous molecular structures because the initial conformers retain more apparent homolog consistency.

Of course, these are my personal experiences with molecule sets I have studied, versus either Open Babel or RDKit, so your mileage may vary.

Paul

plkx commented 3 years ago

Window capture from DataWarrior with all 3D structures. DWARCapture

plkx commented 3 years ago

Oops - I moused over a structure that was not selected, causing the 3D structure above to be for the compound above the highlighted heptacyclo compound. Here is the a snip showing the same compound in all panes. DWAR_Capture

ky66 commented 3 years ago

Yeah Paul I tried this and it is missing all the 3D features still. It is supposed to have 1800+ features. Your SDF gives 1614 features only.

xdn-github commented 1 year ago

Hi! Maybe it's toooo late to answer this. And for the questioner the problem may have been solved. But I would likely to write my solution for the problem to help new comers.

IN mordred 1.2.0 you CAN NOT get all the descriptos by using code below:

$ python -m mordred [molecular.sdf] -o [molecular_descriptors.csv] you only get 1614 descriptors as ky66 mentioned, which excludes 3D descriptors.

what you should do is using code like this:

$ python -m mordred -3 [molecular.sdf] -o [molecular_descriptors.csv] just add the "-3" in it! And you can get all descriptors result in [molecular_descriptors.csv]

the whole sample code you can try is like this:

from mordred import Calculator, descriptors
from rdkit import Chem
from rdkit.Chem import AllChem
import pandas as pd
import os

# using RDkit create 3D sdf file or you can uising DataWar as mentioned ahead
smiles = 'CCCC(=O)O[C@@H]1C[C@H]2C=CC[C@@H]21'
mol = AllChem.AddHs(Chem.MolFromSmiles(smiles))
AllChem.EmbedMolecule(mol)
AllChem.MMFFOptimizeMolecule(mol)
Chem.MolToMolFile(mol,  "test.sdf")

# create all descriptors using sdf file
os.system("python -m mordred -3 test.sdf -o test.csv")
# if you use jupyter the code ahead can be"!python -m mordred -3 test.sdf -o test.csv"

# you can check the descriptors files as blow in jupyter
df = pd.read_csv('test.csv')
df.head()

Thanks @ky66 a lot! I find this solution in his assay. In the end, I think it's ok to close this issue?