mordred-descriptor / mordred

a molecular descriptor calculator
http://mordred-descriptor.github.io/documentation/master/
BSD 3-Clause "New" or "Revised" License

Hardening mordred software with PubChem SDF files #61

Open tobigithub opened 6 years ago

tobigithub commented 6 years ago

It would be important to solve some of the calculational errors by testing mordred with a larger and diverse repository such as PubChem. The files can be obtained here:

ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/

Running the first 1 million or 10 million molecules will reveal many current issues that can then be fixed. Mordred is fast enough to process that many molecules relatively quickly.
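To cover more than one file, the per-file command can be generated in a loop. The following is only a sketch: it assumes the chunk naming follows the pattern of the file name used in this report (25,000 compound IDs per file, zero-padded to nine digits), and the helper names `chunk_names` and `mordred_commands` are hypothetical.

```python
def chunk_names(first_id, last_id, chunk_size=25000):
    """Yield PubChem-style base names such as Compound_000000001_000025000.

    The chunk size and zero padding are assumptions taken from the single
    file name quoted in this issue, not from PubChem documentation.
    """
    for start in range(first_id, last_id + 1, chunk_size):
        end = start + chunk_size - 1
        yield f"Compound_{start:09d}_{end:09d}"


def mordred_commands(first_id, last_id, chunk_size=25000):
    """Build one `python -m mordred` argument list per chunk."""
    return [
        ["python", "-m", "mordred", "-t", "sdf", f"{name}.sdf", "-o", f"{name}.csv"]
        for name in chunk_names(first_id, last_id, chunk_size)
    ]
```

Each argument list can be passed to `subprocess.run`, for example `for cmd in mordred_commands(1, 100000): subprocess.run(cmd)`, once the corresponding SDF files have been downloaded and unpacked.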

python -m mordred -t sdf Compound_000000001_000025000.sdf -o Compound_000000001_000025000.csv

WARNING:root:mol read failure: Compound_000000001_000025000.22903
[18:02:01] Explicit valence for atom # 0 Cl, 3, is greater than permitted
[18:02:01] ERROR: Could not sanitize molecule ending on line 4241268
[18:02:01] ERROR: Explicit valence for atom # 0 Cl, 3, is greater than permitted

WARNING:root:mol read failure: Compound_000000001_000025000.22933
[ERROR] 783: float division by zero (ETA_shape_p)
[ERROR] 783: index 0 is out of bounds for axis 0 with size 0 (ETA_eta_R/ReferenceMol)
[ERROR] 783: division by zero (ETA_epsilon_3/ReferenceMolH)
[ERROR] 1038: float division by zero (ETA_shape_p)
[ERROR] 1038: division by zero (AETA_eta_R)
[ERROR] 1038: tuple index out of range (ETA_epsilon_5)

[ERROR] 1038: tuple index out of range (ETA_epsilon_5)
 31%|██████████████████████████▍ | 7249/23253 [08:50<13:51, 19.25it/s]

[18:10:51] Can't kekulize mol.  Unkekulized atoms: 1 3 4 7 8

Traceback (most recent call last):
  File ".../lib/python3.6/site-packages/mordred/_base/context.py", line 39, in from_query
    m = Chem.AddHs(mol) if eh else Chem.RemoveHs(mol)
ValueError: Sanitization error: Can't kekulize mol.  Unkekulized atoms: 1 3 4 7 8

Some of the errors are captured but not yet resolved, such as the divisions by zero and the out-of-range tuple indices. For the failed kekulization, the compound should be kept, but the values in the descriptor matrix should be all NA.
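To see which descriptors fail most often over a large run, the captured `[ERROR]` lines can be tallied. This is a rough sketch that assumes every such line has the shape shown in the log above, `[ERROR] <number>: <message> (<descriptor>)`; the number is treated here as a record index, which is a guess, and `tally_errors` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches lines such as "[ERROR] 783: float division by zero (ETA_shape_p)".
ERROR_LINE = re.compile(r"^\[ERROR\]\s+(\d+):\s+(.*)\s+\(([^()]+)\)\s*$")


def tally_errors(log_text):
    """Count (descriptor, message) pairs and collect the numbers they occur with."""
    counts = Counter()
    records = {}
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match is None:
            continue  # progress bars, RDKit messages, blank lines
        record, message, descriptor = match.groups()
        key = (descriptor, message)
        counts[key] += 1
        records.setdefault(key, set()).add(int(record))
    return counts, records
```

Calling `counts.most_common()` on the result lists the most frequent descriptor and message pairs first, which gives a starting order for fixing them.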

philopon commented 6 years ago

Thank you for your suggestion!

It would be important to solve some of the calculational errors by testing mordred with a larger and diverse repository such as PubChem.

I completely agree. I'll test mordred on large datasets in the future.

I checked the log you reported and found a bug in ReferenceMol, so I opened a new issue, #64. Thank you.

For the failed kekulization, the compound should be kept, but the values in the descriptor matrix should be all NA.

I don't think so. Some kekulization failures come from issues in RDKit, not from the structure itself. In such cases, users may want to calculate descriptors that don't use Kekulé information. However, I do understand your use case, and I'll consider it.
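The two positions can be reconciled by isolating failures per descriptor instead of per molecule. The sketch below is illustrative only and is not mordred's API: `compute_row` is a hypothetical helper, and the descriptor functions are plain callables standing in for real calculators.

```python
import math


def compute_row(mol, descriptors):
    """Evaluate each descriptor on its own; a failure yields NaN for that column only.

    `descriptors` maps a column name to a callable taking the molecule.
    """
    row = {}
    for name, func in descriptors.items():
        try:
            row[name] = func(mol)
        except Exception:  # e.g. ZeroDivisionError, IndexError, ValueError
            row[name] = math.nan
    return row
```

With this layout the compound always keeps its row. Descriptors that do not need the failing step still get a value, and if every descriptor fails the row is all NaN, which matches the behaviour requested above.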

Thanks,

xiaoboy185 commented 1 year ago

Dear tobigithub, have you solved the following issue?

WARNING:root:mol read failure: Compound_000000001_000025000.22903
[18:02:01] Explicit valence for atom # 0 Cl, 3, is greater than permitted
[18:02:01] ERROR: Could not sanitize molecule ending on line 4241268
[18:02:01] ERROR: Explicit valence for atom # 0 Cl, 3, is greater than permitted

I recently ran into the same problem: the explicit valence for atom B, 4, is reported as greater than permitted. However, the structure is chemically reasonable. I look forward to your reply. Thanks!

The following is the issue that I posted recently:

Dear mordred users,

When I run the following command:

python -m mordred -t sdf clusters.sdf -o clusters.csv

I get this error:

[10:58:35] Explicit valence for atom # 11 B, 4, is greater than permitted
[10:58:35] ERROR: Could not sanitize molecule ending on line 68
[10:58:35] ERROR: Explicit valence for atom # 11 B, 4, is greater than permitted
WARNING:root:mol read failure: clusters

The SDF file is as follows:

clusters
 OpenBabel01042310313D

 30 33  0  0  1  0  0  0  0  0999 V2000
   -1.5093   -3.2766    0.6653 O   0  0  0  0  0  1  0  0  0  0  0  0
   -1.2712   -2.1184    0.4200 C   0  0  2  0  0  3  0  0  0  0  0  0
    0.0521   -1.6942    0.3007 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.4728   -0.4370    0.0305 C   0  0  2  0  0  3  0  0  0  0  0  0
   -0.4104    0.5272   -0.1484 N   0  0  1  0  0  0  0  0  0  0  0  0
   -1.7723    0.2605   -0.0605 C   0  0  2  0  0  3  0  0  0  0  0  0
   -2.7232    1.2681   -0.2460 C   0  0  0  0  0  3  0  0  0  0  0  0
   -4.0685    0.9799   -0.1540 C   0  0  0  0  0  3  0  0  0  0  0  0
   -4.5139   -0.3091    0.1234 C   0  0  0  0  0  3  0  0  0  0  0  0
   -3.5936   -1.3146    0.3094 C   0  0  0  0  0  3  0  0  0  0  0  0
   -2.2277   -1.0413    0.2199 C   0  0  2  0  0  3  0  0  0  0  0  0
    0.0653    2.0379   -0.4730 B   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4754    2.3781   -1.6869 F   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4206    2.8370    0.5306 F   0  0  0  0  0  0  0  0  0  0  0  0
    1.5032    2.1309   -0.5280 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.3278    1.1339   -0.3398 C   0  0  2  0  0  3  0  0  0  0  0  0
    3.7066    1.3956   -0.4267 C   0  0  0  0  0  3  0  0  0  0  0  0
    4.6246    0.3948   -0.2401 C   0  0  0  0  0  3  0  0  0  0  0  0
    4.2009   -0.9055    0.0405 C   0  0  0  0  0  3  0  0  0  0  0  0
    2.8612   -1.1854    0.1302 C   0  0  0  0  0  3  0  0  0  0  0  0
    1.8909   -0.1819   -0.0558 C   0  0  2  0  0  3  0  0  0  0  0  0
    0.7462   -2.4201    0.4344 H   0  0  0  0  0  0  0  0  0  0  0  0
   -2.4090    2.2729   -0.4615 H   0  0  0  0  0  0  0  0  0  0  0  0
   -4.7852    1.7733   -0.3009 H   0  0  0  0  0  0  0  0  0  0  0  0
   -5.5704   -0.5168    0.1918 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.8909   -2.3292    0.5265 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.0066    2.4078   -0.6444 H   0  0  0  0  0  0  0  0  0  0  0  0
    5.6798    0.6103   -0.3096 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.9263   -1.6901    0.1868 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.5714   -2.2028    0.3489 H   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  1  0  0  0
  3  2  1  0  0  0  0
  3 22  1  0  0  0  0
  4  3  1  6  0  0  0
  5 12  1  1  0  0  0
  5  6  1  0  0  0  0
  5  4  1  0  0  0  0
  6  7  1  1  0  0  0
  6 11  1  0  0  0  0
  7  8  1  0  0  0  0
  8  9  1  0  0  0  0
  9 25  1  0  0  0  0
  9 10  1  0  0  0  0
 10 26  1  0  0  0  0
 11 10  1  1  0  0  0
 11  2  1  0  0  0  0
 12 14  1  0  0  0  0
 13 12  1  0  0  0  0
 15 12  1  0  0  0  0
 16 15  1  6  0  0  0
 16 21  1  0  0  0  0
 17 16  1  0  0  0  0
 17 18  1  0  0  0  0
 18 19  1  0  0  0  0
 19 20  1  0  0  0  0
 19 29  1  0  0  0  0
 20 30  1  0  0  0  0
 21  4  1  0  0  0  0
 21 20  1  6  0  0  0
 23  7  1  0  0  0  0
 24  8  1  0  0  0  0
 27 17  1  0  0  0  0
 28 18  1  0  0  0  0
M  END
$$$$
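The reported valence can be checked by hand from the bond records of the posted file. The sketch below sums bond orders per atom; `explicit_valences` is a hypothetical helper, and it splits on whitespace, which works for these records but is an assumption, since V2000 fields are fixed-width and can run together for three-digit atom numbers.

```python
from collections import Counter


def explicit_valences(bond_lines):
    """Sum bond orders per atom from V2000 bond records ("a b order ...")."""
    totals = Counter()
    for line in bond_lines:
        first, second, order = (int(field) for field in line.split()[:3])
        totals[first] += order
        totals[second] += order
    return totals


# The four bond records from the posted file that involve atom 12 (boron).
# RDKit reports it as atom # 11 because it counts atoms from zero.
boron_bonds = [
    "5 12 1 1 0 0 0",
    "12 14 1 0 0 0 0",
    "13 12 1 0 0 0 0",
    "15 12 1 0 0 0 0",
]
print(explicit_valences(boron_bonds)[12])  # 4
```

The boron atom record carries no charge and the file has no charge lines, so as written it describes a neutral boron with four single bonds, which is what the message reports. Whether formal charges or a different bond representation is the right way to encode this structure is a chemistry question that this thread does not settle.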