Open tobigithub opened 6 years ago
Thank you for your suggestion!
It would be important to solve some of the calculation errors by testing mordred with a larger, more diverse repository such as PubChem.
I completely agree with your opinion. I'll test mordred on large datasets in the future.
I checked the log you reported and found a bug in ReferenceMol, so I opened a new issue, #64. Thank you.
For the failed kekulization, the compound should be kept, but the values in the descriptor matrix should be all NA.
I don't think so. Some kekulization failures come from issues in RDKit rather than from the structure itself. In such cases, users may still want to calculate descriptors that don't use kekule information. Of course, I understand your use case, and I'll consider it.
Thanks,
Dear tobigithub, have you solved the following issue?
WARNING:root:mol read failure: Compound_000000001_000025000.22903
[18:02:01] Explicit valence for atom # 0 Cl, 3, is greater than permitted
[18:02:01] ERROR: Could not sanitize molecule ending on line 4241268
[18:02:01] ERROR: Explicit valence for atom # 0 Cl, 3, is greater than permitted
I recently ran into the same problem: the valence for atom B, 4, is greater than permitted. However, the structure itself is reasonable. I look forward to your reply. Thanks!
The following is the issue that I have posted recently:
Dear mordred users
When I run the following command:
python -m mordred -t sdf clusters.sdf -o clusters.csv
the error is:
[10:58:35] Explicit valence for atom # 11 B, 4, is greater than permitted
[10:58:35] ERROR: Could not sanitize molecule ending on line 68
[10:58:35] ERROR: Explicit valence for atom # 11 B, 4, is greater than permitted
WARNING:root:mol read failure: clusters
The following is the sdf file
clusters
 OpenBabel01042310313D

 30 33  0  0  1  0  0  0  0  0999 V2000
   -1.5093   -3.2766    0.6653 O   0  0  0  0  0  1  0  0  0  0  0  0
   -1.2712   -2.1184    0.4200 C   0  0  2  0  0  3  0  0  0  0  0  0
    0.0521   -1.6942    0.3007 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.4728   -0.4370    0.0305 C   0  0  2  0  0  3  0  0  0  0  0  0
   -0.4104    0.5272   -0.1484 N   0  0  1  0  0  0  0  0  0  0  0  0
   -1.7723    0.2605   -0.0605 C   0  0  2  0  0  3  0  0  0  0  0  0
   -2.7232    1.2681   -0.2460 C   0  0  0  0  0  3  0  0  0  0  0  0
   -4.0685    0.9799   -0.1540 C   0  0  0  0  0  3  0  0  0  0  0  0
   -4.5139   -0.3091    0.1234 C   0  0  0  0  0  3  0  0  0  0  0  0
   -3.5936   -1.3146    0.3094 C   0  0  0  0  0  3  0  0  0  0  0  0
   -2.2277   -1.0413    0.2199 C   0  0  2  0  0  3  0  0  0  0  0  0
    0.0653    2.0379   -0.4730 B   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4754    2.3781   -1.6869 F   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4206    2.8370    0.5306 F   0  0  0  0  0  0  0  0  0  0  0  0
    1.5032    2.1309   -0.5280 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.3278    1.1339   -0.3398 C   0  0  2  0  0  3  0  0  0  0  0  0
    3.7066    1.3956   -0.4267 C   0  0  0  0  0  3  0  0  0  0  0  0
    4.6246    0.3948   -0.2401 C   0  0  0  0  0  3  0  0  0  0  0  0
    4.2009   -0.9055    0.0405 C   0  0  0  0  0  3  0  0  0  0  0  0
    2.8612   -1.1854    0.1302 C   0  0  0  0  0  3  0  0  0  0  0  0
    1.8909   -0.1819   -0.0558 C   0  0  2  0  0  3  0  0  0  0  0  0
    0.7462   -2.4201    0.4344 H   0  0  0  0  0  0  0  0  0  0  0  0
   -2.4090    2.2729   -0.4615 H   0  0  0  0  0  0  0  0  0  0  0  0
   -4.7852    1.7733   -0.3009 H   0  0  0  0  0  0  0  0  0  0  0  0
   -5.5704   -0.5168    0.1918 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.8909   -2.3292    0.5265 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.0066    2.4078   -0.6444 H   0  0  0  0  0  0  0  0  0  0  0  0
    5.6798    0.6103   -0.3096 H   0  0  0  0  0  0  0  0  0  0  0  0
    4.9263   -1.6901    0.1868 H   0  0  0  0  0  0  0  0  0  0  0  0
    2.5714   -2.2028    0.3489 H   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  1  0  0  0
  3  2  1  0  0  0  0
  3 22  1  0  0  0  0
  4  3  1  6  0  0  0
  5 12  1  1  0  0  0
  5  6  1  0  0  0  0
  5  4  1  0  0  0  0
  6  7  1  1  0  0  0
  6 11  1  0  0  0  0
  7  8  1  0  0  0  0
  8  9  1  0  0  0  0
  9 25  1  0  0  0  0
  9 10  1  0  0  0  0
 10 26  1  0  0  0  0
 11 10  1  1  0  0  0
 11  2  1  0  0  0  0
 12 14  1  0  0  0  0
 13 12  1  0  0  0  0
 15 12  1  0  0  0  0
 16 15  1  6  0  0  0
 16 21  1  0  0  0  0
 17 16  1  0  0  0  0
 17 18  1  0  0  0  0
 18 19  1  0  0  0  0
 19 20  1  0  0  0  0
 19 29  1  0  0  0  0
 20 30  1  0  0  0  0
 21  4  1  0  0  0  0
 21 20  1  6  0  0  0
 23  7  1  0  0  0  0
 24  8  1  0  0  0  0
 27 17  1  0  0  0  0
 28 18  1  0  0  0  0
M  END
$$$$
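A quick way to see why RDKit rejects this record is to sum the bond orders on each atom. In the record above, boron (atom 12) is bonded to atoms 5, 13, 14 and 15 with no formal charge, which gives it an explicit valence of 4. The sketch below is a minimal standard-library illustration of that check, not RDKit's sanitization logic. `MAX_VALENCE` is a simplified table assumed for this example, `overvalent_atoms` and `SAMPLE` are hypothetical names, and the parser assumes whitespace-separated V2000 fields (fewer than 100 atoms) and does not handle aromatic bond codes. `SAMPLE` is a hand-written five-atom stand-in showing the same pattern: a neutral boron with four single bonds.

```python
# Simplified neutral-atom valence limits (an assumption for this sketch,
# not the table RDKit actually uses).
MAX_VALENCE = {"H": 1, "B": 3, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}

# Hand-written V2000 record: neutral B bonded to three F and one N.
SAMPLE = """\
demo
  hand-written for this sketch

  5  4  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 B   0  0  0  0  0  0  0  0  0  0  0  0
    1.3000    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3000    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    1.3000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -1.5000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  1  5  1  0  0  0  0
M  END
"""


def overvalent_atoms(molblock):
    """Return (zero_based_index, symbol, valence) for uncharged atoms whose
    summed bond order exceeds MAX_VALENCE."""
    lines = molblock.splitlines()
    # Line 4 of a V2000 record is the counts line: atoms first, then bonds.
    n_atoms, n_bonds = (int(tok) for tok in lines[3].split()[:2])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]

    symbols = [ln.split()[3] for ln in atom_lines]
    # Sixth field of an atom line is the charge code; "0" means uncharged.
    charged = [ln.split()[5] != "0" for ln in atom_lines]

    valence = [0] * n_atoms
    for ln in bond_lines:
        first, second, order = (int(tok) for tok in ln.split()[:3])
        valence[first - 1] += order   # the file uses 1-based atom numbers
        valence[second - 1] += order

    flagged = []
    for idx, (sym, val) in enumerate(zip(symbols, valence)):
        limit = MAX_VALENCE.get(sym)
        if limit is not None and not charged[idx] and val > limit:
            flagged.append((idx, sym, val))
    return flagged


flagged = overvalent_atoms(SAMPLE)
print(flagged)
```

The index is reported zero-based, which is consistent with the `atom # 11 B` in the RDKit message for the twelfth atom of the file. If the fourth bond is meant to be a dative N-B bond, one possible remedy is to put formal charges on the boron and nitrogen before conversion. Whether that is chemically appropriate for these clusters is for the author of the structures to judge.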
It would be important to solve some of the calculation errors by testing mordred with a larger, more diverse repository such as PubChem. The files can be obtained here:
ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/
Running the first 1 million or 10 million molecules will reveal many current issues that can then be fixed. Mordred is fast enough to handle such an amount relatively quickly.
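To limit a test run to the first N molecules without loading a whole PubChem file into memory, the SDF can be streamed record by record. The following is a minimal standard-library sketch. `iter_sdf_records` and `first_n_records` are hypothetical helper names, not part of mordred, and the only assumption about the format is that each record ends with a `$$$$` line.

```python
import io
from itertools import islice


def iter_sdf_records(handle):
    """Yield one SDF record at a time (as text) from an open text handle."""
    buffer = []
    for line in handle:
        buffer.append(line)
        if line.strip() == "$$$$":      # record terminator
            yield "".join(buffer)
            buffer = []
    # Keep a trailing record that lacks the terminator, if it has content.
    if any(line.strip() for line in buffer):
        yield "".join(buffer)


def first_n_records(handle, n):
    """Return the first n records as a list, reading no further than needed."""
    return list(islice(iter_sdf_records(handle), n))


# Tiny stand-in for a real file: three placeholder records.
demo = io.StringIO("a\nM  END\n$$$$\nb\nM  END\n$$$$\nc\nM  END\n$$$$\n")
records = first_n_records(demo, 2)
print(len(records))
```

For the gzip-compressed files on the PubChem server, the same generator can be fed a handle from `gzip.open(path, "rt")`.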
While some of the errors are captured, they are not resolved yet, such as division by zero or tuple index out of range. For the failed kekulization, the compound should be kept, but the values in the descriptor matrix should be all NA.
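The behaviour requested here can be sketched in a few lines: keep one row per compound and store NA wherever a calculation fails. This is only an illustration under stated assumptions, not mordred's implementation or API. `descriptor_row`, `na_row` and `TOY_DESCRIPTORS` are hypothetical names, the "molecule" is a plain dict, and the toy descriptors are built to fail with the two error types mentioned above.

```python
import math


def descriptor_row(mol, descriptors):
    """Compute every descriptor; store NaN where a calculation fails."""
    row = {}
    for name, func in descriptors.items():
        try:
            row[name] = func(mol)
        except (ZeroDivisionError, IndexError):
            row[name] = math.nan    # keep the column, mark the value as NA
    return row


def na_row(descriptors):
    """All-NA row for a compound that could not be prepared at all."""
    return {name: math.nan for name in descriptors}


# Toy descriptors on a dict-based "molecule" (hypothetical, for illustration).
TOY_DESCRIPTORS = {
    "n_atoms": lambda mol: len(mol["atoms"]),
    "bonds_per_ring": lambda mol: len(mol["bonds"]) / mol["n_rings"],
    "third_atom": lambda mol: mol["atoms"][2],
}

acyclic = {"atoms": ("C", "O"), "bonds": ((0, 1),), "n_rings": 0}
row = descriptor_row(acyclic, TOY_DESCRIPTORS)
print(row)
```

Catching errors per descriptor rather than per molecule leaves room for both positions in this thread: a compound that fails one step still gets values for descriptors that do not depend on it, while `na_row` covers the case where the whole row should be NA.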