mcs07 / PubChemPy

Python wrapper for the PubChem PUG REST API.
http://pubchempy.readthedocs.io
MIT License

Errors in the encoding of some bits #70

Open salvogalati opened 1 year ago

salvogalati commented 1 year ago

Hi, I was checking the code behind the PubChem fingerprint generation. I did some comparisons between fingerprints calculated with your code and those calculated with PyFingerprint, which uses the CDK library, and noticed some differences. First, for bits in the range 0-98, SMARTS patterns are not used. As a result, when carbons are counted, for example, only aliphatic carbons are considered because the corresponding key is C, so the counting and encoding are incorrect. Second, the bits in the range 115-231 each require two conditions to be met. For example, bits 116 and 117 are defined as ">= 1 saturated or aromatic carbon-only ring size 3" and ">= 1 saturated or aromatic nitrogen-containing ring size 3", respectively. A cyclopropane ring should therefore set bit 116 but not bit 117; with your code, however, both bits are set.
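To illustrate the two points, here is a minimal sketch in plain Python. It is not the PubChemPy or CDK code: the names `count_carbons` and `ring_size3_bits` are hypothetical, the SMILES tokenizer only covers simple cases, and the ring check assumes each ring is already known to be saturated or aromatic.

```python
import re

# One token per atom: bracket atoms first, then two-letter halogens, then the
# one-letter organic subset. Lowercase letters denote aromatic atoms in SMILES.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, after an optional isotope number.
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def count_carbons(smiles):
    """Count aliphatic ('C') and aromatic ('c') carbon atoms in a SMILES string."""
    total = 0
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_SYMBOL.match(token)
            token = match.group(1) if match else ""
        if token in ("C", "c"):
            total += 1
    return total


def ring_size3_bits(rings):
    """Evaluate bits 116 and 117 for rings given as lists of element symbols.

    Assumption: every ring passed in is already known to be saturated or aromatic.
    """
    three_membered = [ring for ring in rings if len(ring) == 3]
    return {
        116: any(all(atom == "C" for atom in ring) for ring in three_membered),
        117: any("N" in ring for ring in three_membered),
    }


# Matching only the key "C" misses the aromatic carbons of benzene.
print("c1ccccc1".count("C"), count_carbons("c1ccccc1"))  # 0 6
# Cyclopropane: carbon-only ring, no nitrogen.
print(ring_size3_bits([["C", "C", "C"]]))  # {116: True, 117: False}
```

With this sketch, aziridine (`[["C", "C", "N"]]`) sets bit 117 but not bit 116, which is the behaviour I would expect from the bit definitions.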

I hope the bugs I reported can be corrected; otherwise, I would be glad to have an explanation of my mistake.

Thank you for your help, Salvatore

nbehrnd commented 1 year ago

> I did some comparisons between fingerprints calculated with your code and those calculated with PyFingerprint, which uses the CDK library, and noticed some differences.

If this is your observation, consider providing the data used in the test so that others can attempt to replicate your findings. The outputs of pubchempy and pyfingerprint are then easier to compare with each other (e.g., in a diff view of the corresponding logs), which helps to resolve discrepancies and correct errors.
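As a rough sketch of such a comparison, two fingerprints written as bit strings can be diffed position by position. The function name `differing_bits` and the example strings are made up for illustration; they are not output from either library.

```python
def differing_bits(fp_a, fp_b):
    """Return the zero-based positions at which two bit strings differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return [i for i, (a, b) in enumerate(zip(fp_a, fp_b)) if a != b]


# Made-up five-bit examples, one string per tool.
print(differing_bits("10110", "10011"))  # [2, 4]
```

Listing the positions alongside the molecule that produced them makes it easier to look up the affected bit definitions.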

When raising an issue on GitHub, you can substantiate your findings by attaching a file; to find this option, hover the mouse over the lower rim of the input box. The attachment may be a text file, a log, or, e.g., a Python script; as long as it has the file extension .txt, GitHub will accept it. For a larger file (e.g., an .sdf file containing many molecular structures) or a collection of files, a .zip archive is often a useful alternative. As a courtesy, also include a brief descriptive readme (what setup was used: OS, versions of Python, pubchempy, and pyfingerprint, etc.).
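One possible way to collect the setup details for such a readme is a short script like the one below. It is only a sketch: the function name `environment_report` is hypothetical, and the distribution names `PubChemPy` and `pyfingerprint` are assumptions that may need adjusting to match how the packages were installed.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("PubChemPy", "pyfingerprint")):
    """Collect OS, Python version, and installed package versions."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Assumed distribution name not found in this environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

The printed lines can be pasted into the readme as they are.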