mcs07 / PubChemPy

Python wrapper for the PubChem PUG REST API.
http://pubchempy.readthedocs.io
MIT License

Inconsistencies in Isomeric SMILES Data Retrieval Using PubChemPy Compared to Previous Data #87

Open kyokim-mpu opened 4 months ago

kyokim-mpu commented 4 months ago

Dear Developer,

Hello.

We use PubChemPy to obtain isomeric SMILES for our daily research. It has been very helpful, and I greatly appreciate it.

To ensure the accuracy of the data when obtaining SMILES, I would like to ask a question based on a discrepancy I noticed when comparing data from the previous year.

■Events When I obtained isomeric SMILES from compound names using PubChemPy, the results differed between FY2022 and FY2023. Specifically, 76 compounds whose SMILES could be retrieved in FY2022 could not be retrieved in FY2023.

When we searched and compared the compound names and SMILES obtained in FY2022 with the current data from PubChem, we observed the following two patterns:

  1. Search results appeared, but some SMILES were not retrieved.
     Example: POLIDOCANOL and CCCCCCCCCCCCOCCOCCOCCOCCOCCOCCOCCOCCOCCO
     Example: METHYLENEDIOXYMETHAMPHETAMINE and CC(CC1=CC2=C(C=C1)OCO2)NC
     Example: SITOSTEROL and CCC@HC)C)C(C)C

  2. Search results appeared, and SMILES were retrieved, but the information was neither the best match nor relevant.
     Example: CHROME ALUM and OS(=O)(=O)O.OS(=O)(=O)[O-].[K+].[Cr]
     Example: EGG YOLK and CCCCCCCCC/C=C/CCCCCCCC(=O)OCC(COP(=O)(O)OCC[N+](C)(C)C)OC(=O)CCCCCCC/C=C/CCCCCCCCC
     Example: BROMELAINS and CCCC(C)C1(C(=O)NC(=O)N=C1[O-])CC.[Na+]
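For illustration, the two patterns above could be separated with a comparison along these lines. This is only a minimal sketch: the function name `classify_changes` and the placeholder names and SMILES strings in the two dictionaries are hypothetical and are not the actual data.

```python
def classify_changes(previous, current):
    """Compare two name -> SMILES mappings from different retrieval runs.

    Returns (missing, changed):
      missing: names that had a SMILES before but have none now (pattern 1)
      changed: names whose SMILES string differs between runs (pattern 2)
    """
    missing, changed = [], []
    for name, old_smiles in previous.items():
        new_smiles = current.get(name)
        if not new_smiles:
            # Absent, None, or empty string: nothing was retrieved this time.
            missing.append(name)
        elif new_smiles != old_smiles:
            # Plain string comparison; see the note below the block.
            changed.append(name)
    return sorted(missing), sorted(changed)


# Hypothetical placeholder data, not real PubChem results.
fy2022 = {"NAME_A": "SMILES_A", "NAME_B": "SMILES_B", "NAME_C": "SMILES_C"}
fy2023 = {"NAME_A": "SMILES_A", "NAME_B": None, "NAME_C": "SMILES_X"}

missing, changed = classify_changes(fy2022, fy2023)
print("not retrieved:", missing)
print("different SMILES:", changed)
```

Note that this compares SMILES as plain strings, so two different strings that encode the same structure would still be reported as changed.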

■Question I am asking because I have no way to trace how the data changed between FY2022 and FY2023.

  1. Does searching by compound name on the PubChem website use a different lookup process from a name search through PubChemPy's API, and can the two return different results? If so, we believe the discrepancies above could be due to this difference.

  2. Could it simply be that the data was updated between 2022 and 2023, and thus, certain compounds could not be retrieved? If the acquisition process is the same, we believe it is simply due to data updates.

Are there any other possible causes or ways to confirm this?
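One way to make future comparisons easier might be to store every CID that a name resolves to, together with its isomeric SMILES, at retrieval time. The sketch below assumes PubChemPy's `get_compounds(name, "name")` call and the `cid` and `isomeric_smiles` attributes of the returned compounds; the helper name `fetch_records` and its `lookup` parameter are hypothetical and exist only so the function can be tried without network access.

```python
def fetch_records(name, lookup=None):
    """Return a (cid, isomeric_smiles) pair for every compound matched to name.

    lookup is a hypothetical hook: any callable that takes a name and
    returns objects with .cid and .isomeric_smiles attributes.
    By default it performs a PubChemPy name search.
    """
    if lookup is None:
        # Imported lazily so the function can be defined without PubChemPy;
        # the default lookup needs network access to PubChem.
        import pubchempy as pcp

        def lookup(query):
            return pcp.get_compounds(query, "name")

    return [(c.cid, c.isomeric_smiles) for c in lookup(name)]
```

Calling `fetch_records("POLIDOCANOL")` each year and saving the result would show whether a name later maps to a different CID, to several CIDs, or to none at all.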

I really appreciate any help you can provide.