Get compounds function is too slow

mcs07 / PubChemPy

Python wrapper for the PubChem PUG REST API.

http://pubchempy.readthedocs.io

MIT License

379 stars, 106 forks

Get compounds function is too slow #84

Closed: sumone-compbio closed this issue 5 months ago

sumone-compbio commented 6 months ago

Hi, it takes an average of 2 minutes to fetch a compound's SMILES from its IUPAC name. Is it really this slow or is there something wrong with my code below:

import pubchempy as pcp

def get_smiles_from_iupac(iupac_name):
    compounds = pcp.get_compounds(iupac_name, 'name')

    if compounds:
        smiles = compounds[0].isomeric_smiles if compounds[0].isomeric_smiles else compounds[0].canonical_smiles
        return smiles
    else:
        return None

Thank you

nbehrnd commented 6 months ago

@sumone-compbio No, something other than your function is slowing down the process.

For easier testing, I expanded your function into a larger script:

#!/usr/bin/env python3
"""
name     : runtime.py
purpose  : meter the time of a pubchem query with pubchempy
date     : [2024-03-11 Mon]
"""
import argparse
import time

import pubchempy as pcp

def get_args():
    """get the command-line arguments"""
    parser = argparse.ArgumentParser(
        description="meter the time of a pubchem query with pubchempy",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    parser.add_argument(
        "name",
        help="provide a chemical name",
    )

    return parser.parse_args()

def get_smiles_from_iupac(iupac_name):
    """the shared test function"""
    compounds = pcp.get_compounds(iupac_name, 'name')

    if compounds:
        smiles = compounds[0].isomeric_smiles if compounds[0].isomeric_smiles else compounds[0].canonical_smiles
        return smiles
    else:
        return None

def main():
    """join the functionalities"""

    args = get_args()
    chemical_name = args.name

    time_0 = time.time()
    smiles = get_smiles_from_iupac(chemical_name)
    print(smiles)

    time_1 = time.time()
    print(f"time elapsed (s): {(time_1 - time_0):.2f}")

if __name__ == "__main__":
    main()

In a virtual environment, after a pip install pubchempy to resolve the dependency and a chmod +x ./runtime.py to set the executable bit, I received each answer in a fraction of a second, e.g.

$ ./runtime.py benzene
C1=CC=CC=C1
time elapsed (s): 0.58
$ ./runtime.py pyridine
C1=CC=NC=C1
time elapsed (s): 0.56
$ ./runtime.py water
O
time elapsed (s): 0.52
$ ./runtime.py glucose
C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O
time elapsed (s): 0.52
$ ./runtime.py xylit
C([C@H](C([C@H](CO)O)O)O)O
time elapsed (s): 0.54
$ ./runtime.py iron
[Fe]
time elapsed (s): 0.54
$ ./runtime.py "sodium chloride"
[Na+].[Cl-]
time elapsed (s): 0.76

nbehrnd commented 6 months ago

@sumone-compbio Without additional context from you, my guess is that you sent too many requests to PubChem with too little delay between them (see PubChem's documentation on programmatic access to the database).
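To illustrate what "enough delay" could look like, here is a minimal sketch of a client-side throttle. It assumes PubChem's usage policy of at most 5 requests per second (please check the current documentation for the exact limits); `throttled_map` and `lookup_smiles` are hypothetical helper names, not part of PubChemPy.

```python
import time


def throttled_map(func, items, max_per_second=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call func(item) for each item, starting at most max_per_second calls per second."""
    min_interval = 1.0 / max_per_second
    results = []
    last_start = None
    for item in items:
        if last_start is not None:
            # Wait out the remainder of the interval since the previous call began.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(func(item))
    return results


def lookup_smiles(names):
    """Hypothetical wrapper: resolve each name with PubChemPy, throttled."""
    import pubchempy as pcp  # third-party; imported lazily so the helper above stays standalone

    def first_smiles(name):
        compounds = pcp.get_compounds(name, 'name')
        if not compounds:
            return None
        return compounds[0].isomeric_smiles or compounds[0].canonical_smiles

    return throttled_map(first_smiles, names)
```

Calling lookup_smiles(["benzene", "pyridine"]) would then space the requests at least 0.2 s apart. The limit of 5 per second is an assumption about the server's policy, so a lower value may be the safer choice on a shared network.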

sumone-compbio commented 6 months ago

@nbehrnd I have around 2000 compounds on my list. Is there a faster way to fetch the SMILES from their IUPAC names?

sumone-compbio commented 6 months ago

Also, I tried the code above, but unfortunately it made no difference. Did you try my code on your system? Did it take 2 minutes to fetch the SMILES for a compound?

nbehrnd commented 6 months ago

@sumone-compbio Perhaps this is handled more efficiently as a bulk download, which that page mentions in the red box (with links on how to do this efficiently). PUG-SOAP and its Python bindings may be a separate interface to consider, too. So far, I have not used either of the two.
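Short of a bulk download, one lighter option that stays within PubChemPy might be to request only the SMILES properties instead of the full compound record that get_compounds retrieves. The sketch below is untested against the live service: `smiles_for_names` is a hypothetical helper, and the property names 'IsomericSMILES' and 'CanonicalSMILES' are assumptions that may differ between PubChem and PubChemPy versions.

```python
def smiles_for_names(names, get_properties,
                     smiles_keys=('IsomericSMILES', 'CanonicalSMILES')):
    """Map each name to a SMILES string (or None) using a property-only lookup.

    get_properties is expected to behave like pubchempy.get_properties:
    it takes (properties, identifier, namespace) and returns a list of dicts.
    """
    result = {}
    for name in names:
        rows = get_properties(list(smiles_keys), name, 'name')
        smiles = None
        if rows:
            first = rows[0]
            # Prefer the first key in smiles_keys that holds a non-empty value.
            smiles = next((first[k] for k in smiles_keys if first.get(k)), None)
        result[name] = smiles
    return result


def smiles_for_names_pubchem(names):
    """Hypothetical convenience wrapper bound to PubChemPy."""
    import pubchempy as pcp  # third-party; imported lazily
    return smiles_for_names(names, pcp.get_properties)
```

This still issues one request per name, so it does not lift any server-side throttle; it only reduces the amount of data transferred per compound.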

I presume PubChem wants to prevent (distributed) denial-of-service attacks by anonymous users and has therefore installed "a throttle" to constrain the traffic. This is similar to the GitHub API, which allows up to 60 unauthenticated requests per hour but, e.g., up to 5k requests/h if you use a personal access token.

Outside PubChem, maybe NIH grants permission to mirror the database to reduce the traffic load on their infrastructure (pub in PubChem as in public?). One example where I know this is possible is the Crystallography Open Database (COD), initiated by researchers at Vilnius University (Lithuania); there, mirroring is encouraged (list of known participants; I just noticed that their principal landing page is down).

sumone-compbio commented 6 months ago

@nbehrnd I see it now. Thank you so much. However, even for a single compound, e.g., aspirin, it took 2 minutes using my code.

nbehrnd commented 6 months ago

@sumone-compbio The test script uses your function (the only edit to it was the addition of the docstring). This "envelope" adds some work for the Python interpreter, but since it runs locally, I do not think it explains why a request of about 0.5 s would now take 2 s.

nbehrnd commented 6 months ago

@sumone-compbio Two minutes (as in 120 seconds) now ... this sounds very much like a throttle. Indeed, it is something I experienced with the GitHub API without an access token last week while processing a list of entries: at first, "the next request by this IP address is delayed for 50s" (which at that time was still faster than fetching the data with a normal request/Beautiful Soup 4). But when I came back, the penalty had increased substantially ("wait an hour for the next request to be processed by GitHub"). At that point, I generated a token and modified my script accordingly.
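If a throttle is indeed the cause, a common client-side response is to retry with increasing pauses rather than hammering the server. The following is a minimal sketch of exponential backoff; `call_with_backoff` is a hypothetical helper, and which exception a throttled PubChem request raises in PubChemPy is an assumption to verify (PubChemPy exports PubChemHTTPError, which is what I would try first).

```python
import time


def call_with_backoff(func, *args, retries=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(*args); on failure wait base_delay, 2x, 4x, ... and try again.

    After `retries` failed retries the last exception is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == retries:
                raise
            # Double the pause after every failed attempt.
            sleep(base_delay * 2 ** attempt)


def get_compounds_patiently(name):
    """Hypothetical wrapper around pubchempy.get_compounds with backoff."""
    import pubchempy as pcp  # third-party; imported lazily
    return call_with_backoff(pcp.get_compounds, name, 'name',
                             retry_on=(pcp.PubChemHTTPError,))
```

Restricting retry_on to the HTTP error keeps genuine bugs (e.g., a TypeError) from being silently retried.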

sumone-compbio commented 5 months ago

@nbehrnd It was my campus Wi-Fi that made the process slow. I tried it with a connection outside the campus and it worked as you reported. Thank you so much.
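For anyone landing here with a similar symptom, timing a bare HTTP request next to the PubChemPy call can help tell a slow network apart from a slow library. This is only a sketch: `time_call` and `fetch_cids_raw` are hypothetical helpers, and the URL follows the PUG REST pattern for a name-to-CID lookup as I understand it.

```python
import time
import urllib.parse
import urllib.request

# Assumed PUG REST pattern for resolving a name to CIDs as plain text.
PUG_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{}/cids/TXT"


def fetch_cids_raw(name, timeout=30):
    """Fetch CIDs for a name with the standard library only (no PubChemPy)."""
    url = PUG_URL.format(urllib.parse.quote(name))
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode().split()


def time_call(func, *args, repeats=3, clock=time.perf_counter):
    """Return the wall-clock duration in seconds of each of `repeats` calls."""
    durations = []
    for _ in range(repeats):
        start = clock()
        func(*args)
        durations.append(clock() - start)
    return durations
```

If time_call(fetch_cids_raw, "aspirin") is already slow on one network and fast on another, the connection (proxy, DNS, firewall) is the likely culprit rather than PubChemPy.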