MelihDarcanxyz opened 1 year ago
I agree that this is an important topic for pypath "v2"; just my two cents: collaborators have built a download / caching application named Buffy, basically a smart downloader with file caching functionality, which in my eyes would be suitable for managing pypath downloads. We could try whether it works for our use case and maybe add missing functionality there, directly. This could replace download and temp file handling of pypath.
I am currently busy with finalising the BioCypher manuscript, so if there is an immediate need to process something (e.g. these KEGG data), then I suggest adding a preprocessing module that gets the data independently, and then working on the aggregated data. It may not be worth modifying the curl submodule to do async if we want to replace it in the medium term. Also, the KEGG problem may not be general enough to warrant implementing something as extensive as this change.
Talked briefly about it with Tim @motey, the Buffy maintainer. We could implement async functionality there.
Buffy has a server/client architecture and in principle async calls are possible. Maybe a batch API in the client would be useful. @MelihDarcanxyz could you provide some use-cases, examples, urls i can improve/adapt Buffy around?
Very simplified, Buffy is a wrapper around the Python requests module. Until now this library worked for me in all HTTP resource cases.
Hi, thanks for looking at this, it's indeed something we should address soon. The curl module in pypath is very old and not well designed or maintained, so it is no surprise that the code is convoluted and difficult to use.
Buffy looks great, it's definitely an option we should consider. It has many of the features that we need, maybe we have to do some adjustments, but that might be cheaper than developing from scratch.
The features that we need are full control over all HTTP parameters: GET, POST, multipart forms, binary payload, HTTP headers (both request & response), cookies, 30x redirects, disabling ALPN, downgrading to HTTP 1.1, retries, timeouts, FTP.
And the two most important:
The library should not add headers or do session management on its own; or even better, it does, but it can be disabled.
It should be able to provide detailed debug traces, like curl -vvv.
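To illustrate the first of those two requirements: full visibility means we can inspect exactly what a library would put on the wire before anything is sent. A minimal sketch with the requests library (assuming requests as a candidate backend; the URL and payload here are just hypothetical examples):

```python
import requests

# Build a request without sending it, to see exactly which headers
# the library injects on its own (hypothetical example payload).
req = requests.Request(
    'POST',
    'https://example.org/graphql',
    json={'query': '{ targets { name } }'},
)
prepared = req.prepare()

# `prepared.headers` shows what the library added by itself, e.g. the
# Content-Type it derived from the json payload.
print(prepared.method)           # POST
print(dict(prepared.headers))
```

Note that a requests `Session` would add further defaults (e.g. User-Agent), which is exactly the kind of behaviour that should be inspectable and, ideally, switchable off.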
Looking at Buffy, two important points: can we start the server automatically, and without Docker? And the same for redis? Also, if redis is software that has to be installed independently, that complicates our installation; shall we add a Python key-value storage backend?
@motey Maybe based on the points above, you can advise how suitable Buffy is for the purpose?
Originally I would have preferred to have the cache management in a completely separate module. We should still decide about this.
I was thinking that Buffy could be this "separate module". Depending on how aligned our feature wishes are with what Tim has in mind for Buffy, we could just contribute there. Or we could write a shallow wrapper around Buffy that adds our required features.
@slobentanzer I talked about having the download manager and cache manager modules separated
@deeenes I see, got it
@motey wrote:
Buffy has a server/client architecture and in principle async calls are possible. Maybe a batch API in the client would be useful. @MelihDarcanxyz could you provide some use-cases, examples, urls i can improve/adapt Buffy around?
Very simplified, Buffy is a wrapper around the Python requests module. Until now this library worked for me in all HTTP resource cases.
Of course:
```python
import json

# `Curl` is pypath's downloader class; the exact import path depends on
# the pypath version (e.g. `from pypath.share.curl import Curl`).

def pharos_general(query, variables=None):

    url = "https://pharos-api.ncats.io/graphql"
    req_headers = {
        "Accept-Encoding": "gzip, deflate, br",
        "Content-Type": "application/json",
        "Connection": "keep-alive",
        "DNT": "1",
        "Origin": "https://pharos-api.ncats.io",
    }

    if variables:
        binary_data = json.dumps({
            'query': query,
            'variables': variables,
        })
    else:
        binary_data = json.dumps({
            'query': query,
        })

    binary_data = binary_data.encode('utf-8')

    c = Curl(
        url=url,
        req_headers=req_headers,
        binary_data=binary_data,
        compressed=True,
        compr='gz',
    )

    return json.loads(c.result)['data']
```
```python
def get_targets(chunk_size=100):

    step = 0
    result = []

    variables = {
        'chunk_size': chunk_size,
        'step': step,
    }

    query = '''
        query targetDetails($chunk_size: Int! $step: Int!) {
            targets {
                targets(top: $chunk_size skip: $step) {
                    name
                    sym
                    uniprot
                    expressions(top: 10000) {
                        expid
                        type
                        tissue
                        value
                        uberon {
                            name
                            uid
                        }
                        pub {
                            pmid
                        }
                    }
                    gtex {
                        tissue
                        tpm
                        log2foldchange
                        uberon {
                            name
                            uid
                        }
                    }
                    orthologs(top: 10000) {
                        name
                        species
                        orid
                        dbid
                        geneid
                        source
                    }
                    ligands(top: 10000 isdrug: true) {
                        ligid
                        name
                        activities(all: true) {
                            actid
                            type
                            moa
                            value
                        }
                    }
                    xrefs(source: "Ensembl") {
                        name
                    }
                    diseases(top: 10000) {
                        name
                        mondoID
                        dids {
                            id
                            dataSources
                            doName
                        }
                    }
                }
            }
        }
    '''

    while True:

        response = pharos_general(query, variables)['targets']['targets']

        if not response:
            break

        result.extend(response)
        variables['step'] += chunk_size

    return result
```
pharos_general is the general GraphQL API handler; get_targets uses that handler to get the required protein attributes. Since this query is big, paginating 1000 records at a time is error prone: we are highly likely to get timeout errors.
If we decrease that number to 100, it takes very long to get through all ~14000 targets. One query takes approximately 5-10 seconds, so we could benefit from async requests, as the order of the responses doesn't matter. A similar example would be getting drug-drug interactions from KEGG: making thousands of REST requests (14000, again) without async support takes a lot of time.
I think I've made my point; please ask me anything about the use cases if you want to.
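The pagination loop above is strictly sequential, but the pages are independent, so they could be fetched concurrently. A minimal sketch of the pattern with asyncio (fetch_page here is a stub standing in for a real non-blocking HTTP call, e.g. via aiohttp; all names are hypothetical):

```python
import asyncio

async def fetch_page(step, chunk_size):
    # Stand-in for a real non-blocking HTTP request; we just simulate
    # latency and return fake records.
    await asyncio.sleep(0.01)
    return [f'target-{i}' for i in range(step, step + chunk_size)]

async def get_targets_async(total=1000, chunk_size=100, concurrency=10):
    # Limit the number of requests in flight to be polite to the API.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(step):
        async with sem:
            return await fetch_page(step, chunk_size)

    pages = await asyncio.gather(
        *(bounded(step) for step in range(0, total, chunk_size))
    )
    # Completion order does not matter; gather preserves input order.
    return [record for page in pages for record in page]

targets = asyncio.run(get_targets_async())
print(len(targets))  # 1000
```

A real implementation would not know `total` up front; one option is to fetch batches of pages concurrently and stop once an empty page appears.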
@deeenes wrote:
Buffy looks great, it's definitely an option we should consider. It has many of the features that we need, maybe we have to do some adjustments, but that might be cheaper than developing from scratch.
I would love to provide/invest in some new features for Buffy for external use cases, and I also welcome contributions :) :rocket:
The features that we need are full control over all HTTP parameters: GET, POST, multipart forms, binary payload, HTTP headers (both request & response), cookies, 30x redirects, disabling ALPN, downgrading to HTTP 1.1, retries, timeouts, FTP. And the two most important:
The library should not add headers or do session management on its own; or even better, it does, but it can be disabled.
It should be able to provide detailed debug traces, like curl -vvv.
For any Buffy request, the method, header fields and request body can already be fully customized, see https://git.connect.dzd-ev.de/dzdpythonmodules/buffy/-/blob/main/buffy/buffypyclient/buffypyclient.py#L685
One spontaneous idea for an improvement would be just a new param containing a dict that will pass through any other params to requests.{http-request-method}()
But yea, the design philosophy / base idea is to provide reasonable default values wherever we can, but give the client/caller full control over the request params if needed. Any suggestions for improvements are welcome :)
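The pass-through idea could look roughly like this (a hypothetical sketch, not Buffy's actual API; `fetch` and `requests_kwargs` are made-up names):

```python
import requests

def fetch(url, method='get', requests_kwargs=None):
    # Hypothetical pass-through: everything in `requests_kwargs` goes
    # straight to the matching requests.<method>() call, so the caller
    # keeps full control over headers, timeouts, cookies, etc.
    requests_kwargs = requests_kwargs or {}
    return getattr(requests, method.lower())(url, **requests_kwargs)
```

The wrapper then never has to grow a dedicated parameter for every HTTP option the underlying library supports.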
Looking at Buffy, two important points: can we start the server automatically, and without docker?
Easy :)
```python
from multiprocessing import Process
from buffy.buffyserver.main import start_buffy_server

# Returns a (backend_process, api_webserver_process) tuple of Process objects
backend_process, api_webserver_process = start_buffy_server()
```
And same for redis?
Maybe, but i guess that will be a dirty hack :)
Also, if redis is a software to be independently installed, that complicates our installation, shall we add a Python key-value storage backend?
YES! redis is just my go-to prototyping database. Actually, it is bad at being a database for Buffy :) Redis is better at being a f**** fast cache than at being a persistent DB. I designed Buffy with an exchangeable storage backend in mind. The interface is fairly manageable: https://git.connect.dzd-ev.de/dzdpythonmodules/buffy/-/blob/main/buffy/buffyserver/backend/storage/interface.py (I just realized that there is still some deadwood and some missing docstrings in the interface; I will improve that soon™.) Writing a sql/neo4j/mongodb drop-in replacement module will be one day of work, I guess. As an example: the redis implementation is ~260 lines: https://git.connect.dzd-ev.de/dzdpythonmodules/buffy/-/blob/main/buffy/buffyserver/backend/storage/redis_storage.py
A Python-based key-value store sounds very attractive, but I am a little bit afraid regarding performance. Let's maybe do an evaluation, and let's see which solutions are on the "market" already. A Python-native solution that allows Buffy to run in one context would be great!
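As a starting point for such an evaluation, the standard library already ships a persistent option: sqlite3. A minimal key-value sketch (my own illustration; a real backend would implement Buffy's storage interface linked above, not this ad hoc API):

```python
import sqlite3

class SqliteKV:
    """Tiny persistent key-value store on top of stdlib sqlite3.

    Illustration only: shows that a Python-native, zero-install
    backend is feasible, not how Buffy's interface would wrap it.
    """

    def __init__(self, path=':memory:'):
        self.con = sqlite3.connect(path)
        self.con.execute(
            'CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)'
        )

    def set(self, key, value):
        # Upsert: insert, or overwrite the value if the key exists.
        self.con.execute(
            'INSERT INTO kv (k, v) VALUES (?, ?) '
            'ON CONFLICT(k) DO UPDATE SET v = excluded.v',
            (key, value),
        )
        self.con.commit()

    def get(self, key, default=None):
        row = self.con.execute(
            'SELECT v FROM kv WHERE k = ?', (key,)
        ).fetchone()
        return row[0] if row else default

store = SqliteKV()
store.set('cache/url-hash', b'response bytes')
print(store.get('cache/url-hash'))  # b'response bytes'
```

Whether sqlite3 is fast enough under concurrent server load is exactly the kind of question the evaluation would have to answer.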
Are there any updates on this topic?
sorry, a little busy atm with other stuff. i have your use-case on my todo list, but i don't know if i'll manage to write a test case for it this year.
Regarding an alternative storage implementation to replace redis (via Docker): atm i don't have many reasons to work on that :D i would prefer to enable your framework to handle containers. That would also open up a whole new universe of tools.
No problem at all. Sorry for disturbing you @motey, it's an open source project after all, you may not have time to look at it. I just wanted to learn the status so I can act on it.
There should be an async option, since some API requests take long to respond and we may have 14000 more requests to make, which is a real case with the KEGG database. I tried to implement the __await__ method on the Curl class, but I wasn't successful, as the curl module is monolithic and obsolete.
Other than that, there are so many arbitrary options and so much redundant code in Curl. I think it violates the simplicity rule: do one thing and do it well. I expect it to just connect to the web and return a response, so I can handle the outcome. In contrast, this module does many complex things that shouldn't be its responsibility, and it doesn't fit everything. I myself struggled with a different problem on every new database I tried to add. There were some connections which couldn't be made with our curl module, but could be made with 3 different libraries: the external pycurl library, the external requests library, and the standard urllib library.
And, for example, aiohttp would do nicely for async calls. I understand that there may be a need for caching; I don't know the caching capabilities of these libraries, but I think we should keep things small and modular.
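For reference, the __await__ mechanism itself is simple once the underlying I/O is non-blocking; the hard part is exactly that the curl module's I/O is not. A minimal sketch of an awaitable download class (hypothetical names; the network call is stubbed):

```python
import asyncio

class AsyncCurl:
    # Hypothetical sketch of an awaitable downloader class.
    def __init__(self, url):
        self.url = url
        self.result = None

    async def _fetch(self):
        # Stand-in for a real non-blocking HTTP request (e.g. aiohttp).
        await asyncio.sleep(0)
        self.result = f'payload from {self.url}'
        return self

    def __await__(self):
        # Delegating to a coroutine makes `await AsyncCurl(url)` work.
        return self._fetch().__await__()

async def main():
    c = await AsyncCurl('https://example.org')
    return c.result

print(asyncio.run(main()))  # payload from https://example.org
```

With a blocking backend like pycurl, __await__ would instead have to hand the work off to a thread (e.g. asyncio's run_in_executor), which is why retrofitting the existing monolithic class is awkward.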
I said 'I' throughout this issue and it reflects my own ideas, but this topic has been discussed in a meeting with @slobentanzer and our professor. We agree that this topic may need another meeting.
So, is curl going to be replaced in the near future? How should we start the change? Should we continue using it?