MelihDarcanxyz opened 1 year ago
I agree that this is an important topic for pypath "v2"; just my two cents: collaborators have built a download / caching application named Buffy, basically a smart downloader with file caching functionality, which in my eyes would be suitable for managing pypath downloads. We could try whether it works for our use case and maybe add missing functionality there, directly. This could replace download and temp file handling of pypath.
I am currently busy with finalising the BioCypher manuscript, so if there is an immediate need to process something (e.g. these KEGG data), then I suggest adding a preprocessing module that gets the data independently, and then working on the aggregated data. It may not be worth modifying the curl submodule to do async if we want to replace it in the medium term. Also, the KEGG problem may not be general enough to warrant implementing something as extensive as this change.
Talked briefly about it with Tim @motey, the Buffy maintainer. We could implement async functionality there.
Buffy has a server/client architecture and in principle async calls are possible. Maybe a batch API in the client would be useful. @MelihDarcanxyz could you provide some use-cases, examples, urls i can improve/adapt Buffy around?
Very simplified, Buffy is a wrapper around the Python requests module. Until now this library worked for me in all HTTP resource cases.
Hi, thanks for looking at this, it's indeed something we should address soon. The curl module in pypath is very old and not well designed or maintained, so it is no surprise that the code is convoluted and difficult to use.
Buffy looks great, it's definitely an option we should consider. It has many of the features that we need, maybe we have to do some adjustments, but that might be cheaper than developing from scratch.
The features that we need are full control over all HTTP parameters: GET, POST, multipart forms, binary payload, HTTP headers (both request & response), cookies, 30x redirects, disabling ALPN, downgrading to HTTP 1.1, retries, timeouts, FTP.
And the two most important:
The library should not add headers or do session management on its own; or even better, it does, but it can be disabled.
It should be able to provide detailed debug traces, like curl -vvv.
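To illustrate the first of those two requirements: full visibility means we can inspect exactly what a library would put on the wire before anything is sent. A minimal sketch with the requests library (assuming requests as a candidate backend; the URL and payload here are just hypothetical examples):

```python
import requests

# Build a request without sending it, to see exactly which headers
# the library injects on its own (hypothetical example payload).
req = requests.Request(
    'POST',
    'https://example.org/graphql',
    json={'query': '{ targets { name } }'},
)
prepared = req.prepare()

# `prepared.headers` shows what the library added by itself, e.g. the
# Content-Type it derived from the json payload.
print(prepared.method)           # POST
print(dict(prepared.headers))
```

Note that a requests `Session` would add further defaults (e.g. User-Agent), which is exactly the kind of behaviour that should be inspectable and, ideally, switchable off.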
Looking at Buffy, two important points: can we start the server automatically, and without Docker? And the same for redis? Also, if redis is software that has to be installed independently, that complicates our installation; shall we add a Python key-value storage backend?
@motey Maybe based on the points above, you can advise how suitable Buffy is for the purpose?
Originally I would have preferred to have the cache management in a completely separate module. We should still decide about this.
I was thinking that Buffy could be this "separate module". Depending on how aligned our feature wishes are with what Tim has in mind for Buffy, we could just contribute there. Or we could write a shallow wrapper around Buffy that adds our required features.
@slobentanzer I talked about having the download manager and cache manager modules separated
@deeenes I see, got it
@motey wrote:
Buffy has a server/client architecture and in principle async calls are possible. Maybe a batch API in the client would be useful. @MelihDarcanxyz could you provide some use-cases, examples, urls i can improve/adapt Buffy around?
Very simplified, Buffy is a wrapper around the Python requests module. Until now this library worked for me in all HTTP resource cases.
Of course:
```python
import json

# `Curl` is pypath's downloader class; the exact import path depends on
# the pypath version (e.g. `from pypath.share.curl import Curl`).

def pharos_general(query, variables=None):

    url = "https://pharos-api.ncats.io/graphql"
    req_headers = {
        "Accept-Encoding": "gzip, deflate, br",
        "Content-Type": "application/json",
        "Connection": "keep-alive",
        "DNT": "1",
        "Origin": "https://pharos-api.ncats.io",
    }

    if variables:
        binary_data = json.dumps({
            'query': query,
            'variables': variables,
        })
    else:
        binary_data = json.dumps({
            'query': query,
        })

    binary_data = binary_data.encode('utf-8')

    c = Curl(
        url=url,
        req_headers=req_headers,
        binary_data=binary_data,
        compressed=True,
        compr='gz',
    )

    return json.loads(c.result)['data']
```
```python
def get_targets(chunk_size=100):

    step = 0
    result = []

    variables = {
        'chunk_size': chunk_size,
        'step': step,
    }

    query = '''
        query targetDetails($chunk_size: Int! $step: Int!) {
            targets {
                targets(top: $chunk_size skip: $step) {
                    name
                    sym
                    uniprot
                    expressions(top: 10000) {
                        expid
                        type
                        tissue
                        value
                        uberon {
                            name
                            uid
                        }
                        pub {
                            pmid
                        }
                    }
                    gtex {
                        tissue
                        tpm
                        log2foldchange
                        uberon {
                            name
                            uid
                        }
                    }
                    orthologs(top: 10000) {
                        name
                        species
                        orid
                        dbid
                        geneid
                        source
                    }
                    ligands(top: 10000 isdrug: true) {
                        ligid
                        name
                        activities(all: true) {
                            actid
                            type
                            moa
                            value
                        }
                    }
                    xrefs(source: "Ensembl") {
                        name
                    }
                    diseases(top: 10000) {
                        name
                        mondoID
                        dids {
                            id
                            dataSources
                            doName
                        }
                    }
                }
            }
        }
    '''

    while True:

        response = pharos_general(query, variables)['targets']['targets']

        if not response:
            break

        result.extend(response)
        variables['step'] += chunk_size

    return result
```
pharos_general is the general GraphQL API handler; get_targets uses that handler to get the required protein attributes. Since this query is big, paginating 1000 records at a time is error prone: we are highly likely to get timeout errors.
If we decrease that number to 100, it takes very long to get through all ~14000 targets. One query takes approximately 5-10 seconds, so we could benefit from async requests, as the order of the responses doesn't matter. A similar example would be getting drug-drug interactions from KEGG: making thousands of REST requests (14000, again) without async support takes a lot of time.
I think I've made my point; please ask me anything about the use cases if you want to.
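The pagination loop above is strictly sequential, but the pages are independent, so they could be fetched concurrently. A minimal sketch of the pattern with asyncio (fetch_page here is a stub standing in for a real non-blocking HTTP call, e.g. via aiohttp; all names are hypothetical):

```python
import asyncio

async def fetch_page(step, chunk_size):
    # Stand-in for a real non-blocking HTTP request; we just simulate
    # latency and return fake records.
    await asyncio.sleep(0.01)
    return [f'target-{i}' for i in range(step, step + chunk_size)]

async def get_targets_async(total=1000, chunk_size=100, concurrency=10):
    # Limit the number of requests in flight to be polite to the API.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(step):
        async with sem:
            return await fetch_page(step, chunk_size)

    pages = await asyncio.gather(
        *(bounded(step) for step in range(0, total, chunk_size))
    )
    # Completion order does not matter; gather preserves input order.
    return [record for page in pages for record in page]

targets = asyncio.run(get_targets_async())
print(len(targets))  # 1000
```

A real implementation would not know `total` up front; one option is to fetch batches of pages concurrently and stop once an empty page appears.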
@deeenes wrote:
Buffy looks great, it's definitely an option we should consider. It has many of the features that we need, maybe we have to do some adjustments, but that might be cheaper than developing from scratch.
I would love to provide/invest in some new features for Buffy for external use cases, and I also welcome contributions :) :rocket:
The features that we need are full control over all HTTP parameters: GET, POST, multipart forms, binary payload, HTTP headers (both request & response), cookies, 30x redirects, disabling ALPN, downgrading to HTTP 1.1, retries, timeouts, FTP. And the two most important:
The library should not add headers or do session management on its own; or even better, it does, but it can be disabled.
It should be able to provide detailed debug traces, like curl -vvv.
For any Buffy request, the method, header fields and request body can already be fully customized, see https://git.connect.dzd-ev.de/dzdpythonmodules/buffy/-/blob/main/buffy/buffypyclient/buffypyclient.py#L685
One spontaneous idea for an improvement would be just a new param containing a dict that will pass through any other params to requests.{http-request-method}()
But yea, the design philosophy / base idea is to provide reasonable default values wherever we can, but give the client/caller full control over the request params if needed. Any suggestions for improvements are welcome :)
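The pass-through idea could look roughly like this (a hypothetical sketch, not Buffy's actual API; `fetch` and `requests_kwargs` are made-up names):

```python
import requests

def fetch(url, method='get', requests_kwargs=None):
    # Hypothetical pass-through: everything in `requests_kwargs` goes
    # straight to the matching requests.<method>() call, so the caller
    # keeps full control over headers, timeouts, cookies, etc.
    requests_kwargs = requests_kwargs or {}
    return getattr(requests, method.lower())(url, **requests_kwargs)
```

The wrapper then never has to grow a dedicated parameter for every HTTP option the underlying library supports.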
Looking at Buffy, two important points: can we start the server automatically, and without docker?
Easy :)
```python
from multiprocessing import Process
from buffy.buffyserver.main import start_buffy_server

# Returns a (backend_process, api_webserver_process) tuple of Process objects
backend_process, api_webserver_process = start_buffy_server()
```
And same for redis?
Maybe, but i guess that will be a dirty hack :)
Also, if redis is a software to be independently installed, that complicates our installation, shall we add a Python key-value storage backend?
YES! redis is just my go-to prototyping database. Actually, it is bad at being a database for Buffy :) Redis is better at being a f**** fast cache than at being a persistent DB. I designed Buffy with an exchangeable storage backend in mind. The interface is fairly manageable: https://git.connect.dzd-ev.de/dzdpythonmodules/buffy/-/blob/main/buffy/buffyserver/backend/storage/interface.py (I just realized that there is still some deadwood and some missing docstrings in the interface; I will improve that soon™.) Writing a sql/neo4j/mongodb drop-in replacement module will be one day of work, I guess. As an example: the redis implementation is ~260 lines: https://git.connect.dzd-ev.de/dzdpythonmodules/buffy/-/blob/main/buffy/buffyserver/backend/storage/redis_storage.py
A Python-based key-value store sounds very attractive, but I am a little bit afraid regarding performance. Let's maybe do an evaluation, and let's see which solutions are on the "market" already. A Python-native solution that allows Buffy to run in one context would be great!
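As a starting point for such an evaluation, the standard library already ships a persistent option: sqlite3. A minimal key-value sketch (my own illustration; a real backend would implement Buffy's storage interface linked above, not this ad hoc API):

```python
import sqlite3

class SqliteKV:
    """Tiny persistent key-value store on top of stdlib sqlite3.

    Illustration only: shows that a Python-native, zero-install
    backend is feasible, not how Buffy's interface would wrap it.
    """

    def __init__(self, path=':memory:'):
        self.con = sqlite3.connect(path)
        self.con.execute(
            'CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)'
        )

    def set(self, key, value):
        # Upsert: insert, or overwrite the value if the key exists.
        self.con.execute(
            'INSERT INTO kv (k, v) VALUES (?, ?) '
            'ON CONFLICT(k) DO UPDATE SET v = excluded.v',
            (key, value),
        )
        self.con.commit()

    def get(self, key, default=None):
        row = self.con.execute(
            'SELECT v FROM kv WHERE k = ?', (key,)
        ).fetchone()
        return row[0] if row else default

store = SqliteKV()
store.set('cache/url-hash', b'response bytes')
print(store.get('cache/url-hash'))  # b'response bytes'
```

Whether sqlite3 is fast enough under concurrent server load is exactly the kind of question the evaluation would have to answer.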
Are there any updates on this topic?
sorry, a little busy atm with other stuff. i have your use-case on my todo list, but i don't know if i'll manage to write a test case for it this year.
Regarding an alternative storage implementation to replace redis (via Docker): atm i don't have many reasons to work on that :D i would prefer to enable your framework to handle containers. That would also open up a whole new universe of tools.
No problem at all. Sorry for disturbing you @motey, it's an open source project after all, you may not have time to look at it. I just wanted to learn the status so I can act on it.
There should be an async option, since some API requests take long to respond and we may have 14000 more requests to make, which is a real case with the KEGG database. I tried to implement the __await__ method on the Curl class, but I wasn't successful, as the curl module is monolithic and obsolete.
Other than that, there are so many arbitrary options and so much redundant code in Curl. I think it violates the simplicity rule: do one thing and do it well. I expect it to just connect to the web and return a response, so I can handle the outcome. In contrast, this module does many complex things that shouldn't be its responsibility, and it doesn't fit everything. I myself struggled with a different problem on every new database I tried to add. There were some connections which couldn't be made with our curl module, but could be made with 3 different libraries: the external pycurl library, the external requests library, and the standard urllib library.
And, for example, aiohttp would do nicely for async calls. I understand that there may be a need for caching; I don't know the caching capabilities of these libraries, but I think we should keep things small and modular.
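For reference, the __await__ mechanism itself is simple once the underlying I/O is non-blocking; the hard part is exactly that the curl module's I/O is not. A minimal sketch of an awaitable download class (hypothetical names; the network call is stubbed):

```python
import asyncio

class AsyncCurl:
    # Hypothetical sketch of an awaitable downloader class.
    def __init__(self, url):
        self.url = url
        self.result = None

    async def _fetch(self):
        # Stand-in for a real non-blocking HTTP request (e.g. aiohttp).
        await asyncio.sleep(0)
        self.result = f'payload from {self.url}'
        return self

    def __await__(self):
        # Delegating to a coroutine makes `await AsyncCurl(url)` work.
        return self._fetch().__await__()

async def main():
    c = await AsyncCurl('https://example.org')
    return c.result

print(asyncio.run(main()))  # payload from https://example.org
```

With a blocking backend like pycurl, __await__ would instead have to hand the work off to a thread (e.g. asyncio's run_in_executor), which is why retrofitting the existing monolithic class is awkward.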
I said 'I' throughout this issue and it reflects my own ideas, but this topic has been discussed in a meeting with @slobentanzer and our professor. We agree that this topic may need another meeting.
So, is curl going to be replaced in the near future? How should we start the change? Should we continue using it?