pramsey / pgsql-http

HTTP client for PostgreSQL, retrieve a web page from inside the database.
MIT License

async calls #105

Open merlinm opened 5 years ago

merlinm commented 5 years ago

Hi. Right now I'm working on an entirely stored-procedure-based coordination framework called pgtask. Think of it as a centralized cron for controlling database task execution in complex environments; I believe there's a technical niche there. Anyways, I'm interested in being able to tap web services in addition to targeting remote database calls. For this to work in my architecture, I'd need some kind of asynchronous API; something along the lines of dblink_send for asynchronous querying (which I use to send queries, since the coordinator is single-threaded).

An ideal API would be able to do something like:

- http_send(name, ...): send a named HTTP request
- http_request_complete(name): is the HTTP request resolved?
- http_get_request(name, timeout, ...): get a response, blocking until the timeout if not done
- http_wait_for_request(timeout): wait for the first available request to complete, or until the timeout (an epoll-like interface to prevent polling/looping in SQL; dblink really needs this)
- http_cancel_request(name, ...): cancel an outstanding request
- http_list_requests(): get a list of outstanding HTTP requests
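To make the shape concrete, here is a hypothetical usage sketch; none of these functions exist in the extension today, and all names and signatures come from the list above:

-- Hypothetical: fire off two named requests without blocking.
SELECT http_send('req_a', 'http://example.com/a');
SELECT http_send('req_b', 'http://example.com/b');

-- Hypothetical: block until the first one completes, or 5 seconds pass.
SELECT http_wait_for_request(5000);

-- Hypothetical: collect each response, blocking up to 1 second if needed.
SELECT * FROM http_get_request('req_a', 1000);
SELECT * FROM http_get_request('req_b', 1000);

-- Hypothetical: give up on anything still outstanding.
SELECT http_cancel_request('req_b');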

Food for thought. I don't know if this API (especially the wait_for_request call) is technically feasible, but it would make this a much more usable extension, since database queries essentially run in a single-threaded environment. Anyways, thanks for your work!

pramsey commented 5 years ago

MM, yep, that's been a gleam in my eye for a long while, and thanks for the API sketch! There's some facility for this in libcurl, in terms of an async API, but I haven't fully figured out how to tie it into the PgSQL scheme nicely, since once you start up an async request, the whole who-owns-it question comes up...

I was thinking about an API based on callbacks rather than a queue, so your idea bears some thought. Something for me to ponder on my ride tomorrow...

merlinm commented 5 years ago

Great -- thanks for the response. The challenge with callbacks is that there is no way (that I know of) to handle callbacks at the SQL level. The idea here is to implement a wrapper around something like epoll_wait() (http_wait_for_request) with a timeout. Upon receiving an interesting event, control would be returned to the SQL level with enough information that you wouldn't have to loop over the outstanding connections to determine which one is ready to return data.
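For illustration, the dispatch loop that pattern enables might look like this in PL/pgSQL; everything here is hypothetical, including the assumption that http_wait_for_request() returns the name of a completed request (or NULL on timeout):

DO $$
DECLARE
  done_name text;
BEGIN
  LOOP
    -- Hypothetical call: wait up to 1 second for any request to complete.
    done_name := http_wait_for_request(1000);
    EXIT WHEN done_name IS NULL;  -- timeout, nothing became ready
    -- We know exactly which request is ready; no need to loop over
    -- every outstanding connection asking "are you done yet?"
    RAISE NOTICE 'request % completed', done_name;
  END LOOP;
END $$;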

pramsey commented 5 years ago

Which I guess would be OK, but users would have to two-step all their HTTP work... since you are building a scheduler that makes sense for you :) but other folks might find it a little fiddly to make use of, no?

merlinm commented 5 years ago

Well, it's useful in any single-threaded application. If you look at asynchronous stacks like node, you can see that most I/O works this way. This type of work is not super common in database coding, but since you made this project I think we can both agree that this is an underutilized frontier of development.

In general, robust programming of I/O requests outside the system requires either threads or asynchronous behavior. Threads are right out in stored procedures, so that leaves async. Synchronous requests are easier to code and many people prefer them, but they are dangerous; you're blocking the thread and you have no way to deal with the situation other than a crude timeout. So even in scenarios where there would only be one outstanding query at a time, you'd still be better off using an asynchronous call if you wanted precise error handling.

The perfect analogy here is dblink; I use it a lot to transfer data between databases, and if the operation needs to meet certain guarantees in terms of robustness, it usually ends up being run over the asynchronous call (mainly so I can exploit cancellation, but there are other reasons).

dblink really ought to have a timeout on the is_busy call and something like dblink_wait_for_query to push the polling out of the SQL level, but the API is good enough to work. The synchronous model you've built is fine and will certainly be good enough for most users, but really strong I/O libraries will almost always have some kind of contrivance so the calling code can maintain control of the execution state.
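For reference, the async dblink pattern being described looks like this (the dblink functions are real, from contrib/dblink; the connection string and query are placeholders):

SELECT dblink_connect('conn', 'dbname=remote_db');
SELECT dblink_send_query('conn', 'SELECT count(*) FROM big_table');

-- This is the SQL-level polling that a dblink_wait_for_query() with a
-- timeout would eliminate: returns 1 while running, 0 when done.
SELECT dblink_is_busy('conn');

-- The calling code keeps control: it can cancel a runaway query...
SELECT dblink_cancel_query('conn');

-- ...or, once is_busy returns 0, collect the result.
SELECT * FROM dblink_get_result('conn') AS t(n bigint);
SELECT dblink_disconnect('conn');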

merlinm commented 5 years ago

Reading up on the libcurl API: the library supports async requests without threading (which would be required for this approach to be technically feasible). Here's an example I found: https://curl.haxx.se/libcurl/c/multi-app.html

In the API sketch above, if you were to do this, http_wait_for_request would probe into the select() and continue to load up the result until the SQL-level timeout was hit or the still_running flag indicated that the request was complete. select() is of course polling under the hood, so this would not scale to a huge number of concurrent requests, but it ought to be much more efficient than an SQL-level poll. Thanks for listening; curious to hear your thoughts.

slavanap commented 4 years ago

Just wanted to ask why the http_get SQL function does not use the PARALLEL keyword in its definition, but found this.

merlinm commented 4 years ago

yeah, PARALLEL != async. PARALLEL means the planner can spawn multiple workers running the same operation within a query, in situations where it thinks that's a good idea. Async means you kick the operation off in the background and do other things while the operation is in progress. Workarounds today would be to use dblink or pg_background.
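As a concrete sketch of the dblink workaround: open a loopback connection to your own database and run the HTTP call over it asynchronously (connection string and URL are placeholders):

-- Loopback connection into the current database.
SELECT dblink_connect('bg', 'dbname=' || current_database());

-- The HTTP request runs in a second backend; this call returns at once.
SELECT dblink_send_query('bg',
  $q$ SELECT status, content FROM http_get('http://example.com/slow') $q$);

-- ... do other work here, then block to collect the response.
SELECT * FROM dblink_get_result('bg') AS t(status int, content text);
SELECT dblink_disconnect('bg');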

deem0n commented 4 years ago

Hi all, I just want to share my experience with PARALLEL execution. You can try to do

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
explain select * from http(...) UNION ALL select * from  http(...);

It should show something like (look for Parallel Append)

 Gather  (cost=0.25..26.96 rows=4000 width=104)
   Workers Planned: 3
   ->  Parallel Append  (cost=0.25..26.96 rows=1292 width=104)
         ->  Function Scan on mltv  (cost=0.25..10.25 rows=1000 width=104)
         ->  Function Scan on mltv mltv_1  (cost=0.25..10.25 rows=1000 width=104)
         ->  Function Scan on mltv mltv_2  (cost=0.25..10.25 rows=1000 width=104)
         ->  Function Scan on mltv mltv_3  (cost=0.25..10.25 rows=1000 width=104)

But when http() gets called from a parallel plan, you might get an error:

DEBUG:  pgsql-http: queried 'http://127.0.0.1:8200/my/url/path'
DEBUG:  pgsql-http: http_return '56'
ERROR:  Failure when receiving data from the peer

It works fine in 10% of cases and I see this error in 90% of cases.

Is it possible to make http PARALLEL SAFE?
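For what it's worth, the marking itself would just be a catalog change; whether the functions actually behave safely in parallel workers is the real question. A sketch, assuming the extension's default signatures:

-- Experiment only: tell the planner these are safe in parallel workers
-- (argument types assumed; check \df http* for your installed signatures).
ALTER FUNCTION http(http_request) PARALLEL SAFE;
ALTER FUNCTION http_get(varchar) PARALLEL SAFE;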

pramsey commented 4 years ago

Two confusing things here. First, none of the functions are marked as PARALLEL SAFE, so I wonder why you get a parallel plan at all. Second, the HTTP calls happen in distinct worker processes, so I wonder why the calls fail; it's not like they are running in the same memory space on different threads or something. It should be no different than firing off multiple copies of curl at once.

deem0n commented 4 years ago

Hi, I am trying to reproduce my prod env and isolate the problem. Please check the gist on how to make parallel HTTP requests in PostgreSQL. It seems to work fine for GET requests! https://gist.github.com/deem0n/37f63978773d12a8de61289900c5cc03

Basically, I found a glitch when:

  1. we make POST requests
  2. curl is called from the parallel plan by the PostgreSQL engine
  3. curl gets a 500 status and error 56 (when normally we should be getting 200 OK)

Instead of returning a normal response with status 500, it produces an exception.

DEBUG:  pgsql-http: http_return '56'
ERROR:  Failure when receiving data from the peer

And finally, if we disable the parallel plan, then we get 200 OK responses.

pramsey commented 4 years ago

So, thank you for that gist. I tried it out, and it worked fine... parallel plan and all; the 7-wide parallel query ran and returned 7 results nice and fast. Could there be something else going on in your system?

deem0n commented 4 years ago

Thanks for verification!

Actually, it could be something with the network settings; it is very hard to debug. I'll assume that pgsql-http works fine, and if not, I will post a detailed report on why.

pramsey commented 2 years ago

For people interested in this topic, it's possible to get async HTTP queries by combining this extension with the pg_background extension.
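A minimal sketch of that combination (the URL is a placeholder; pg_background_launch() and pg_background_result() are that extension's documented entry points):

-- Launch the HTTP call in a background worker; returns the worker pid
-- immediately, without blocking this session.
SELECT pg_background_launch(
  $q$ SELECT status, content FROM http_get('http://example.com/slow') $q$);

-- ... do other work, then block until the worker finishes and fetch
-- its result set, using the pid returned above (12345 here).
SELECT * FROM pg_background_result(12345) AS t(status int, content text);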

riderx commented 2 years ago

@pramsey do we have an example somewhere? I struggle to understand how to mix them.