lpsmith / postgresql-simple

Mid-level client library for accessing PostgreSQL from Haskell
Other
206 stars 71 forks source link

Use asynchronous calls to libpq #25

Open lpsmith opened 12 years ago

lpsmith commented 12 years ago

After a conversation with @joeyadams, I believe postgresql-simple ought to use asynchronous calls to libpq. We would then use threadWaitRead and threadWaitWrite to schedule asynchronous calls using GHC's IO manager, instead of using blocking C calls and having the OS kernel do the scheduling.

The performance impact of such a change is unclear, but the one clear advantage is that System.Timeout could then be used with connect, query, execute, etc. System.Timeout cannot be used with blocking C calls because GHC's runtime is not in control of the thread and cannot interrupt the thread until the call returns.

lpsmith commented 12 years ago

One thought does occur to me, that this could be done inside postgresql-libpq instead of postgresql-simple, to reimplement all the blocking libpq functions in terms of their non-blocking counterparts. I'm not sure how good or bad of an idea this is, but everybody using postgresql-libpq would then reap the advantages (and disadvantages) of this change.

Perhaps it would make sense to include these reimplementations in a separate module? Maybe we'd use two modules, with an identical interface:

 Database.PostgreSQL.LibPQ.Blocking
 Database.PostgreSQL.LibPQ.NonBlocking 

Then Database.PostgreSQL.LibPQ would re-export one module or the other (perhaps controlled by a .cabal flag).

snoyberg commented 11 years ago

As a single data point, a huge +1. I just ran into a bug (in my code, no postgresql-simple) which would have been very well served by the ability to call timeout.

lpsmith commented 11 years ago

Well, Joey has put in a fair bit of effort into researching these issues on Windows, and it turns out that threadWaitRead and threadWaitWrite don't work on network sockets (and perhaps files?) on Windows, and worse, that this isn't easily fixable because some very UNIX-y assumptions are baked into the interface. (In particular, network sockets aren't file descriptors.)

Another alternative would be to make the blocking calls to C interruptible, which would work on most unixes and Windows, but this requires a new-ish version of GHC and Windows Vista or newer. (Sadly, supporting WinXP is still very relevant for a lot of people, potentially even me.)

I stubbed out a build approach to reimplementing libpq's blocking calls in postgresql-libpq here: 96c82197b558019de44cc4b5ef728b95de99f0ab

There isn't any useful code at this point, this is just a way to cut down on CPP hackery in supporting both unix and windows. Then we just change all the import qualified LibPQ as PQ to import LibPQ.GHC as GHC in postgresql-simple.

I suppose we could just export the reimplementations from LibPQ directly. On the upside, it would result in slightly smaller code objects as we wouldn't need to include all the bindings to the blocking libpq calls. On the downside, I don't know.

In any case, it really seems that we need no less than three different bindings to the blocking libpq calls, one that implements the blocking in the GHC runtime via non-blocking libpq, one that implements them as interruptible C calls to blocking libpq, and one that implements them as vanilla safe C calls to libpq. Perhaps we could then have some autodetection magic to select which one to actually compile and export from LibPQ, with flags to override the autodetection scheme. (e.g. if you are compiling a program on Vista that you want to be able to run on XP or Windows Server 2003 as well, or for some reason you prefer to use one of the other options on unix, as they'd all work)

lpsmith commented 11 years ago

Actually, does anybody know if the interruptible FFI calls are "harmless" (other than not actually being interruptible) on older versions of Windows? Then the only downside would be bumping the required version of GHC to 7.2, which probably isn't a big deal for most Haskellers. I imagine requiring Vista would be much more problematic.

joeyadams commented 11 years ago

On Sat, Jul 20, 2013 at 10:58 AM, Leon P Smith notifications@github.comwrote:

Well, Joey has put in a fair bit of effort into researching these issues on Windows, and it turns out that threadWaitRead and threadWaitWritedon't work on network sockets (and perhaps files?) on Windows, and worse, that this isn't easily fixable because some very UNIX-y assumptions are baked into the interface. (In particular, network sockets aren't file descriptors.)

Actually, there are a few ways to wait on a socket in Windows. PostgreSQL sets the connection to non-blocking, and Winsock provides a couple ways to wait for a socket:

For the purposes of postgresql-libpq, calling select() with a timeout of 1 second or so would actually be reasonable. It's simple, and the wait can be interrupted (though with a small delay).

Thus, interruptible queries are actually pretty easy to support on Windows. We can bypass the brokenness of GHC's network I/O support on Windows, and we don't have to worry about scaling to thousands of connections since applications usually don't need to connect to thousands of databases.

Actually, does anybody know if the interruptable FFI calls "harmless" (other than not actually being interruptable) on older versions of Windows? ...

On Windows Vista and later, interruptible FFI calls 1 are implemented with CancelSynchronousIo 2. This means whatever IO libpq is doing will fail with ERROR_OPERATION_ABORTED. I don't know if libpq will handle it well, though; my guess is that it will treat the connection as bad thereafter.

On Linux, it uses pthread_kill with SIGPIPE. If this interrupts libpq logic instead of a blocking system call, it could leave the connection in an invalid state, but I don't know if this is possible.

Either way, what exactly should interrupting a query with an exception do? (1) Tell the server to cancel the query, (2) just stop waiting for a result and assume the connection died, or (3) some combination of the two? Interrupting the blocking I/O would only do (2).

lpsmith commented 11 years ago

I meant that you can't simply fix threadWaitRead and threadWaitWrite on Windows. =) In any case, thanks for the recap.

After thinking about it a while, I'm inclined to make assuming the connection died (2) the default/only behavior directly supported by postgresql-libpq. It should be able to attempt to cancel the query and recover the connection (1) from Haskell client code, at least when using non-blocking libpq calls and the GHC IO manager. But as you point out, this might not be possible when using interruptible ffi calls. The interaction with libpq and interruptible calls sounds like a good question for pgsql-hackers or #postgresql. Anybody feel like asking?

ocharles commented 10 years ago

Is there any update on this? We just hit a problem in our test suite where one connection is blocked on TRUNCATE because another connection has a transaction open. If a context switch could happen, everything would be fine - but the TRUNCATE blocks that and the program hangs indefinitely. While I think -threaded solves this problem, it was a bit annoying to have to find that out the hard way.

lpsmith commented 10 years ago

@ocharles, it sounds like you forgot to link your executable -threaded, as the particular issue you describe shouldn't be a problem if you are using the threaded runtime.

lpsmith commented 10 years ago

That said, this is still a good idea, but no I haven't worked on it.

ocharles commented 10 years ago

Yea, I found that out just after adding the comment (typical!) - I've amended my earlier comment to mention that.

lpsmith commented 10 years ago

Heh, it happens. =)

There are a few details to hash out to using libpq asynchronously. As I said before, I think the most reasonable option is to assume that the connection is dead on timeout or other async exception, but the question is then how can the connection be cleaned up in a timely fashion, preferably as cleanly as possible? I certainly wouldn't want to leave backend processes running on the server for very long.

So, for example, is it ok to call PQfinish on the connection while some other asychronous operation is ongoing?

rjmac commented 10 years ago

I've taken a look at implementing the libpq blocking api in terms of non-blocking calls (without bothering about the Windows issues) to see what would be involved, and I don't think it can be done with full fidelity using only the public API. It's can certainly be done by poking into PGconn and PGresult's contents, but then you've lost compatibility guarantees across different versions of libpq.

lpsmith commented 10 years ago

Very interesting, thank you. It's unfortunate that you can't fully implement this without delving into the internals a bit, although perhaps this is acceptable if the benefits are great enough. After all, at least on my machine the libpq-int.h header is already there, and just by reimplementing a few functions we are a little bit committed to keeping implementations in sync anyway. (Which version of libpq did you use to reimplement execStart, by the way? Do you know if there are any differences among the 9.* versions?)

I suppose the alternatives are to either let postgresql-libpq clients reimplement this on an ad-hoc basis, which seems rather undesirable, or release a new module that implements a not-quite-exec in terms of the public interface. (Whether or not this module would be part of postgresql-libpq or a separate package I suppose is another question.)

lpsmith commented 10 years ago

I'd really like to see some benchmarks of @rjmac's work. Does anybody have some code that's suitable? I can't say that I have anything suitable at the moment.

We really need something database-bound, both in a single-threaded and concurrent setting.

rjmac commented 10 years ago

I used the latest git source for execStart, but it hasn't been modified in a substantial way since 2011 when bidirectional COPY was added, sometime between 9.1 alpha 2 and 9.1 alpha 3. For binary compatibility I'd be much more worried about the way it pokes into PGconn and (if PQexec's error-reporting behavior is to be duplicated) PGresult, since if the layout of those structures change between releases, postgresql-libpq would have to be recompiled to match. That doesn't exactly have high churn either — struct pg_conn last changed in a way that would break binary compatibility in 2012 — but it does happen occasionally.

libpq-int.h looks like it's a standard part of any dev install of libpq, but there's a big scary warning at the top of the file about how it doesn't have stability guarantees across releases.

For what it's worth, I didn't do any benchmarking. I was only curious about whether it could be done, not about performance.

lpsmith commented 10 years ago

Ok, I think I fixed this issue (in postgresql-simple, not postgresql-libpq) with commits 45ccff4be6bcc8e0f4d70efdaf23a5028a50588c and 83d3c38386af90690f56bb8501167bc74180b06b. This takes care of exec and connect, and I don't think there are any other blocking calls to libpq, but I could be wrong. (Edit: I am wrong. The LargeObject module still has a bunch of blocking FFI calls, and the Copy module is mostly async but does have a few that potentially block. Also, sendQuery is also blocking in the commits referenced above.)

I don't really have a proper benchmark, but I did time the test suite and any difference was easily lost under the timing noise on my laptop. So if there is a performance regression, it's not a huge one.

This was an expedient fix, so unfortunately not all users of postgresql-libpq will benefit from this change. But there certainly are a lot more issues to work out with that approach.

lpsmith commented 10 years ago

Actually, it turns out that sendQuery is still blocking, so while this issue is mostly resolved (and in fact appears to work well enough for my particular use case), it would appear that there is still more work to be done.

I would guess that sendQuery would normally only block if the query is large enough that it doesn't fit inside the network socket's write buffer. One could probably demonstrate the weakness in the current implementation by establishing a connection, using iptables or another means to quietly break the connection, trying to send a large query, and then trying to interrupt the thread that's sending the query.

This is fixable, using PQsetnonblocking, however, the libpq docs are a little sketchy about how exactly to distinguish PQsendQuery's equivalent of EWOULDBLOCK from a genuine error.

rjmac commented 10 years ago

From what I can tell by inspection, if PQsendQuery returns 0, you need to call PQflush and examine its result to find out if it's an actual failure. 0 means "fully sent, you can await the response", "1" means "not done writing yet, wait until the socket is writable then try flushing again", and "-1" means "an error occurred".

lpsmith commented 10 years ago

Hmm, thanks rjmac! That is in the documentation of PQflush, but I'd missed that, and the connection between PQflush and a non-blocking PQsendQuery isn't as obvious as perhaps it should be.

Reviewing your code again more carefully, it doesn't appear that you do this in your code either, at least not yet. I do think your attempt is interesting, and worth further consideration at some point, but ultimately for me it came down to that I needed to be able to time out database heartbeats and connection attempts. (Although I suppose I could have used joeyadams' network-socket-options package to do this as well.) I do worry slightly about additional pain caused to users who might not be fully aware of the ABI dependencies we'd be introducing, but all in all the advantages to your approach seem to outweigh that disadvantage.

rjmac commented 10 years ago

It does do that, actually, though it's hidden among all the other fully-emulate-blocking-mode noise -- it's in flushOutput, which is called right at the top of execFinish.

lpsmith commented 10 years ago

Oh I see. Nice. =)

Well, anyway a conversation on #postgresql did turn up a bug in libpq that may be a good reason for postgresql-simple to avoid going setnonblocking True full-time: namely PQputCopyData only says that it would block when it can't reallocate the buffer, which means it's difficult to impossible to feed data to PQputCopyData at the proper rate in non-blocking mode.

So perhaps it would make sense to setnonblocking True at the start of the connection and then bracket PQputCopyData with a setnonblocking False ... setnonblocking True.

lpsmith commented 10 years ago

Also, see #119 as an example of an issue that's arisen from my quick-and-dirty re-implementation of exec. There's definitely a bit of subtlety when dealing with the COPY case that wasn't immediately apparent from the libpq documentation. (And, the fact that this is an issue at all seems rather strange.)

lpsmith commented 10 years ago

Ugh, I think I'd prefer to ditch my quick-and-dirty async implementation and just use rjmac's, but the one thing that makes me hesitate is the fact that in order to compile his code, I needed to also install the server-dev header files as libpq-int.h itself includes some of these files. Which isn't any big deal for me, but it would make postgresql-libpq less user friendly.

rjmac commented 10 years ago

My main worry about doing things this way is less user-friendliness -- libpq-int.h's dependencies are a package install away on most platforms -- than with binary compatibility. The only really dangerous thing it does (if I recall correctly; it's been a while) is poke at the pg_conn's asyncStatus field directly. Everything else is via C accessors which are less likely to break despite not being technically part of the public API. But that one direct poke could cause really hard-to-diagnose problems across even minor libpq versions.

I wonder if the postgresql people could be convinced to add an accessor for it.

lpsmith commented 10 years ago

Well yes, that's a concern of mine too. I guess it's mitigated in my mind a bit by the fact that it doesn't change often. But yes, the consequences of such a change are pretty bad.

Perhaps it would be a good idea to check the version with PQlibVersion and then complain loudly or error out if it doesn't match the version it was compiled against? This would be way overly conservative, but it may also be the best we can do.

Adding an accessor may be a good idea, as we could be less conservative in the future. However, are there any symbols exported from a vanilla libpq shared object that aren't part of the public interface? I ask because I started playing around with the first steps towards a native haskell implementation of the protocol, using libpq to connect, but pqsecure_write and friends aren't linkable from the shared object even though they are in the libpq-int.h header file. (So I had to compile my own libpq for the purposes of that preliminary experiment.)

I can't imagine the postgres folk would be too enthusiastic about such a symbol, but perhaps it's worth pointing out that exec cannot be faithfully re-implemented using the public interface alone, and finding a solution for that.

Regarding user-friendliness, I agree that I don't think it's going to be an issue on most unix platforms, though people will probably be caught a little off-guard and it might take them a few minutes to track down the issue. Windows, on the other hand, seems to be the real problem point, but I'm not familiar enough with the compilation issues there.

(But what am I saying? This code isn't at all usable on Windows for the foreseeable future, so it's not an issue on Windows. So... yeah I agree, user-friendliness isn't much of an issue.)

lpsmith commented 10 years ago

Ok, if we want to compare versions, the compile-time version appears to be defined in pg_config.h, the link-time version is available by calling PQlibVersion. At least on Debian, libpq-dev provides pg_config.h as well as libpq-int.h, so it would appear as though it's fairly safe to assume the defined version matches the version of libpq-int.h.

lpsmith commented 10 years ago

On the other hand, going this route would also require recompilation of postgresql-libpq every time libpq is upgraded. Debian packages an older version of postgresql-libpq, and those package maintainers might not appreciate these sorts of shenanigans, as it would certainly impact their job in some less-than-desirable ways.

NixOS may be better suited to this arrangement, and would certainly be more sympathetic as well.