haskellari / postgresql-simple

Mid-level client library for accessing PostgreSQL from Haskell

Fixes #69 #71

Open mzabani opened 3 years ago

mzabani commented 3 years ago

I still haven't had time to test this with a project that makes extensive use of postgresql-simple, but I wrote a test that fails without this PR and passes with it.

I thought I'd push this because I see some activity on investigating this problem, and maybe this PR fixes it, so I hope it is helpful. I also hope to have more time to improve this in any way we/you think would be good over the weekend.

Thoughts, corrections or even "you've got it all wrong" are very much appreciated; it's my first time contributing to a project like this.

mzabani commented 3 years ago

I tested this during the week with a project and it seems to have worked well. I'm not sure what the proper review procedure is, but I see @phadej in a lot of other PRs, so I'm citing you here too. Sorry for the unsolicited request; please let me know if there are others I should request review from for this.

mzabani commented 3 years ago

> What if another async exception is thrown immediately, interrupting cancelAndClear? It probably should be masked.

Good point. I've never used libpq directly before, so my understanding here is very limited. I'm reading https://www.postgresql.org/docs/current/libpq-cancel.html and https://www.postgresql.org/docs/current/libpq-async.html but it's not clear to me whether cancelAndClear might block indefinitely (or for too long). Could that be a problem?

> I also miss context. If the original issue was with using a pool, isn't the problem that this invalid connection isn't removed from the pool? I.e. isn't a solution just to not use that connection anymore?

I didn't follow the pool discussions closely, to be honest. But before this PR, there are at least two actions that can lead to problems, IIUC:

phadej commented 3 years ago

> might block indefinitely (or for too long). Could that be a problem?

I don't know. If the idea is to kill a thread, maybe?

mzabani commented 3 years ago

> `try someQuery` is a non-issue. You shouldn't catch asynchronous exceptions with try. (Use safe-exceptions etc.)

I think I started out with a bad example. My focus should've been on timeout: an existing, reasonable function found in production code that uses asynchronous exceptions internally, and whose use can leave the connection in an invalid state.
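To make the failure mode concrete, here's roughly the shape of the problem (a sketch, assuming a reachable test database; pg_sleep stands in for any long-running query):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple
import System.Timeout (timeout)

main :: IO ()
main = do
  conn <- connectPostgreSQL "dbname=test"
  -- timeout delivers an asynchronous exception after 1s, abandoning the
  -- query while libpq still considers a command to be in flight.
  _ <- timeout 1000000 (query_ conn "SELECT 1 FROM pg_sleep(10)" :: IO [Only Int])
  -- Without this PR, the next use of the connection fails with
  -- "libpq: another command is already in progress":
  _ <- query_ conn "SELECT 1" :: IO [Only Int]
  close conn
```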

> queries in exception handlers also seems contrived.

Maybe. But would we want to disallow this? It seems perfectly valid to do. Also, doesn't this very library do exactly that to execute ROLLBACK in withTransactionModeRetry'? Logging is another thing off the top of my head that someone might want to do (see the sketch below).
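For reference, a minimal sketch of that pattern (rollbackOnError is a made-up name, but this is roughly the shape of it: a query issued from inside an exception handler):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (onException)
import Database.PostgreSQL.Simple (Connection, execute_)

-- Issue a ROLLBACK from the exception handler of an action, roughly
-- the way the library's transaction helpers do it.
rollbackOnError :: Connection -> IO a -> IO a
rollbackOnError conn act = act `onException` execute_ conn "ROLLBACK"
```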

> might block indefinitely (or for too long). Could that be a problem?

> I don't know. If the idea is to kill a thread, maybe?

I don't have much experience with exception handling in such cases. But thinking about it now, it seems that if libpq can block or take a bit longer in such cases, we're already susceptible to that behavior with this PR. Given that, we might as well mask the exception handler. What do you think?

mzabani commented 3 years ago

I was looking into masking the exception handler and discovered in https://hackage.haskell.org/package/base-4.14.1.0/docs/Control-Exception.html that "There's an implied mask around every exception handler in a call to one of the catch family of functions.", which also seems to apply to onException. If interruptible masking is what we want, this PR should be good as it is - and interruptible masking seems to be what this library does elsewhere.

But perhaps you meant we should use uninterruptible masking? I must admit my understanding of masking, interruptible operations and asynchronous exceptions is too much on the "short" side to make a decision on this. The experiment below is how I've been trying to build intuition.
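This is a small self-contained experiment (nothing postgresql-simple-specific; the threadDelays stand in for a long query and for a slow cleanup, and the timings in the comments are my understanding of the docs, not gospel):

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (onException, uninterruptibleMask_)
import System.Timeout (timeout)

main :: IO ()
main = do
  -- Implied interruptible mask (what catch/onException give us): the
  -- outer timeout can still interrupt the slow cleanup, so this
  -- finishes at ~2s.
  _ <- timeout 2000000 $ timeout 1000000 $
         threadDelay 30000000 `onException` threadDelay 10000000
  putStrLn "interruptible cleanup: cut short by the outer timeout"
  -- uninterruptibleMask_ around the cleanup: the outer timeout cannot
  -- deliver its exception until the 10s cleanup completes, so this
  -- finishes only at ~11s.
  _ <- timeout 2000000 $ timeout 1000000 $
         threadDelay 30000000
           `onException` uninterruptibleMask_ (threadDelay 10000000)
  putStrLn "uninterruptible cleanup: outer timeout had to wait"
```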

mzabani commented 3 years ago

After reading https://github.com/fpco/safe-exceptions/issues/3 (but admittedly not having read the entire conversation too closely) and https://gitlab.haskell.org/ghc/ghc/-/issues/18899, it seems to me that:

  1. If our cleanup action can take too long, that is bad enough already: it's very likely that in many cases a second asynchronous exception will not be sent to our thread, and thus it will block for a possibly long time in most use cases anyway. ~~This speaks in favor of uninterruptible masking the cleanup handler.~~ FUTURE EDIT: Not really, it's at most neutral; see the rest of the conversation.
  2. There seems to be some consensus that cleanup handlers are better off with uninterruptible masking by default, and that is what safe-exceptions does, in fact, so it's what would happen if this library used that instead of base for exception handling.
  3. I've measured how long cancelAndClear takes to run in the test this PR adds: 1.2ms, measured over 10 runs.

So I'm more and more inclined to do uninterruptible masking.

mzabani commented 3 years ago

This PR adds 4 new tests. The first two pass on master and on this branch. The other two do not pass on master. The last one is a possibly problematic change of behavior.

  1. "Expected user exceptions" is more of a check that I wasn't severely breaking any existing behavior.
  2. "Query canceled" is also a check like the one above, but I believe it detects two distinct exceptions for queries being canceled - a problem, IMO, but one that is not related to this PR.
  3. "Async exceptions" ensures the connection remains usable after async exceptions.
  4. "Connection terminated" was created to also test if this PR preserved behavior, but then I noticed that master throws "Exception: Database.PostgreSQL.Simple.Internal.exec: resource vanished (failed to fetch file descriptor)" when pg_terminate_backend(pid) runs. This PR changes that, making it throw SqlError {sqlState = "", sqlExecStatus = FatalError, sqlErrorMsg = "server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n", sqlErrorDetail = "", sqlErrorHint = ""} instead.
    • For this test and change of behaviour it was necessary to change the type of Database.PostgreSQL.Simple.Internal.throwResultError. Please let me know if that's a problem.

Sorry for the long monologue spamming this thread, @phadej, but can I get another review? The tests seem to have failed due to unrelated intermittent problems, and I can't retry them.

Yuras commented 3 years ago

@mzabani The code itself looks great! Though I have concerns regarding the approach.

You simply can't guarantee that the connection is valid in case of an exception. Any network error will leave the connection invalid, and any subsequent query will fail. Also, if PG.cancel fails for some reason, e.g. it failed to connect to the database, the connection will be left in an invalid state. It's better not to reuse the connection at all in that case, and in fact most people already do that, including via withResource.

Also, the connection might die after it's returned to the pool but before it's reused. So application code has to be prepared to handle the case where the connection is invalid.

If you don't reuse the connection after an exception, then you don't need to roll back the transaction there. It's enough to close the connection, and the database will roll back the transaction. So IMO the real issue is withTransaction, which should not exist in its current form in the first place. If you need a transaction, you could just begin and commit it without any bracket-like operation:

```haskell
withTransactionMode mode conn act = do
  beginMode mode conn
  r <- act
  commit conn
  return r
```

Cancelling a query under uninterruptibleMask will cause unexpected delays and application freezes. Consider the case where the network connection becomes very slow, e.g. the packet drop rate increases and TCP tries to recover. Also consider the case where we are expecting a huge amount of data from the database, as in COPY TO. Or PG.cancel takes a lot of time because creating a new connection is too slow for some reason. And so on.

So I'd suggest

mzabani commented 3 years ago

> You simply can't guarantee that the connection is valid in case of an exception.

> update documentation, stating that any exception from exec and friends leaves the connection in an unknown state, and the only thing you can do with the connection after that is to close it immediately,

I think I agree with this partially. There are exceptions which do not leave the connection in an invalid state, can be properly detected, and still allow users to use the connection after that. One example is SELECT 5/0. Another is SELECT 1, 2 :: IO [(String, String)]. So I'd say:

> Cancelling a query under uninterruptibleMask will cause unexpected delays and application freezes. Consider the case where the network connection becomes very slow, e.g. the packet drop rate increases and TCP tries to recover. Also consider the case where we are expecting a huge amount of data from the database, as in COPY TO. Or PG.cancel takes a lot of time because creating a new connection is too slow for some reason. And so on.

I must admit I'm not experienced with Haskell's exception world, but wouldn't slowness under poor network conditions be a problem with or without uninterruptibleMask? Also, if the connection is dropped while canceling, wouldn't that actually be a synchronous exception, so uninterruptibleMask wouldn't change a thing? But I do agree canceling can introduce a delay that didn't exist before; I just think that's the case regardless of masking.

> If you need a transaction, you could just begin and commit it without any bracket-like operation:

This is only true if we accept that users must discard connections when an exception is thrown, though, right? Otherwise we need onException rollback (although masking does seem unnecessary there).

I think I can summarize the options we are discussing, but let me know if I misunderstand things:

  1. If we tell users they must discard connections when any exception is thrown, life is easier for us: no need for canceling anything, rollback etc.
    • But this is a breaking change. It is currently possible to use the connection after e.g. a division by 0 error, amongst others.
    • It is a far more limiting approach for users, and unnecessarily so, IMO.
    • It requires more than just telling users to discard on exceptions: we also need to tell them not to use timeout, among other library code that might use asynchronous exceptions internally.
  2. Otherwise, we must strive to keep the connection usable for some kinds of errors, which is what this PR does.
    • Specifying exactly which errors might be challenging, but the problem already exists and we're only making it less bad with this PR (a sketch of two such recoverable errors follows).
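To illustrate option 2 with the recoverable errors mentioned earlier (a sketch, assuming a test database; both failures below are ordinary synchronous exceptions, and the connection stays usable afterwards):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (SomeException, try)
import Database.PostgreSQL.Simple

main :: IO ()
main = do
  conn <- connectPostgreSQL "dbname=test"
  -- Server-side error: division by zero raises a SqlError.
  _ <- try (query_ conn "SELECT 5/0") :: IO (Either SomeException [Only Int])
  -- Client-side conversion error: ints decoded as Strings raise a ResultError.
  _ <- try (query_ conn "SELECT 1, 2") :: IO (Either SomeException [(String, String)])
  -- Both exceptions above are synchronous; the connection remains usable:
  [Only n] <- query_ conn "SELECT 42" :: IO [Only Int]
  print n
  close conn
```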
mzabani commented 3 years ago

Oh, this PR adds a minute to Windows builds due to a new test and the distinct implementation of exec... https://ci.appveyor.com/project/phadej/postgresql-simple/builds/38557708/job/oge66mh0sg5hq53i

A whole minute is not necessary for the test; it can be reduced to something like 5 seconds. But the separate code paths date from 2014, so maybe the Windows-specific code path is no longer necessary?

Yuras commented 3 years ago

> Some exceptions should not by themselves force users' hands into discarding the connection right away.

How is the user supposed to know which kind of exception they're dealing with? They need to decide whether to reuse the connection. It's OK to provide a closed list of exceptions that keep the connection valid.

> Also, if the connection is dropped while canceling, wouldn't that actually be a synchronous exception,

Not sure I understand. In the section you quoted I was talking about PQcancel trying to connect to the database, which might take unlimited time. Yes, eventually it will either succeed or throw a synchronous exception, but all this time we have to wait without being able to interrupt it.

> But I do agree canceling can introduce a delay that didn't exist before; I just think that's the case regardless of masking.

Without uninterruptibleMask you can still interrupt the cleanup with Ctrl-C or timeout. But I'm not suggesting cancelling the query without masking; I'm suggesting not to cancel it at all.

> But this is a breaking change.

It's just documenting the current behavior. Most users discard connections already anyway.

> We also need to tell them not to use timeout, among other library code that might use asynchronous exceptions internally.

Not sure I understand why. You need to be careful to make sure you don't reuse the connection, but it's possible and not that hard.

Also, it's not the case that the connection is completely invalid; you just don't know what state it is in right now. So you can run a query on such a connection, but you have to be prepared to handle failure. But you need to be prepared for that anyway, so there's no difference.

Anyway, I'm not a maintainer of this library, just a regular user. You don't need to convince me; I'm only expressing my concerns.

mzabani commented 3 years ago

I know you mentioned you're not the maintainer, but you've raised some good points and given me things to think about, so I'll take the liberty of replying to some of them because I think they're very pertinent. But by no means do I want you to feel obliged to reply, although it'd be nice to hear more of your thoughts.

> Yes, eventually it will either succeed or throw a synchronous exception, but all this time we have to wait without being able to interrupt it.

Ah, right, of course. Sorry, I misunderstood completely. I've been thinking about this over the last few days, and I've mentioned that it's still tricky to me (you can read my monologue in this thread). Reviewing my own arguments defending uninterruptibleMask, I'm not as convinced now. It feels like interruptible masking would be better due to the guarantee that the thread can be killed and the app can continue running; the downside (IIUC) of possibly having an invalid connection remains, but since that already exists and, as you said, can be dealt with by discarding the connection, it might be the better of the two options we have. I've reverted this PR back to interruptible masking, but feel free to tell me more about exception handling if it's still not as good as it could be.

It now feels to me that any time cleanup handlers do networking, disk I/O, or anything else with a good chance of being delayed by factors beyond our control, uninterruptible masking is too high a risk. Although that's a different discussion, and one beyond my depth.

> I'm suggesting not to cancel it at all.
>
> Not sure I understand why. You need to be careful to make sure you don't reuse the connection, but it's possible and not that hard.

But I think the purpose here is precisely that sometimes we want to be able to use the connection after some kinds of errors, and we need canceling in order to do that. Also, some exceptions are not thrown to our thread, so it's easy to miss them completely. Take, for example:

```haskell
timeout (1000000 * 10) $ query_ conn "SELECT very_long_running_query"
query_ conn "SELECT some_other_query"
```

I'd argue it's perfectly valid code, but if the query doesn't return in 10 seconds, the second query will fail every time (and it doesn't have to). Even more subtle is just running `timeout 10000000 $ query_ conn "SELECT long_running"` and not running any other query after the timeout elapses: the connection will be put back into the connection pool in a broken state, and the next time it's used it will fail on its very first query.

I guess my point is that although we must be prepared to handle failures at any time, there are some unnecessary ones that can be avoided, one of which this PR fixes.

Yuras commented 3 years ago

@mzabani Looks like we agreed that uninterruptibleMask is problematic here. Now I claim that not using it is problematic too. Consider the following gist:

```haskell
timeout (1000000*60) $ do
  someCode
  timeout (1000000*10) $ query_ conn "SELECT very_long_running_query"
  moreCode
```

The following scenario is possible:

As a result, the action under the outer timeout took 61 minutes instead of 1 minute. Basically we were handling an asynchronous exception and got another one, and we have to ignore one of them, which is a disaster.

Note that only the first exception needs to be asynchronous in this scenario: we'll happily ignore it even if the second one is synchronous. So ideally cancelAndClear should be executed under uninterruptibleMask and also should not throw synchronous exceptions!
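Concretely, the combination I mean would look something like this (a sketch only; it assumes this PR's cancelAndClear :: Connection -> IO () is in scope, and swallowing SomeException is deliberately blunt):

```haskell
import Control.Exception (SomeException, try, uninterruptibleMask_)
import Control.Monad (void)
import Database.PostgreSQL.Simple (Connection)

-- Run the cleanup with asynchronous exceptions fully deferred, and
-- discard any synchronous exception it throws, so that the exception
-- which triggered the cleanup keeps propagating unchanged.
safeCancelAndClear :: Connection -> IO ()
safeCancelAndClear conn =
  uninterruptibleMask_ $
    void (try (cancelAndClear conn) :: IO (Either SomeException ()))
```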

> I'd argue it's perfectly valid code, but if the query doesn't return in 10 seconds, the second query will fail every time

That's what I meant by "You need to be careful to make sure you don't reuse the connection". I claim that the code is invalid since it reuses the connection. You should either not use timeout around query at all, or use it outside of withResource:

```haskell
timeout (1000000*10) $ withResource pool $ \conn ->
  query_ conn "SELECT very_long_running_query"
```

If you really know what you are doing, then you either manually destroyResource somehow or cancel the query:

```haskell
timeout (1000000*10) (query_ conn "SELECT very_long_running_query")
  -- on timeout clean up; otherwise discard the result so both branches type-check
  >>= maybe (cancelAndClear conn) (\_ -> pure ())
query_ conn "SELECT some_other_query"
```

Yes, it makes code less composable, but there is no good way to fix it.

Yuras commented 3 years ago

Well, we somehow ended up discussing some ugly corner cases. But what's the actual issue you want to address? I think every existing library based on postgresql-simple uses resource-pool and closes the connection in case of any exception. They do it incorrectly (i.e. they try to roll back the transaction before closing the connection), but cancelAndClear might not be the correct fix for them either way.

mzabani commented 3 years ago

> Well, we somehow ended up discussing some ugly corner cases.

Quite interesting ones, though. Your last example with timeout is really interesting, and I was surprised the consequences could go beyond not cleaning things up. Makes me want to go back to uninterruptible masking... ok, let me stop before I begin.

> should not throw synchronous exceptions!

Good point, and it's a problem that I need to address, because it currently doesn't hold.

> I think every existing library based on postgresql-simple uses resource-pool and closes the connection in case of any exception

Personally, I believe connection pool users aren't the only consumers of this library, nor should we expect them to be (I've been working on a project, codd, which isn't one, for example).

But still, using resource-pool does not save you from the issue this PR tries to fix (or rather, make less likely): the `libpq: another command is already in progress` error, if the query and the exception happen in a thread separate from the "main" thread. Your examples show this too; it's user code where the problem can appear (and I'd guess where it most frequently does).

It is a problem I've seen at the shop I work at, and many others appear to have seen it too: https://github.com/lpsmith/postgresql-simple/issues/177, https://github.com/tomjaguarpaw/haskell-opaleye/issues/297 and https://github.com/yesodweb/persistent/issues/1199. It's also very tricky to diagnose, and I'd say it's bound to happen now and then (the raised issues kind of confirm this).


From reading your last two comments, do I understand correctly that you propose one of the two options below?

  1. We should document the limitations around postgresql-simple's handling of exceptions to make users aware of them. This means either not running queries inside timeout or in other threads, or being prepared to close the connection when a timeout elapses or a thread receives an asynchronous exception. No help from the type system; users must be careful.
  2. Or we should find a better way to fix the problem this PR tries to address.

I'm all for number 2, but it's clearly something I'm not currently capable of doing. My opinion is that this PR may not fully fix the underlying problem, but it will probably improve the current situation. Because anything we come up with has pros and cons, I propose we view the possible scenarios under a model of likelihood. Suppose postgresql-simple is waiting for query results, and then one or more asynchronous exceptions are thrown. Suppose the cleanup handler does not throw synchronous exceptions, except maybe when the connection is completely closed (still an open question whether that can be avoided even in that case, though).

One single asynchronous exception thrown (e.g. single timeout or killed thread):

| | Good connection to the DB | Slow connection to the DB |
| --- | --- | --- |
| Without this PR | Risks running into issue #69; no other correctness issues otherwise | Risks running into issue #69; no other correctness issues otherwise |
| With this PR, interr. masking | < 10ms added runtime*; no errors, except maybe making timeout be delayed by this added runtime wrt what was requested | Same as "good connection", but with longer delays. Other connections have a good chance of being slow too, so not necessarily a big problem per se. If the cleanup handler throws due to the connection being closed, it's very likely there's a bigger and more general connectivity problem "anyway" |
| With this PR, uninterr. masking | Same as above | Same as above |

One asynchronous exception thrown, followed by a second one (e.g. user pressing Ctrl+C, nested timeouts or other more complex async patterns)

| | Good connection to the DB | Slow connection to the DB |
| --- | --- | --- |
| Without this PR | Risks running into issue #69; no other correctness issues otherwise | Risks running into issue #69; no other correctness issues otherwise |
| With this PR, interr. masking | Only a problem if the second exception happens within the short interval during which cancelAndClear runs. But then big and bad correctness issues can arise, including timeouts not being honored at all | Pretty bad; high level of certainty of bad correctness issues arising |
| With this PR, uninterr. masking | Same as "good connection" for one single exception | User's Ctrl+C or thread killing will have to wait for a possibly long time. If things take really long, this becomes a "zombie" thread (but then there might be connectivity issues elsewhere too) and the entire application must then be killed, or the user must wait as long as necessary |

You've mentioned that we're discussing corner cases, and I agree. May I add that slow connections are also unlikely themselves? Most Postgres users will be connecting over a LAN, which should work pretty well. To add to that, I'd guess in many cases where one connection is slow it shouldn't be unbearably slow, and even then it's very likely other connections are also slow (the server might be overloaded, but only one single connection being slow? sounds unlikely), so the application will already be suffering from a general slowdown.

But given this is all recently learned for me, can I ask you to correct my tables or propose other ways to look at this?

Yuras commented 3 years ago

The issue in persistent was fixed by not rolling back on async exceptions, which is the right thing (except it should not roll back on sync exceptions either). I didn't check the other issues, but I guess something similar should be done there.

You are right that we should not assume that all users use pooling. My point is that most of them do, and adding query cancellation will make things even worse for them: instead of a connection error (which they handle in some way or another anyway), they might get deadlocks (or long locks).

In your application, where you don't use pooling, you are probably already handling connection errors, and `another command is already in progress` is just another kind of them.

(The table looks about right btw)

pearcedavis commented 3 years ago

I agree with @Yuras that we shouldn't be attempting to roll back on exceptions, and that adding query cancellation is a bad idea in general. If automatic query cancellation is added, I'd like to retain the existing behaviour as well.

It seems that the main use-case here is using timeout with database queries, with or without transactions, while ensuring that the connection is in a reasonable state at the end. As has been mentioned, using a connection pool can be a reasonable way of dealing with this (because we accept that the connection is probably broken and let it be destroyed), but perhaps we could add some documentation/examples/helpers for working with timeouts and cancels? For instance, we could have something like withTimeout or withCancel that will check the connection and close/reset/cancel as appropriate.
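A minimal sketch of what I have in mind (withCancel is hypothetical, and cancelAndClear here is assumed to be this PR's cleanup with type Connection -> IO ()):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Exception (onException)
import Database.PostgreSQL.Simple
import System.Timeout (timeout)

-- Hypothetical helper: run an action and, if any exception interrupts it
-- (e.g. a timeout), cancel the in-flight query and drain the connection
-- so that it is safe to reuse afterwards.
withCancel :: Connection -> IO a -> IO a
withCancel conn act = act `onException` cancelAndClear conn

-- Usage: the connection stays usable even if the timeout fires.
fetchWithTimeout :: Connection -> IO (Maybe [Only Int])
fetchWithTimeout conn =
  timeout (1000000 * 10) $
    withCancel conn (query_ conn "SELECT 1 FROM pg_sleep(60)")
```

(All the caveats discussed above about a second asynchronous exception arriving during the cleanup would still apply, of course.)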

This might not be the right approach for this library, but we could also consider embedding connection pooling (via resource-pool) so that by default connections are pooled and automatically destroyed when an exception occurs. It might be hard to do this in a sensible and backwards compatible way, so maybe a separate library for combining resource-pool and postgresql-simple would make sense. (I know there are some libraries that technically do this already, but I'm thinking of something quite lightweight.)
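For comparison, the discard-on-exception behaviour falls out of resource-pool automatically (a sketch against resource-pool's classic createPool API; argument order and names vary between versions):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Pool (Pool, createPool, withResource)
import Database.PostgreSQL.Simple

-- One stripe of up to 10 connections, each kept idle for at most 60s.
mkPool :: IO (Pool Connection)
mkPool = createPool (connectPostgreSQL "dbname=test") close 1 60 10

main :: IO ()
main = do
  pool <- mkPool
  -- withResource destroys (here: closes) the connection if the action
  -- throws, so a possibly-broken connection never goes back to the pool.
  rs <- withResource pool $ \conn ->
          query_ conn "SELECT 1" :: IO [Only Int]
  print rs
```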

mzabani commented 3 years ago

Sorry for taking so long to say something, I've had a busy few weeks.

I certainly don't want to add risks of either incorrectness or non-interruptibility if there are concerns about them. The exploration in this thread was full of new things to me, and my takeaway is that using timeout, or forking work to threads that use resources the main thread still holds (not just postgresql-simple connections), will generally be something to be aware of and careful with, since it's not always possible to keep those resources valid for use in certain contexts without accepting questionable tradeoffs - of which this is certainly one.

> but perhaps we could add some documentation/examples/helpers for working with timeouts and cancels? For instance, we could have something like withTimeout or withCancel that will check the connection and close/reset/cancel as appropriate.

This sounds reasonable. Maybe we should just add cancelAndClear as it is in this PR (maybe with some modifications and a better name), an explanation of the problem, and samples along the lines of (rough draft):

```haskell
_ <- timeout (1000000 * 10) (functionThatUsesConnection conn)
cancelAndClear conn -- necessary in case a query was interrupted by the timeout
-- now you can use the connection again
```
mzabani commented 1 year ago

I have taken the liberty of implementing and pushing the new approach I suggested in https://github.com/haskellari/postgresql-simple/issues/69#issuecomment-1226440176. I thought it would be easier to see what I meant by pushing actual code.

Please feel free to let me know if the new approach is inadequate, or if you want me to describe it or parts of it in more detail.

Sorry to poke you again with this, but can @Yuras and @phadej take a look at the most recent commits?

Yuras commented 1 year ago

Looks good to me, definitely better than the initial design.

I have a question though: what will happen if the client is interrupted after sending the first part of a query, but before the rest is sent? I.e. the query might be large enough to be split into multiple TCP packets, so the server will receive the first part but won't receive the rest. Will cancellation work in that case? Will it fail (which is OK as well)? The concern is that the server (or libpq) will interpret this situation as there being no query in progress, and happily allow one to send another query, which will be interpreted by the server as a continuation of the previous one, leading to a disaster.

Yuras commented 1 year ago

(To clarify: I think the design is much better because if you don't reuse the connection on exception, then cancellation won't be performed, and nothing changes for you.)

mzabani commented 1 year ago

> I have a question though: what will happen if the client is interrupted after sending the first part of a query, but before the rest is sent?

I honestly have no idea what could happen if a query is partially sent, or if it is sent in its entirety but some libpq control data isn't. At least, IIUC, this problem already exists in some form: any next query would also run into some problem, although I can't say how/if things would differ from query cancelation. I don't know much about Haskell's FFI, but ideally PQsendQuery would not be interruptible (I say this very lightly; I have no experience with FFI and I'm sure there would be other consequences).

> The concern is that the server (or libpq) will interpret this situation as there being no query in progress, and happily allow one to send another query, which will be interpreted by the server as a continuation of the previous one, leading to a disaster.

This is way beyond my knowledge. I hope libpq/PostgreSQL's protocol keeps control and data flows separable, so it can signal e.g. beginning-of-statement, among other things. But I know nothing about the protocol, so I'm only speculating.

IIUC, these concerns aren't much aggravated by this PR, though?


Also, I've made some changes to the signatures of functions such as throwResultError and finishQueryWith'. These functions are in the Internal namespace, but they are exposed in the cabal file, so these changes could break external packages that use them.

It is very easy to revert them; the only loss is the nice error message tested in the test I added during my exploration of the codebase, but the original issue this PR proposes to fix should still be fixed without it.

So I'll revert them by (hopefully) the end of the week. The nice error message can come separately if we want it, I suppose.

mzabani commented 1 year ago

I've preserved the functions' signatures and tested this branch against a personal project's test suite and also against my work project's test suite (which is much larger); both passed.

This is ready for review.

mzabani commented 1 year ago

Friendly ping on my last review request. I understand everyone is busy, so if there's something I can do to facilitate this, please let me know.

mzabani commented 5 months ago

I'm still interested in merging this, and willing to put in more work if necessary. @phadej do you think you or someone else would have time to review this? I can merge master into this or rebase, whatever your preference might be.