lpsmith / postgresql-simple

Mid-level client library for accessing PostgreSQL from Haskell
Other
206 stars 71 forks source link

Resource vanished (failed to fetch file descriptor) problem #192

Closed mrkkrp closed 8 years ago

mrkkrp commented 8 years ago

We have been using the persistent library that's built on top of postgresql-simple for some time now without problems. However, recently we tried to switch to a newer resolver and our program stopped working.

Experimenting with different resolvers, I have found out that with nightly-2016-08-12 the program works just fine, while with nightly-2016-08-13 it produces the following error messages:

FATAL:  invalid frontend message type 21
Database.PostgreSQL.Simple.Internal.exec: resource vanished (failed to fetch file descriptor)

and sometimes

libpq: failed (server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
)

Repeatedly switching between the resolvers I confirmed that it's not an accident (the process takes 3 hours and finishes normally with nightly-2016-08-12, with nightly-2016-08-13 it fails after 1 second) and something has definitely changed between these releases and introduced that problem.

Now if we look at the diff between these resolvers, there is nothing related to persistent or postgresql-simple:

https://www.stackage.org/diff/nightly-2016-08-12/nightly-2016-08-13

So I'm a bit confused, what could cause these error messages, can you comment on these and maybe point out a direction for further investigation?

Thanks.

lpsmith commented 8 years ago

Well, the issue is that in the first case, libpq is receiving some junk on the socket it's connecting to, and terminating the connection, whereas the other side of the connection is terminating the connection in the second case. (Perhaps because it received some "junk"?)

Are you sure you are connecting to a postgresql server? You aren't connecting to something that is not a postgres server, are you?

I'd quickly examine the connection strings that are reaching connectPostgreSQL for differences. It may also be worth investigating the environment variables that affect that function.

If you don't spot something obvious rather immediately on that count, then it's definitely time to perform a packet capture (make sure SSL is disabled, if at all possible) and view it using wireshark.

mrkkrp commented 8 years ago

I checked the connection stirngs and they are identical. Then I used Wireshark and saw how persistent checks db to perform migration, then several queries from my application (I think a hundred maybe) and they worked OK, then finally for some reason the responses from Postrgesql changed to this:

postgres-tricky

mrkkrp commented 8 years ago

Anyway, thank you for your answer, I think it's near to impossible for you to help me further, so I'm closing the issue.

lpsmith commented 8 years ago

Oh, ok, so this is always a message from the backend. There was a similar problem in #117 which is a use-after-close file descriptor fault that's never been completely resolved, to my knowledge. This certainly seems like it could be another use-after-close fault in some other haskell package.

I'd take a closer look at the messages leading up to this. It also seems as though this capture encompasses multiple connections, so you want to learn how to filter the traffic that's specific to this connection, and also be sure that you are not filtering traffic that doesn't look like postgres traffic.

Incidentally, #117 is what convinced me to add use-after-close fault protection to linux-inotify, and the techniques I developed there got adopted by the Notification module, which solved at least one race condition I was suspicious of before, and another race condition I had no idea existed, see cocreature/hasql-notifications#1.

So while you are correct that it's unlikely I can help you too much further, I would appreciate being kept in the loop as you learn more about this problem. And there is actually a decent chance that can have further unexpected positive benefits.

mrkkrp commented 8 years ago

Sure. It will be interesting to understand what's going on and why it works with nightly-2016-08-12 and stops working with more recent resolvers. My application is multi-threaded and it uses connection pool shared among threads, maybe this is why it looks like multiple connections? About filtering: when I ran Wireshark today, I just filtered all messages with PGSQL protocol.

lpsmith commented 8 years ago

Yeah, you definitely want to adjust your filtering, because you want to see what the backend received that caused it to terminate the connection. Multithreaded apps are decidedly more vulnerable to use-after-close descriptor faults.

lpsmith commented 8 years ago

Another differential diagnosis for use-after-close faults is enabling SSL; if you start getting SSL errors instead of this error, then you know it's just about has to be use-after-close.

tomjaguarpaw commented 4 years ago

I'm seeing this a lot in Opaleye property tests. It occurs something like once in every ten runs. Each run creates its own independent database and connection, and does roughly 10,000 database queries, so I am seeing this one in every 100,000 queries, approximately. There's nothing obvious that I changed that I suspect triggered this, although it did start happening after I started increasing the complexity of the random queries that I generate.

tomjaguarpaw commented 4 years ago

I don't use threading (as far as I know -- perhaps QuickCheck does). I doubt the connection is being closed: I don't even close the connection in the test code at all, I just let program exit clean up.

lpsmith commented 4 years ago

Unfortunately, it doesn't seem likely that the surprisingly quick fix to tomjaguarpaw/haskell-opaleye#474 is directly relevant here: as the packet capture shows a error message being passed back to libpq, after which (presumably) the backend closes the connection, this is almost certainly not a segfault. Whereas I seriously doubt that @tomjaguarpaw would find any error message if he were to run a packet capture on the offending query.