"FATAL: invalid frontend message type 137"

ocharles commented 10 years ago

We're seeing this almost daily in our server logs, but unfortunately I have no idea to how to really reproduce it. It seems that under higher load, it seems to happen more frequently. #postgresql suggest it would be a concurrency problem, and our application does indeed have quite a lot of concurrency - I don't believe we are sharing connections between threads, but I'm going to verify that claim now.

lpsmith commented 10 years ago

Well, it should be safe to share postgresql-simple sockets between threads without locking... but there could well be bugs. Anyway, keep me informed when you find out more info.

lpsmith commented 10 years ago

Another idea, is you could run a packet capture on a postgres connection until this issue occurs, and then go back and examine the capture afterwards. Wireshark has a postgres protocol dissector, which would be quite useful for analyzing the capture.

lpsmith commented 10 years ago

Hmm, what version of postgresql-simple are you running? This didn't happen to start with postgresql-simple-0.4.3.0, did it? Because looking at your #postgresql conversation, the resource vanished exception message from Internal.exec should only be possible with 0.4.3.0. So it's not necessarily listen/notify causing the problem.

ocharles commented 10 years ago

I'm not sure when it's started, but I've certainly seen the invalid frontend message happen for months - so that's probably not a 4.3.0 thing. The error just happened again now, and I can confirm the server is compiled against 4.3.0. I've only reproduced it on the server, but I've seen it locally too - let me get the database imported locally and I'll see if I can Wireshark it.

ocharles commented 10 years ago

I have managed to get a minimal Wireshark capture. Here's the packet dump for just the connection that fails:

https://ocharles.org.uk/tmp/invalid-frontend-message.pcapng

Packet 3944 is the response "Invalid frontend message type 137"

You should be able to import this into Wireshark. I also have a much larger dump of the entire application, if you want to see what the other connections were doing.

I'll do some reading on this dump now and see if I can spot anything suspect, but I wanted to get this shared as soon as possible (and also to ensure I don't lose it!)

ocharles commented 10 years ago

It appears that "we" are somehow sending the following bytes:

89 04 70 69 6e 67                ..ping

I have no idea where this is coming from, but 89 is hexadecimal for 137 and that packet occurs immediately before PostgreSQL gets confused and closes the connection - so it would suggest this is the problem.

lpsmith commented 10 years ago

Well, I don't know of anything in postgresql-simple that would be violating the libpq abstraction and be sending that sort of data. So either it's a "bug" in libpq, or perhaps you have some C code somewhere that's sending some junk on the wrong file descriptor somehow? (Just spitballing here.)

I suppose it's possible to send junk on the file descriptor from your application without dropping down to C, via the Simple.Internal module. Are you using this module? And in particular, are you ever calling LibPQ.socket yourself?

Actually, connecting via SSL should be a decent way of diagnosing whether or not this string is being sent from libpq, or by another bit of code that should not be touching that file descriptor in that way.

If that string is originating from something other than libpq, then this should result in an SSL error, and is good evidence that the data is not being sent by pqsecure_write, which is an internal libpq function. This would strongly suggest something's accidentally using the Fd returned by PQ.socket or just treating some random integer as a file descriptor in some C code somewhere.

On the other hand, if you still get the invalid frontend message after turning on SSL, then this is strong evidence that that string is passing through pqsecure_write for unknown reasons.

(I do bring this up because, for a variety of reasons, I've thought about implementing the frontend/backend protocol in Haskell off and on since before the first release of postgresql-simple... but I really don't want to put the effort into supporting all the connection and authentication options. So I've wondered about using libpq to connect and authenticate and then basically take over the connection myself. Thanks in part to @rjmac's contributions to #25, I know how to do this now, and have written some (very preliminary and mostly unusable) code to that effect already.)

ocharles commented 10 years ago

It seems that it's probably websockets-snap that is causing this problem. 89 04 ping corresponds directly to a websockets ping, and we are using websockets. So I think I'm going to close this, as it doesn't seem to be a problem in your code.

My guess is the "ping" thread in websockets-snap (which fires every 15 seconds) is firing when the connection is closed, but a new socket has opened in the meantime on a different thread and got the same fd. I'm not entirely sure why that write would still succeed, but for now I've disabled that bit of code and will watch the logs for a year.

Thanks for teaching me about the pgsql protocol in wireshark though - I don't think I could have solved it without that!

dzhus commented 10 years ago

Do you think that the same websockets-snap pinger issue could cause no connection to the server errors as well?

Sep 22 10:34:47 carma1 carma[40706]: PostgresCRUD/insert {"tid":"ThreadId 8427462","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 10:47:03 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 8477284","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 11:29:02 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 8661478","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 12:02:32 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 8775273","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 12:21:11 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 8827198","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 12:54:05 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 8949791","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 14:17:32 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 9244298","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 14:48:28 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 9350457","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}

I'm seeing lots of these in the logs of our application which uses postgresql-simple and snaplet-postgresql-simple for PG connection pools. websockets-snap are also used. It looks almost as if the resource-pool inside postgresql-simple provides connections in non-OK status (http://doxygen.postgresql.org/fe-exec_8c.html#af9a6a3980b0142764b437eabc6da1a3c)

lpsmith commented 10 years ago

@dzhus, are you aware of this issue? mightybyte/snaplet-postgresql-simple#6

If you suspect something like what ocharles observed, I would try to take a packet capture of database connections until you catch this error in action.

dzhus commented 10 years ago

@lpsmith Thanks for pointing this out, it seems to be at least tangential to the issue we're having.

lpsmith / postgresql-simple

"FATAL: invalid frontend message type 137" #117