Open tonyg opened 7 years ago
The process is at 100% CPU and svc -du ... didn't do anything. SIGTERM also didn't do anything. SIGKILL has done the trick, and the server is restarting now.
See this comment and this comment on #24, which relate more to this issue than that one.
It did its custom-port-pipe-read lockup-and-eventually-terminate dance again about 40 minutes ago. This time, it was only stuck at 100% CPU for about 26 seconds (timestamps UTC+1):
2018-06-03 11:08:35.468270500 custom-port-pipe-read: input port is closed
2018-06-03 11:08:35.468271500 context...:
2018-06-03 11:08:35.468272500 /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:185:2: read-header
2018-06-03 11:08:35.468304500 [repeats 5 more times]
2018-06-03 11:08:35.468305500 /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:31:0
2018-06-03 11:08:35.468310500 ...higher-order.rkt:357:33
2018-06-03 11:08:35.468314500 /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/private/dispatch-server-with-connect-unit.rkt:131:8
2018-06-03 11:09:01.046719500 about to suspend in atomic mode
Does the "stack trace" point here? If so, here are some brainstorming thoughts, based on reading the code a little and what I (think I) know about HTTP.
Could this be due to a malformed request that impacts reading headers?
Is the pkg site's Racket web-server behind some other server like Apache or nginx which might check headers (?), or, is it direct/"raw"?
Note that headers are read at the start of a request. But also: If a request is Transfer-Encoding: chunked, then after the chunk data/body is read, there can be more headers at the end. The Racket web server does try to read these, too.
Speaking of chunked encoding: In that case there will be no Content-Length header before the data. I notice there's something going on with adjusting the connection timeout based on the Content-Length here. I don't understand what that means, but I want to point out that this won't happen -- the default timeout will be used -- in the chunked transfer scenario.
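To make the chunked-transfer point concrete, here is a minimal Python sketch of how a chunked body is framed. It is not the Racket web server's code, and the function name read_chunked and the X-Checksum trailer are hypothetical, chosen only for illustration. It shows that the reader never sees a Content-Length and that a second round of header reading happens after the final zero-size chunk.

```python
import io


def read_chunked(stream):
    """Read a chunked HTTP body from a binary stream.

    Returns (body, trailers). Illustrative only: no size limits, no timeouts.
    """
    body = bytearray()
    while True:
        size_line = stream.readline()
        if not size_line.endswith(b"\r\n"):
            raise ValueError("truncated chunk-size line")
        # The chunk size is hexadecimal; anything after ';' is a chunk extension.
        size_text = size_line.split(b";", 1)[0].strip().decode("ascii")
        size = int(size_text, 16)
        if size == 0:
            break  # the zero-size chunk marks the end of the body data
        data = stream.read(size)
        if len(data) != size:
            raise ValueError("truncated chunk data")
        if stream.read(2) != b"\r\n":
            raise ValueError("missing CRLF after chunk data")
        body += data

    # After the last chunk come optional trailer header lines, then a blank line.
    trailers = {}
    while True:
        line = stream.readline()
        if line == b"\r\n":
            break
        if not line.endswith(b"\r\n"):
            raise ValueError("truncated trailer section")
        name, sep, value = line[:-2].partition(b":")
        if not sep:
            raise ValueError("malformed trailer line")
        key = name.strip().lower().decode("ascii")
        trailers[key] = value.strip().decode("latin-1")
    return bytes(body), trailers


# Example body: two chunks, then one (hypothetical) trailer header.
raw = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\nX-Checksum: abc123\r\n\r\n"
body, trailers = read_chunked(io.BytesIO(raw))
print(body, trailers)
```

The total body length is only known once the zero-size chunk arrives, which is why any logic keyed on Content-Length cannot apply to this kind of request.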
Even if any of this is relevant, I'm not saying I see how it causes 100% CPU and a "lockup".
(I hope this is helpful, at all. If it's N/A and a distraction, my apologies!)
I wrote this on slack the other day:
the racket-pkg-website lockups seem to be more frequent than I realised, and also might involve more than one problem. Since I added monitoring scripts a couple of days ago, I've seen a lockup-with-100%-CPU-usage happen several times, with durations ranging from tens of seconds to tens of minutes. Each time, the logs report "custom-port-pipe-read: input port is closed", followed by a period of 100% CPU usage, and some seconds/minutes later, "about to suspend in atomic mode" and termination. Any ideas what could be happening? The custom-port-pipe-read message looks to originate deep in the bowels of the C code. Could it be to do with the SSL wrappers around an accepted socket?
So, @greghendershott, I think we're thinking along similar lines. On the live server, Apache is doing SSL termination in front of Racket, but the Racket service is listening via SSL on localhost. So it's probably not raw user-originated requests hitting Racket, but instead probably-clean (?) https requests from Apache hitting Racket.
One thing we could try is disabling the SSL used internally between Apache and Racket, and using plain http for that step. I don't like this because then we'll be leaving the bug undiscovered and latent. But if we managed to reproduce the bug in a smaller situation, or if the production issues become too great, then we can switch to http and there's some chance that it will mitigate the issue.
An SSL port is a "custom" port, so that much makes sense. But the "input port is closed" error means that close-input-port was called on a port (and not, say, that the network connection just behaved strangely).
While custodian-shutdown-all can close some kinds of ports, a "custom" input port isn't registered with a custodian that way, so I think close-input-port really had to be called from the runtime's perspective.
The racket-pkg-website part appears not to be taking new connections. This is the last output from the log (about 3 hours ago):