racket / racket-pkg-website

A frontend for the Racket Package Catalog.

Unresponsive server #29

Open · opened by tonyg 7 years ago

tonyg commented 7 years ago

The racket-pkg-website component appears not to be accepting new connections. This is the last output in the log (from about three hours ago):

```
2016-12-21 15:18:29.735259500 custom-port-pipe-read: input port is closed
2016-12-21 15:18:29.735260500   context...:
2016-12-21 15:18:29.735344500    /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:185:2: read-header
2016-12-21 15:18:29.735345500    /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:185:2: read-header
2016-12-21 15:18:29.735348500    /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:185:2: read-header
2016-12-21 15:18:29.735349500    .../more-scheme.rkt:261:28
2016-12-21 15:18:29.735349500    ...higher-order.rkt:340:33
2016-12-21 15:18:29.735349500    /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/private/dispatch-server-with-connect-unit.rkt:131:8
```
tonyg commented 7 years ago

The process is at 100% CPU, and `svc -du ...` didn't do anything.

tonyg commented 7 years ago

SIGTERM also didn't do anything.

tonyg commented 7 years ago

SIGKILL has done the trick, and the server is restarting now.

tonyg commented 6 years ago

See this comment and this comment on #24, which relate more to this issue than that one.

It did its `custom-port-pipe-read` lockup-and-eventually-terminate dance again about 40 minutes ago. This time, it was only stuck at 100% CPU for about 26 seconds (timestamps are UTC+1):

```
2018-06-03 11:08:35.468270500 custom-port-pipe-read: input port is closed
2018-06-03 11:08:35.468271500   context...:
2018-06-03 11:08:35.468272500    /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:185:2: read-header
2018-06-03 11:08:35.468304500    [repeats 5 more times]
2018-06-03 11:08:35.468305500    /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:31:0
2018-06-03 11:08:35.468310500    ...higher-order.rkt:357:33
2018-06-03 11:08:35.468314500    /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/private/dispatch-server-with-connect-unit.rkt:131:8
2018-06-03 11:09:01.046719500 about to suspend in atomic mode
```
greghendershott commented 6 years ago

Does the "stack trace" point here? If so, here are some brainstorming thoughts, based on reading the code a little and what I (think) I know about HTTP.

Even if any of this is relevant, I don't claim to see how it would cause 100% CPU and a "lockup".

(I hope this is at all helpful. If it's not applicable and just a distraction, my apologies!)

tonyg commented 6 years ago

I wrote this on Slack the other day:

the racket-pkg-website lockups seem to be more frequent than I realised, and might involve more than one problem. Since I added monitoring scripts a couple of days ago, I've seen a lockup-with-100%-CPU-usage happen several times, with durations ranging from tens of seconds to tens of minutes. Each time, the logs report `custom-port-pipe-read: input port is closed`, followed by a period of 100% CPU usage and, some seconds or minutes later, "about to suspend in atomic mode" and termination. Any ideas what could be happening? The `custom-port-pipe-read` message appears to originate deep in the bowels of the C code. Could it be to do with the SSL wrappers around an accepted socket?

So, @greghendershott, I think we're thinking along similar lines. On the live server, Apache does SSL termination in front of Racket, but the Racket service itself also listens via SSL on localhost. So it's probably not raw user-originated requests hitting Racket, but probably-clean (?) HTTPS requests from Apache.
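
For concreteness, here is a minimal sketch, assuming a hypothetical listener on port 8443 and placeholder certificate paths (this is not the actual racket-pkg-website code), of how the openssl library wraps an accepted TCP socket in SSL ports. As noted below, the resulting wrappers are "custom" ports:

```racket
#lang racket
;; Hypothetical sketch, not the racket-pkg-website code. The port
;; number and certificate paths are placeholders.
(require openssl)

(define ctx (ssl-make-server-context))
(ssl-load-certificate-chain! ctx "server-cert.pem")
(ssl-load-private-key! ctx "private-key.pem")

;; Accept a raw TCP connection on localhost only.
(define listener (tcp-listen 8443 4 #t "127.0.0.1"))
(define-values (raw-in raw-out) (tcp-accept listener))

;; Wrap the accepted socket in SSL ports; these wrappers are "custom"
;; ports, which is where the custom-port-pipe-read machinery comes in.
(define-values (ssl-in ssl-out)
  (ports->ssl-ports raw-in raw-out
                    #:mode 'accept
                    #:context ctx
                    #:close-original? #t))
```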

One thing we could try is disabling the SSL used internally between Apache and Racket, and using plain http for that hop (see the sketch below). I don't like this, because then we'd be leaving the bug undiscovered and latent. But if we manage to reproduce the bug in a smaller setting, or if the production issues become too severe, we can switch to http, and there's some chance that will mitigate the issue.
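
To make the proposed switch concrete, here is a hypothetical sketch (not the real service configuration; the handler, port, and certificate paths are placeholders) of serving via SSL on localhost with web-server's `serve/servlet`. The plain-http fallback would amount to setting `#:ssl?` to `#f` and dropping the certificate arguments:

```racket
#lang racket
;; Hypothetical sketch, not the actual racket-pkg-website startup code.
(require web-server/servlet-env
         web-server/http/xexpr)

(serve/servlet
 (lambda (req) (response/xexpr '(html (body "hello"))))
 #:listen-ip "127.0.0.1"       ; reachable only from the local Apache
 #:port 8443                   ; placeholder port
 #:ssl? #t                     ; proposed fallback: #f, plain http
 #:ssl-cert "server-cert.pem"  ; placeholder certificate
 #:ssl-key "private-key.pem"   ; placeholder key
 #:servlet-regexp #rx""        ; route every request to the handler
 #:launch-browser? #f)
```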

mflatt commented 6 years ago

An SSL port is a "custom" port, so that much makes sense. But the "input port is closed" error means that `close-input-port` was called on a port (and not, say, that the network connection just behaved strangely).

While `custodian-shutdown-all` can close some kinds of ports, a "custom" input port isn't registered with a custodian that way, so I think `close-input-port` really had to be called, from the runtime's perspective.
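
Both points can be checked with a small, hypothetical demo (the port name and pipe wiring are made up for illustration): a custodian shutdown leaves a custom input port open, while an explicit `close-input-port` makes any later read fail with the "input port is closed" error seen in the logs:

```racket
#lang racket
;; Hypothetical demo; 'demo-custom-port and the pipe wiring are made up.
(define cust (make-custodian))
(define-values (pin pout) (make-pipe))

;; A hand-rolled custom input port, standing in for an SSL port.
(define custom-in
  (parameterize ([current-custodian cust])
    (make-input-port 'demo-custom-port
                     (lambda (bstr) (read-bytes-avail! bstr pin))
                     #f                ; peeking via the read procedure
                     (lambda () (close-input-port pin)))))

;; A custodian shutdown does not close the custom port ...
(custodian-shutdown-all cust)
(port-closed? custom-in)   ; => #f

;; ... but an explicit close does, and a later read then raises the
;; "input port is closed" error.
(close-input-port custom-in)
(read-byte custom-in)      ; error: read-byte: input port is closed
```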