SuzanneSoy opened this issue 8 years ago
Worth noting, http://pkgs.racket-lang.org says:
package build status being checked for updates
And the Checksum field of that package on https://pkgs.racket-lang.org contains:
Checksum: e5178c7bd062d6a27300b223a6d0442633421fa1
Update Error:
tcp-connect: connection failed
detail: host not found
address: github.com
port number: 443
step: 1
system error: System error; errno=-11
context...:
/home/pkgserver/racket/collects/openssl/mzssl.rkt:1670:0: ssl-connect
/home/pkgserver/racket/collects/openssl/../racket/contract/private/arrow-val-first.rkt:306:25
/home/pkgserver/racket/collects/net/http-client.rkt:235:0
/home/pkgserver/racket/collects/racket/contract/private/arrow-val-first.rkt:397:3
/home/pkgserver/racket/collects/net/url.rkt:144:0: getpost-impure-port
/home/pkgserver/racket/collects/net/url.rkt:214:0: getpost-pure-port
/home/pkgserver/racket/collects/net/../racket/contract/private/arrow-val-first.rkt:306:25
/home/pkgserver/racket/collects/net/git-checkout.rkt:264:0: done-step
/home/pkgserver/racket/collects/net/git-checkout.rkt:58:8
/home/pkgserver/racket/collects/net/git-checkout.rkt:39:2: retry-loop
/home/pkgserver/racket/collects/pkg/private/download.rkt:101:2: download!
/home/pkgserver/racket/collects/pkg/private/download.rkt:93:0: download-repo!
/home/pkgserver/racket/collects/pkg/private/stage.rkt:299:11
/home/pkgserver/racket/collects/pkg/private/stage.rkt:106:0: stage-package/info
/home/pkgserver/racket/collects/pkg/private/stage.rkt:693:0: pkg-stage
/home/pkgserver/racket/collects/pkg/private/content.rkt:39:0: get-pkg-content
...
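For reference, the failing step here is the catalogue fetching the package's source over HTTPS, which (per the trace above) goes through net/git-checkout. A rough standalone sketch of the same kind of fetch; the repository name and destination directory are placeholders, not this package's actual source:

#lang racket
;; Sketch only: fetch a repository over HTTPS the way the package server
;; does via net/git-checkout.  Repository and destination are placeholders.
(require net/git-checkout)

;; Connects to github.com:443 and returns the commit hash checked out
;; into the destination directory.
(git-checkout "github.com" "some-user/some-package"
              #:dest-dir "/tmp/pkg-src"
              #:transport 'https)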
Thanks for the report. There's clearly a file-descriptor leak in the system. Unfortunately, I idiotically restarted the service just now before I'd properly gathered enough information about the problem :-( so we'll have to wait for it to happen again! On the plus side, now is probably a good time to retry the "update my packages" option - although, of course, after the restart, it is happening automatically as part of a full database refresh.
Thanks for restarting it anyway. Shall I leave this issue open until someone else hits the problem, or better close it now and re-open later?
Leaving it open is a good idea I think. Thanks!
Just restarted the service after it had exhausted its fds. Looking at /proc, all the fds but a handful were connected sockets. I neglected to use netstat to investigate the sockets themselves! Oops. Sigh. Next time.
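(For anyone repeating that check, here is a rough Linux-only sketch of classifying a process's descriptors by reading /proc; the pid is a placeholder, and it assumes resolve-path returns the symlink target, such as "socket:[...]", for /proc fd entries.)

#lang racket
;; Sketch: classify a process's open fds via /proc/<pid>/fd, where each
;; entry is a symlink whose target names the resource, e.g. "socket:[123]".
(define pid 32206)                      ; placeholder pid
(define fd-dir (build-path "/proc" (number->string pid) "fd"))

(define counts (make-hash))
(for ([fd (in-list (directory-list fd-dir))])
  (define target
    (with-handlers ([exn:fail? (lambda (_) "unknown")])
      (path->string (resolve-path (build-path fd-dir fd)))))
  (define kind (cond [(regexp-match? #rx"^socket:" target) 'socket]
                     [(regexp-match? #rx"^pipe:" target)   'pipe]
                     [else                                 'other]))
  (hash-update! counts kind add1 0))

(for ([(kind n) (in-hash counts)])
  (printf "~a\t~a\n" n kind))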
We are still seeing occasional lockups, some of which are probably fd exhaustion errors.
I have recently added a cron job which monitors the service, capturing (a) stack traces of running threads and (b) recent log entries upon service failure, before restarting the service.
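(For the curious: one way to capture stack traces of running threads from inside a Racket process is via continuation marks. A rough sketch, not the actual monitoring hook, assuming the service's threads run under a custodian subordinate to the current one:)

#lang racket
;; Sketch: dump the current context of every live thread managed by a
;; given custodian.  service-custodian is an assumed handle, not real
;; monitoring code.
(define (dump-thread-stacks service-custodian [out (current-error-port)])
  (for ([item (in-list (custodian-managed-list service-custodian
                                               (current-custodian)))]
        #:when (thread? item))
    (fprintf out "=== thread ~a ===\n" (eq-hash-code item))
    ;; continuation-marks accepts a thread; ->context yields a list of
    ;; (name . srcloc) frames like the traces quoted in this issue.
    (for ([frame (in-list (continuation-mark-set->context
                           (continuation-marks item)))])
      (define loc (cdr frame))
      (fprintf out "  ~a  ~a\n"
               (or (car frame) "???")
               (if loc
                   (format "~a:~a:~a"
                           (srcloc-source loc)
                           (srcloc-line loc)
                           (srcloc-column loc))
                   "")))))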
It fired last night! However, from looking at the logs, it doesn't seem to be the fd exhaustion issue; or at least, it's not immediately apparent that it's fd exhaustion in the way it has been for other instances.
The facts:
Here is an excerpt from the logs before and immediately after the outage (NB. time zone in the excerpt is UTC):
2018-06-01 22:21:24.481509500 Fetching package list from file:///var/lib/pkgserver/public_html/pkg-index-static/pkgs-all.json.gz
2018-06-01 22:21:25.296628500 Fetched package list containing 1244 packages.
2018-06-01 22:26:24.481738500 Fetching package list from file:///var/lib/pkgserver/public_html/pkg-index-static/pkgs-all.json.gz
2018-06-01 22:26:25.299457500 Fetched package list containing 1244 packages.
2018-06-01 22:31:24.482254500 Fetching package list from file:///var/lib/pkgserver/public_html/pkg-index-static/pkgs-all.json.gz
2018-06-01 22:31:24.981810500 custom-port-pipe-read: input port is closed
2018-06-01 22:31:24.981812500 context...:
2018-06-01 22:31:24.981812500 /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:185:2: read-header
2018-06-01 22:31:24.981842500 [repeats 3 more times]
2018-06-01 22:31:24.981843500 /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:31:0
2018-06-01 22:31:24.981843500 ...higher-order.rkt:357:33
2018-06-01 22:31:24.981844500 /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/private/dispatch-server-with-connect-unit.rkt:131:8
2018-06-01 22:36:25.559150500 about to suspend in atomic mode
2018-06-01 22:36:25.682688500 =============================================
2018-06-01 22:36:26.702934500 [loading /home/pkgserver/racket-pkg-website/src/site.rkt]
2018-06-01 22:36:26.738319500 [loading /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/compiled/servlet_rkt.zo]
2018-06-01 22:36:26.739589500 [loading /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/compiled/bindings_rkt.zo]
Operation is normal until 22:31:24, at which point an exception is raised and the process becomes unresponsive for five minutes exactly before spitting out "about to suspend in atomic mode" and finally responding to the signals delivered to it by the service restart triggered by the cron job.
This is not what I expected at all. Previously, I had seen past log entries remarking (repeatedly) that tcp-accept could not do its job because of a lack of file descriptors. But here we have no apparent fd issue, and instead a total lockup lasting exactly five minutes.
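(To make that earlier symptom concrete: in an accept loop along the lines sketched below, exhausting the process's descriptors shows up as tcp-accept raising a network error on every incoming connection. handle-connection is a placeholder, not the web server's actual code.)

#lang racket
;; Sketch of where fd exhaustion surfaces in a plain accept loop.
(define (handle-connection in out)   ; placeholder handler
  (close-input-port in)
  (close-output-port out))

(define listener (tcp-listen 8080 64 #t))
(let loop ()
  (with-handlers ([exn:fail:network?
                   (lambda (e)
                     ;; With no free descriptors, every accept fails here,
                     ;; producing repeated log entries like those described.
                     (log-error "tcp-accept failed: ~a" (exn-message e))
                     (sleep 1))])
    (let-values ([(in out) (tcp-accept listener)])
      (thread (lambda () (handle-connection in out)))))
  (loop))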
Are there two bugs, or just one? I suspect two. Hopefully as the service monitor script does its job we will learn more.
It's locked up again right now - well, rather, it's unresponsive (same "custom-port-pipe-read" log entry as before, but no "about to suspend in atomic mode" yet) and using 100% CPU. Only about 52 fds open, nothing unusual. Attaching with GDB gives unhelpful results - probably in JITted code?
(gdb) bt
#0 0x00007ffe16a57d9c in ?? ()
#1 0x00007ffe16a1d748 in ?? ()
#2 0x00007ffe16a1d888 in ?? ()
#3 0x0000000000000001 in ?? ()
#4 0xbff0000000000000 in ?? ()
#5 0x0000000000000000 in ?? ()
Eventually, while in GDB, the process gets a SIGTERM (presumably from svc?) and would have terminated if I hadn't been attached to it.
0x7ffe16a57d84: mov %rdx,0x8(%rdi)
0x7ffe16a57d88: test %rsi,%rsi
0x7ffe16a57d8b: jne 0x7ffe16a57f29
0x7ffe16a57d91: xor %eax,%eax
0x7ffe16a57d93: test %r10,%r10
0x7ffe16a57d96: jne 0x7ffe16a57d9c
0x7ffe16a57d98: mov $0x60,%al
0x7ffe16a57d9a: syscall
=> 0x7ffe16a57d9c: add $0x8,%rsp
0x7ffe16a57da0: pop %rbx
0x7ffe16a57da1: pop %r12
0x7ffe16a57da3: pop %r13
0x7ffe16a57da5: pop %r14
0x7ffe16a57da7: pop %r15
0x7ffe16a57da9: pop %rbp
0x7ffe16a57daa: retq
(gdb) info registers
rax 0x0 0
rbx 0x7b 123
rcx 0xffffffffffffffff -1
rdx 0x6cb38 445240
rsi 0x0 0
rdi 0x7ffe16a1d7c0 140729278126016
rbp 0x7ffe16a1d7b0 0x7ffe16a1d7b0
rsp 0x7ffe16a1d780 0x7ffe16a1d780
r8 0x5b12af40 1527951168
r9 0x1bd5960 29186400
r10 0x0 0
r11 0x246 582
r12 0xffffffffffdff000 -2101248
r13 0x0 0
r14 0x7005c034 1879425076
r15 0x2 2
rip 0x7ffe16a57d9c 0x7ffe16a57d9c
eflags 0x246 [ PF ZF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) info thread
Id Target Id Frame
* 1 process 32206 "live.rkt" 0x00007ffe16a57d9c in ?? ()
OK - the symptoms I've seen so far, with the monitoring scripts in place, look more like #29 than this issue, so I'll continue analysis of that branch over there. I'll return here if/when the fd exhaustion thing manifests again.
I just found this, as I am having the same problem. I don't know when it happened exactly. I was expecting a prototype of an API/brain to be running ~20 days later, but it wasn't.
Since I had it running for a simple test, it was just in a tmux session. When I reattached to the session it was blasting some "too many open files" message as fast as it could print to the tmux buffer. The weird thing is, access to the https server is restricted at another level by IP, and there is simply no reason for it to have had very many FDs open at that time.
I reacted to this jarring scene with Ctrl-c before I had the good sense to investigate further. I think I read #29 earlier today. It appears that this is related to a custom-port-pipe-read error. In that issue (iirc) the problem was maybe identified as related to the custom ports for SSL. I am using certs directly with the Racket web server.
Do you, @tonyg, think that it might be related to the #:ssl functionality of the Racket web server library too?
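For context, terminating TLS directly in the Racket web server typically looks something like the sketch below (a minimal example; the handler and the certificate/key paths are placeholders, not the commenter's actual setup). If the custom-port problem discussed in #29 really is SSL-related, a setup like this is the kind that would exercise those code paths.

#lang racket
;; Minimal sketch of serving HTTPS from the Racket web server itself;
;; handler and file paths are placeholders.
(require web-server/servlet
         web-server/servlet-env)

(define (start req)
  (response/xexpr '(html (body (p "hello over TLS")))))

(serve/servlet start
               #:port 8443
               #:listen-ip #f            ; listen on all interfaces
               #:launch-browser? #f
               #:servlet-regexp #rx""    ; route every request to start
               #:ssl? #t
               #:ssl-cert "server-cert.pem"
               #:ssl-key "private-key.pem")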
Since I need my API in immediate use, I am going to handle the TLS with Nginx. If I haven't been back here in a year or so, feel free to ping me; if I haven't had any more of these crashes, I might forget to come back and say so.
I tried updating my packages, and after the next index refresh, raco pkg install still used ye olde version. I went to the package's page to check, and it still showed the old commit hash, and (possibly after I refreshed the page?) there was the error message:
I clicked again on the "refresh my packages" and got this error:
Feel free to close this if it was just a transient problem. I just thought it would be best to report the issue.