racket / racket-pkg-website

A frontend for the Racket Package Catalog.

Potential file descriptor leak #24

Open SuzanneSoy opened 8 years ago

SuzanneSoy commented 8 years ago

I tried updating my packages, but after the next index refresh, raco pkg install still used ye olde version. I went to the package's page to check, and it still showed the old commit hash, and (possibly after I refreshed the page?) there was this error message:

Checksum error : The package checksum does not match the package source code.


I clicked the "refresh my packages" button again and got this error:

Exception
The application raised an exception with the message:

jsonp-rpc!: Illegal response to "https://localhost:9004/jsonp/update?callback=callback1473965235923": "<html><head><title>Servlet Error</title><link rel=\"stylesheet\" href=\"/error.css\"/></head><body><div class=\"section\"><div class=\"title\">Exception</div><p>The application raised an exception with the message:<pre>open-input-file: cannot open input file\n  path: /var/lib/pkgserver/pkg-index/users.new/georges.duperon@gmail.com\n  system error: Too many open files; errno=24</pre></p><p>Stack trace:<pre>call-with-input-file*61 at:\n  line 102, column 2, in file /home/pkgserver/racket/collects/racket/private/kw-file.rkt\n&lt;unknown procedure&gt; at:\n  line 49, column 5, in file /home/pkgserver/pkg-index/official/jsonp.rkt\n&lt;unknown procedure&gt; at:\n  line 40, column 2, in file /home/pkgserver/pkg-index/official/jsonp.rkt\n&lt;unknown procedure&gt; at:\n  line 340, column 33, in file /home/pkgserver/racket/collects/racket/contract/private/arrow-higher-order.rkt\n&lt;unknown procedure&gt; at:\n  line 340, column 33, in file /home/pkgserver/racket/collects/racket/contract/private/arrow-higher-order.rkt\n</pre></p></div></body></html>"
Stack trace:

<unknown procedure> at:
  line 129, column 35, in file /home/pkgserver/racket-pkg-website/src/site.rkt
<unknown procedure> at:
  line 340, column 33, in file /home/pkgserver/racket/collects/racket/contract/private/arrow-higher-order.rkt
<unknown procedure> at:
  line 58, column 2, in file /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/dispatchers/dispatch-servlets.rkt
<unknown procedure> at:
  line 340, column 33, in file /home/pkgserver/racket/collects/racket/contract/private/arrow-higher-order.rkt
<unknown procedure> at:
  line 340, column 33, in file /home/pkgserver/racket/collects/racket/contract/private/arrow-higher-order.rkt
<unknown procedure> at:
  line 340, column 33, in file /home/pkgserver/racket/collects/racket/contract/private/arrow-higher-order.rkt
<unknown procedure> at:
  line 131, column 8, in file /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/private/dispatch-server-with-connect-unit.rkt

Feel free to close this if it was just a transient problem. I just thought it would be best to report the issue.

SuzanneSoy commented 8 years ago

Worth noting, http://pkgs.racket-lang.org says:

package build status being checked for updates

SuzanneSoy commented 8 years ago

And the Checksum field of that package on https://pkgs.racket-lang.org contains:

Checksum:   e5178c7bd062d6a27300b223a6d0442633421fa1
Update Error:
tcp-connect: connection failed
  detail: host not found
  address: github.com
  port number: 443
  step: 1
  system error: System error; errno=-11
  context...:
   /home/pkgserver/racket/collects/openssl/mzssl.rkt:1670:0: ssl-connect150
   /home/pkgserver/racket/collects/openssl/../racket/contract/private/arrow-val-first.rkt:306:25
   /home/pkgserver/racket/collects/net/http-client.rkt:235:0
   /home/pkgserver/racket/collects/racket/contract/private/arrow-val-first.rkt:397:3
   /home/pkgserver/racket/collects/net/url.rkt:144:0: getpost-impure-port
   /home/pkgserver/racket/collects/net/url.rkt:214:0: getpost-pure-port
   /home/pkgserver/racket/collects/net/../racket/contract/private/arrow-val-first.rkt:306:25
   /home/pkgserver/racket/collects/net/git-checkout.rkt:264:0: done-step
   /home/pkgserver/racket/collects/net/git-checkout.rkt:58:8
   /home/pkgserver/racket/collects/net/git-checkout.rkt:39:2: retry-loop
   /home/pkgserver/racket/collects/pkg/private/download.rkt:101:2: download!
   /home/pkgserver/racket/collects/pkg/private/download.rkt:93:0: download-repo!24
   /home/pkgserver/racket/collects/pkg/private/stage.rkt:299:11
   /home/pkgserver/racket/collects/pkg/private/stage.rkt:106:0: stage-package/info46
   /home/pkgserver/racket/collects/pkg/private/stage.rkt:693:0: pkg-stage62
   /home/pkgserver/racket/collects/pkg/private/content.rkt:39:0: get-pkg-content20
   ...
tonyg commented 8 years ago

Thanks for the report. There's clearly a file-descriptor leak in the system. Unfortunately, I idiotically restarted the service just now before I'd properly gathered enough information about the problem :-( so we'll have to wait for it to happen again! On the plus side, now is probably a good time to retry the "update my packages" option - although, of course, after the restart, it is happening automatically as part of a full database refresh.

SuzanneSoy commented 8 years ago

Thanks for restarting it anyway. Shall I leave this issue open until someone else hits the problem, or would it be better to close it now and re-open it later?

tonyg commented 8 years ago

Leaving it open is a good idea I think. Thanks!

tonyg commented 7 years ago

Just restarted the service after it had exhausted its fds. Looking at /proc, all but a handful of the fds were connected sockets. I neglected to use netstat to investigate the sockets themselves! Oops. Sigh. Next time.
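For next time, a quick checklist of commands to snapshot the fd and socket state before restarting (assumes a Linux host; `$PID` is a stand-in for the server's pid, and `ss` may need substituting with `netstat` on older systems):

```shell
#!/bin/sh
# Snapshot the fd table of a process before restarting it.
PID=$$                                        # replace with the server's pid
ls -l /proc/"$PID"/fd | head -n 20            # what kinds of fds are open?
ls /proc/"$PID"/fd | wc -l                    # how many in total?
grep 'open files' /proc/"$PID"/limits         # the ceiling being hit (errno=24)
# For the socket fds specifically: which peers, and in what TCP state?
ss -tnp 2>/dev/null | head -n 20
```

Saving this output to a timestamped file alongside the restart would make the connected-sockets observation above reproducible.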

tonyg commented 6 years ago

A year and more has passed

We are still seeing occasional lockups, some of which are probably fd exhaustion errors.

I have recently added a cron job which monitors the service, capturing (a) stack traces of running threads and (b) recent log entries upon service failure, before restarting the service.
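A minimal sketch of such a watchdog, written as a dry run so the restart and capture steps are only echoed (the probe URL, service name, and restart commands are placeholders, not the actual production setup):

```shell
#!/bin/sh
# Hypothetical watchdog sketch (dry run): probe the site; on failure,
# report what diagnostics would be captured before restarting.
URL="${URL:-http://localhost:8080/}"   # placeholder probe endpoint
SVC="${SVC:-pkg-website}"              # placeholder service name

probe() {
  # A real version would be: curl -fsS --max-time 30 "$URL" >/dev/null 2>&1
  false   # simulate an unresponsive service for this sketch
}

watchdog() {
  if ! probe; then
    stamp=$(date +%Y%m%d-%H%M%S)
    echo "[$stamp] $SVC unresponsive at $URL"
    echo "would capture: /proc/<pid>/fd listing, thread stack traces, recent log tail"
    echo "would run: svc -t /etc/service/$SVC   # or: systemctl restart $SVC"
  fi
}

watchdog
```

The important property is that diagnostics are captured *before* the restart, so the evidence that was lost in the earlier restarts survives.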

It fired last night! However, judging from the logs, it doesn't seem to be the fd-exhaustion issue; at least, it's not as immediately apparent as it has been in other instances.

The facts:

Here is an excerpt from the logs before and immediately after the outage (NB. time zone in the excerpt is UTC):

2018-06-01 22:21:24.481509500 Fetching package list from file:///var/lib/pkgserver/public_html/pkg-index-static/pkgs-all.json.gz
2018-06-01 22:21:25.296628500 Fetched package list containing 1244 packages.
2018-06-01 22:26:24.481738500 Fetching package list from file:///var/lib/pkgserver/public_html/pkg-index-static/pkgs-all.json.gz
2018-06-01 22:26:25.299457500 Fetched package list containing 1244 packages.
2018-06-01 22:31:24.482254500 Fetching package list from file:///var/lib/pkgserver/public_html/pkg-index-static/pkgs-all.json.gz
2018-06-01 22:31:24.981810500 custom-port-pipe-read: input port is closed
2018-06-01 22:31:24.981812500   context...:
2018-06-01 22:31:24.981812500    /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:185:2: read-header
2018-06-01 22:31:24.981842500    [repeats 3 more times]
2018-06-01 22:31:24.981843500    /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/request.rkt:31:0
2018-06-01 22:31:24.981843500    ...higher-order.rkt:357:33
2018-06-01 22:31:24.981844500    /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/private/dispatch-server-with-connect-unit.rkt:131:8
2018-06-01 22:36:25.559150500 about to suspend in atomic mode
2018-06-01 22:36:25.682688500 =============================================
2018-06-01 22:36:26.702934500   [loading /home/pkgserver/racket-pkg-website/src/site.rkt]
2018-06-01 22:36:26.738319500   [loading /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/compiled/servlet_rkt.zo]
2018-06-01 22:36:26.739589500   [loading /home/pkgserver/racket/share/pkgs/web-server-lib/web-server/http/compiled/bindings_rkt.zo]

Operation is normal until 22:31:24, at which point an exception is raised and the process becomes unresponsive for exactly five minutes, before spitting out "about to suspend in atomic mode" and finally responding to the signals delivered to it by the service restart that the cron job triggered.

This is not what I expected at all. Previously, I had seen past log entries remarking (repeatedly) that tcp-accept could not do its job because of a lack of file descriptors. But here we have no apparent fd issue, and instead a total lockup lasting exactly five minutes.

Are there two bugs, or just one? I suspect two. Hopefully as the service monitor script does its job we will learn more.

tonyg commented 6 years ago

It's locked up again right now - well, rather, it's unresponsive (same "custom-port-pipe-read" log entry as before, but no "about to suspend in atomic mode" yet) and using 100% CPU. Only about 52 fds open, nothing unusual. Attaching with GDB gives unhelpful results - probably in JITted code?

(gdb) bt
#0  0x00007ffe16a57d9c in ?? ()
#1  0x00007ffe16a1d748 in ?? ()
#2  0x00007ffe16a1d888 in ?? ()
#3  0x0000000000000001 in ?? ()
#4  0xbff0000000000000 in ?? ()
#5  0x0000000000000000 in ?? ()

Eventually, while I was in GDB, the process received a SIGTERM (presumably from svc?) and would have terminated if I hadn't been attached to it.

   0x7ffe16a57d84:      mov    %rdx,0x8(%rdi)
   0x7ffe16a57d88:      test   %rsi,%rsi
   0x7ffe16a57d8b:      jne    0x7ffe16a57f29
   0x7ffe16a57d91:      xor    %eax,%eax
   0x7ffe16a57d93:      test   %r10,%r10
   0x7ffe16a57d96:      jne    0x7ffe16a57d9c
   0x7ffe16a57d98:      mov    $0x60,%al
   0x7ffe16a57d9a:      syscall 
=> 0x7ffe16a57d9c:      add    $0x8,%rsp
   0x7ffe16a57da0:      pop    %rbx
   0x7ffe16a57da1:      pop    %r12
   0x7ffe16a57da3:      pop    %r13
   0x7ffe16a57da5:      pop    %r14
   0x7ffe16a57da7:      pop    %r15
   0x7ffe16a57da9:      pop    %rbp
   0x7ffe16a57daa:      retq   
(gdb) info registers
rax            0x0      0
rbx            0x7b     123
rcx            0xffffffffffffffff       -1
rdx            0x6cb38  445240
rsi            0x0      0
rdi            0x7ffe16a1d7c0   140729278126016
rbp            0x7ffe16a1d7b0   0x7ffe16a1d7b0
rsp            0x7ffe16a1d780   0x7ffe16a1d780
r8             0x5b12af40       1527951168
r9             0x1bd5960        29186400
r10            0x0      0
r11            0x246    582
r12            0xffffffffffdff000       -2101248
r13            0x0      0
r14            0x7005c034       1879425076
r15            0x2      2
rip            0x7ffe16a57d9c   0x7ffe16a57d9c
eflags         0x246    [ PF ZF IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
(gdb) info thread
  Id   Target Id         Frame 
* 1    process 32206 "live.rkt" 0x00007ffe16a57d9c in ?? ()
tonyg commented 6 years ago

OK - the symptoms I've seen so far, with the monitoring scripts in place, look more like #29 than this issue, so I'll continue analysis of that branch over there. I'll return here if/when the fd exhaustion thing manifests again.

guygastineau commented 3 years ago

I just found this issue, as I am having the same problem. I don't know exactly when it happened. I was expecting a prototype of an API/brain to still be running ~20 days later, but it wasn't.

Since I only had it running for a simple test, it was just in a tmux session. When I reattached to the session, it was blasting some "too many open files" message as fast as it could print to the tmux buffer. The weird thing is that access to the https server is restricted at another level by IP, so there is simply no reason for it to have had very many fds open at that time.

I reacted to this jarring scene with Ctrl-C before I had the good sense to investigate further. I think I read #29 earlier today; it appears that this is related to custom-port-pipe-read. In that issue (iirc) the problem was tentatively identified as related to the custom ports used for SSL. I am using certs directly with the Racket web server.

Do you, @tonyg, think that it might be related to the #:ssl functionality of the Racket web server library too?

Since I need my API in immediate use, I am going to handle the TLS with Nginx. If I haven't been back here in a year or so, feel free to ping me; if I haven't had any more of these crashes I might forget to come back and say so.
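For anyone taking the same workaround, a minimal sketch of terminating TLS in Nginx and proxying to a plain-HTTP Racket web server (the hostname, port, and certificate paths are placeholders, and the Racket side is assumed to be started without its #:ssl options):

```nginx
server {
    listen 443 ssl;
    server_name api.example.com;                          # placeholder

    ssl_certificate     /etc/ssl/example/fullchain.pem;   # placeholder
    ssl_certificate_key /etc/ssl/example/privkey.pem;     # placeholder

    location / {
        # Racket web server listening on plain HTTP, e.g. port 8080
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```

This sidesteps the Racket-side SSL port handling entirely, which (if the #29 hypothesis is right) should also sidestep the leak.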