upnext / BeaconControl

Setup and manage large beacon deployments with BeaconControl open source platform
https://beaconcontrol.io
BSD 3-Clause "New" or "Revised" License

App freezing when deployed in load-balanced setting #37

Open otrebmuh opened 7 years ago

otrebmuh commented 7 years ago

Hello,

I am trying to set up BeaconControl in a load-balanced setting. To do so, I "dockerized" the application into three containers: one for the app, one for sidekiq and one for redis. I deploy one set of these three containers per virtual machine in the cloud. I have changed the server from WEBrick to thin. The database is PostgreSQL, deployed in another docker container on a separate virtual machine. I have also put nginx in front as the proxy / load balancer.

Initial tests with 4 VMs in this configuration gave good results (1200 event requests per minute to the API). However, I have a problem: after some time, the thin servers in the virtual machines seem to freeze and stop responding. After a while they unfreeze, but from that point on they freeze again very easily.

I SSH'd into one of the virtual machines where thin had frozen and then recovered, and attached `strace -s 99 -ffp (pid)` to capture a trace in case it froze again, which it did.

What I see is the following:

`[pid 25740] ppoll([{fd=12, events=POLLIN|POLLPRI}], 1, {0, 999940000}, NULL, 8) = 0 (Timeout)
[pid 25740] clock_gettime(CLOCK_MONOTONIC_RAW, {117303, 100765800}) = 0
[pid 25740] clock_gettime(CLOCK_MONOTONIC_RAW, {117303, 100850300}) = 0
[pid 25740] clock_gettime(CLOCK_MONOTONIC_RAW, {117303, 100927600}) = 0
[pid 25740] clock_gettime(CLOCK_MONOTONIC, {117303, 107380552}) = 0
[pid 25740] ppoll([{fd=12, events=POLLIN|POLLPRI}], 1, {0, 999923000}, NULL, 8) = 1 ([{fd=12, revents=POLLIN}], left {0, 166821281})
[pid 25740] epoll_wait(12, {{EPOLLIN, {u32=109458896, u64=94502274872784}}}, 4096, 0) = 1
[pid 25740] accept4(13, {sa_family=AF_INET, sin_port=htons(40552), sin_addr=inet_addr("40.112.151.221")}, [16], SOCK_CLOEXEC) = 14
[pid 25740] fcntl(14, F_GETFD) = 0x1 (flags FD_CLOEXEC)
[pid 25740] fcntl(14, F_SETFD, FD_CLOEXEC) = 0
[pid 25740] fcntl(14, F_GETFL) = 0x2 (flags O_RDWR)
[pid 25740] fcntl(14, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 25740] setsockopt(14, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 25740] accept4(13, 0x7ffe872f5280, [16], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
[pid 25740] accept(13, 0x7ffe872f5280, [16]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 25740] clock_gettime(CLOCK_MONOTONIC_RAW, {117303, 935105800}) = 0
[pid 25740] epoll_ctl(12, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=96991248, u64=94502262405136}}) = 0
[pid 25740] clock_gettime(CLOCK_MONOTONIC_RAW, {117303, 935203000}) = 0
[pid 25740] clock_gettime(CLOCK_MONOTONIC, {117303, 941658869}) = 0
[pid 25740] ppoll([{fd=12, events=POLLIN|POLLPRI}], 1, {0, 165647000}, NULL, 8) = 1 ([{fd=12, revents=POLLIN}], left {0, 165644900})
[pid 25740] epoll_wait(12, {{EPOLLIN, {u32=96991248, u64=94502262405136}}}, 4096, 0) = 1
[pid 25740] read(14, "GET / HTTP/1.0\r\nHost: backend\r\nConnection: close\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla"..., 16384) = 1395
[pid 25740] getpeername(14, {sa_family=AF_INET, sin_port=htons(40552), sin_addr=inet_addr("xx.xx.xx.xx")}, [16]) = 0
[pid 25740] stat("/var/www/BeaconControl/public/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 25740] stat("/var/www/BeaconControl/public/.html", 0x7ffe872ef370) = -1 ENOENT (No such file or directory)
[pid 25740] stat("/var/www/BeaconControl/public/index.html", 0x7ffe872ef370) = -1 ENOENT (No such file or directory)
[pid 25740] clock_gettime(CLOCK_MONOTONIC, {117303, 942849266}) = 0
[pid 25740] clock_gettime(CLOCK_REALTIME, {1478135093, 863144723}) = 0
[pid 25740] clock_gettime(CLOCK_REALTIME, {1478135093, 863271823}) = 0
[pid 25740] open("/var/www/BeaconControl/lib/active_record/statement_cache.rb", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 25740] open("/var/www/BeaconControl/vendor/active_record/statement_cache.rb", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 25740] open("/var/www/BeaconControl/app/forms/active_record/statement_cache.rb", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 25740] open("/var/www/BeaconControl/app/validators/active_record/statement_cache.rb", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 25740] open("/var/www/BeaconControl/app/uploaders/active_record/statement_cache.rb", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

[pid 25740] open("/usr/local/bundle/gems/ansi-1.5.0/lib/active_record/statement_cache.rb", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 25740] open("/usr/local/bundle/gems/addressable-2.3.6/lib/active_record/statement_cache.rb", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 25740] open("/usr/local/bundle/gems/activerecord-4.2.0/lib/active_record/statement_cache.rb", O_RDONLY|O_CLOEXEC) = 16
[pid 25740] fstat(16, {st_mode=S_IFREG|0644, st_size=3106, ...}) = 0
[pid 25740] close(16) = 0
[pid 25740] getuid() = 0
[pid 25740] geteuid() = 0
[pid 25740] getgid() = 0
[pid 25740] getegid() = 0
[pid 25740] open("/usr/local/bundle/gems/activerecord-4.2.0/lib/active_record/statement_cache.rb", O_RDONLY|O_CLOEXEC) = 16
[pid 25740] fstat(16, {st_mode=S_IFREG|0644, st_size=3106, ...}) = 0
[pid 25740] ioctl(16, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffe872ee2c0) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 25740] read(16, "module ActiveRecord\n\n # Statement cache is used to cache a single statement in order to avoid crea"..., 8192) = 3106
[pid 25740] read(16, "", 8192) = 0
[pid 25740] close(16) = 0
[pid 25740] lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 25740] lstat("/usr/local", {st_mode=S_IFDIR|S_ISGID|0775, st_size=4096, ...}) = 0
[pid 25740] lstat("/usr/local/bundle", {st_mode=S_IFDIR|S_ISGID|0777, st_size=4096, ...}) = 0
[pid 25740] lstat("/usr/local/bundle/gems", {st_mode=S_IFDIR|S_ISGID|0755, st_size=4096, ...}) = 0
[pid 25740] lstat("/usr/local/bundle/gems/activerecord-4.2.0", {st_mode=S_IFDIR|S_ISGID|0755, st_size=4096, ...}) = 0
[pid 25740] lstat("/usr/local/bundle/gems/activerecord-4.2.0/lib", {st_mode=S_IFDIR|S_ISGID|0755, st_size=4096, ...}) = 0
[pid 25740] lstat("/usr/local/bundle/gems/activerecord-4.2.0/lib/active_record", {st_mode=S_IFDIR|S_ISGID|0755, st_size=4096, ...}) = 0
[pid 25740] lstat("/usr/local/bundle/gems/activerecord-4.2.0/lib/active_record/statement_cache.rb", {st_mode=S_IFREG|0644, st_size=3106, ...}) = 0
[pid 25740] sendto(15, "Q\0\0\0\rSELECT 1\0", 14, MSG_NOSIGNAL, NULL, 0) = 14
[pid 25740] poll([{fd=15, events=POLLIN|POLLERR}], 1, 4294967295`

Curiously, the freezing only seems to happen to the BeaconControl instances that connect via network to the database. As I said previously, in the same VM where the DB container runs I have another instance of BeaconControl (only used for test purposes) and this one does not seem to freeze.

Also, if I restart the docker container with the app, it works again until some time passes and then the same situation repeats.

I am not an expert on Ruby and I would really appreciate if someone can shed some light on this.

Thank you and best regards,

Humberto

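A note on reading the trace above: its last line is a `poll` on fd 15 with no return value, issued right after `SELECT 1` was written to the same descriptor. The timeout `4294967295` is `-1` printed as an unsigned 32-bit value, which `poll` treats as "wait forever", so the process appears to be blocked waiting for a reply on that socket (presumably the PostgreSQL connection, though the trace alone does not prove it). The following Ruby sketch, using a hypothetical helper name `unfinished_syscalls`, picks such never-returned calls out of a list of strace lines:

```ruby
# Matches the " = <result>" suffix that strace prints once a syscall returns.
RESULT = /\)\s+=\s+(-?\d+|0x\h+|\?)/

# Hypothetical helper: keep only the lines whose syscall has no result yet.
def unfinished_syscalls(lines)
  lines.map(&:strip).reject(&:empty?).reject { |line| line.match?(RESULT) }
end

# Two lines copied from the end of the trace (single quotes keep the backslashes literal).
trace = [
  '[pid 25740] sendto(15, "Q\0\0\0\rSELECT 1\0", 14, MSG_NOSIGNAL, NULL, 0) = 14',
  '[pid 25740] poll([{fd=15, events=POLLIN|POLLERR}], 1, 4294967295'
]

blocked = unfinished_syscalls(trace)
puts blocked

# Reinterpret the printed timeout as a signed 32-bit integer.
timeout = [4294967295].pack("L").unpack1("l")
puts "poll timeout as signed int: #{timeout}"
```

Running `lsof -p (pid)` or looking at `/proc/(pid)/fd/15` on the affected VM would confirm which remote endpoint fd 15 is connected to.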
jkurdel commented 7 years ago

Hi Humberto

My guess is that the problem is too many open connections to the database. Could you provide your web application log? Do you see any ActiveRecord::ConnectionTimeoutError errors in it?
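As background for this question: ActiveRecord raises `ActiveRecord::ConnectionTimeoutError` when a thread cannot check a connection out of its process's pool in time, which is separate from the database server's own connection limit. A rough way to reason about both is sketched below in Ruby. Every number and name in it (`Worker`, the thread counts, the pool sizes, the server limit of 100) is an illustrative assumption, not a value taken from this deployment:

```ruby
# One entry per process type running on each VM (hypothetical figures).
Worker = Struct.new(:name, :threads, :pool)

# A process can exhaust its own pool when more threads want a
# connection at once than the pool holds.
def starved?(worker)
  worker.threads > worker.pool
end

# Worst-case number of connections all VMs can open against the shared database.
def total_connections(workers, vms)
  vms * workers.sum(&:pool)
end

workers = [
  Worker.new("thin", 1, 5),     # assumed: one thread, pool of 5
  Worker.new("sidekiq", 25, 5)  # assumed: 25 job threads sharing a pool of 5
]
vms = 4
server_limit = 100 # assumed PostgreSQL max_connections

workers.each do |w|
  puts "#{w.name}: pool too small for #{w.threads} threads" if starved?(w)
end

total = total_connections(workers, vms)
puts "total #{total} of #{server_limit} server connections"
```

Substituting the real `pool` value from `config/database.yml`, the actual thread counts, and the server's `max_connections` shows which of the two limits is the tighter one.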

By the way, errors like `[pid 25740] open("/var/www/BeaconControl/lib/active_record/statement_cache.rb", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)` are connected with the way Rails loads constants.

Best, Jan

otrebmuh commented 7 years ago

Hello Jan,

Thank you very much for your response. It seems I am indeed getting some ActiveRecord::ConnectionTimeoutError errors. Do you have a suggestion on how to address this problem?

Also, I would like to ask you, is this problem the result of having multiple instances of BeaconControl connected to the same DB? If the answer is yes, then does this mean that such a configuration is not supported by BeaconControl?

Sorry for asking so many questions and thank you very much in advance for your answers.

Best regards,

Humberto