slack-ruby / slack-ruby-bot

The easiest way to write a Slack bot in Ruby.
MIT License
1.12k stars 187 forks

Memory Leak? #190

Open grudzien opened 6 years ago

grudzien commented 6 years ago

I have searched the issues of both slack-ruby-bot and celluloid for reports of a memory leak and haven't found anything. I first noticed the problem running a Slack bot in AWS, where the bot would eventually leak enough memory to be OOM-killed by Ubuntu 16.04. We tried moving the bot from 256M to 512M to 1026M to 2048M, and no matter how much we gave it, the bot would eventually consume all the memory on the box. To simplify the issue, I took the standard Ubuntu 16.04 image from AWS, patched it, installed Ruby and the proper gems, and ran the ping bot. In the last 24 hours it has gone from 54M of RAM to 102M of RAM. Here are the traits I have noticed:

  1. The web socket never reconnects.
  2. The Ruby heap as reported by 'objspace' is not growing out of control (it's not growing at all).
  3. Running strace shows an mmap (allocation) of 524288 bytes every 2 minutes for the Slack health check.
  4. munmap is called during GC, freeing those 524288-byte blocks, but the process allocates far faster than it frees.
  5. The bot is completely idle otherwise.
  6. The only thing in the debug log is a celluloid read and write every two minutes.
  7. The bot leaks memory in a linear fashion for about 8 hours then flatlines for 10-24 hours then continues to leak.
  8. Happens inside and outside of docker.
  9. Tested on Ubuntu 16.04, 18.04, and Alpine Linux 2.
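
One way to put numbers on traits 2 to 4 is to log the process's resident set size next to Ruby's own view of its heap. The following is a minimal sketch, not part of slack-ruby-bot: the helper names `rss_kb` and `memory_sample` are made up for illustration, and the RSS reading assumes a Linux `/proc` filesystem (it returns nil elsewhere).

```ruby
require 'objspace'

# Hypothetical helper: resident set size in kilobytes, read from /proc (Linux only).
# Returns nil when the status file is not available.
def rss_kb(status_path = '/proc/self/status')
  return nil unless File.readable?(status_path)

  line = File.foreach(status_path).find { |l| l.start_with?('VmRSS:') }
  line && line[/\d+/].to_i
end

# Hypothetical helper: one sample of OS-level and Ruby-level memory figures.
def memory_sample
  {
    time: Time.now,
    rss_kb: rss_kb,                                   # what the OS sees
    ruby_heap_kb: ObjectSpace.memsize_of_all / 1024,  # what objspace sees
    live_slots: GC.stat[:heap_live_slots]             # live object slots
  }
end

sample = memory_sample
puts sample
```

If `rss_kb` keeps climbing across samples while `ruby_heap_kb` and `live_slots` stay flat, that would be consistent with growth outside the Ruby object heap (for example in the allocator or a native extension) rather than with retained Ruby objects.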

I am trying to avoid radical troubleshooting like jemalloc and recompiling ruby with more debugging. If anyone has any suggestions or has experience with this I would appreciate the help. I am about one or two more days from ditching the project.

My current install (I have tried three different versions of Ruby):

OS: Ubuntu 16.04 4.4.0-1061-aws #70-Ubuntu

Ruby: ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu]

Gems: activesupport (5.2.0) aws-eventstream (1.0.1) aws-partitions (1.94.0) aws-sdk-core (3.22.0) aws-sdk-dynamodb (1.8.0) aws-sigv4 (1.0.2) bigdecimal (1.2.8) binary_struct (2.1.0) bundler (1.11.2) celluloid (0.17.3) celluloid-essentials (0.20.5) celluloid-extras (0.20.5) celluloid-fsm (0.20.5) celluloid-io (0.17.3) celluloid-pool (0.20.5) celluloid-supervision (0.20.6) concurrent-ruby (1.0.5) contracts (0.16.0) did_you_mean (1.0.0) dry-configurable (0.7.0) dry-container (0.6.0) dry-core (0.4.7) dry-equalizer (0.2.1) dry-inflector (0.1.2) dry-logic (0.4.2) dry-types (0.13.2) dry-validation (0.12.0) faraday (0.15.2) faraday_middleware (0.12.2) gli (2.17.1) hashie (3.5.7) heapy (0.1.3) hitimes (1.3.0) httpclient (2.8.3) i18n (1.0.1) io-console (0.4.5) jmespath (1.4.0) json (2.1.0, 1.8.3) minitest (5.11.3, 5.8.4) molinillo (0.4.3) multipart-post (2.0.0) net-http-persistent (2.9.4) net-telnet (0.1.1) nio4r (2.3.1) power_assert (0.2.7) psych (2.0.17) rake (10.5.0) rdoc (4.2.1) slack-ruby-bot (0.10.5) slack-ruby-client (0.11.1) sysrandom (1.0.5) test-unit (3.1.7) thor (0.20.0, 0.19.1) thread_safe (0.3.6) timers (4.1.2) tss (0.5.0) tzinfo (1.2.5) websocket-driver (0.7.0) websocket-extensions (0.1.3) ztimer (0.6.0)

kstole commented 6 years ago

This sounds very similar to some issues I've experienced, although I haven't looked very far into them. I have SlackRubyBot running in AWS as well, and every so often the websocket will disconnect but the bot will stay running. So far I've just solved it by restarting the bot, but I'd really like to get to the bottom of this. There was also one time when the websocket appeared to stay connected (according to Slack) but the bot wasn't responding to requests, and when I checked the docker container, it said it was using 100% CPU (although maybe this was a one-off).

dblock commented 6 years ago

Likely related, https://github.com/slack-ruby/slack-ruby-client/issues/208

grudzien commented 6 years ago

I guess I should clarify my post. I stated the web socket is not reconnecting. What I meant was the bot is NOT disconnecting. I had thought it was a disconnect/reconnect issue, but that does not appear to be happening. It's just a linear memory leak. I am still going through #208 to see if there are similarities.

Edit: I have been tracking the source port number for the last day and a half and it hasn't changed.

dblock commented 6 years ago

Oh so you have a bot that's online just fine that's leaking memory? That's not good :) I would find a way to dump the difference and see what objects are leaking (could be something in your code too).
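
A rough sketch of what "dumping" could look like with Ruby's standard `objspace` library: write a full heap dump to a file at intervals and compare the files afterwards. The `dump_heap` helper and the file naming are assumptions for illustration. The output is one JSON record per line, which is the kind of input a heap-dump reader such as heapy (already in the gem list above) is meant to consume.

```ruby
require 'objspace'
require 'json'
require 'tmpdir'

# Record file and line for allocations made from here on, so that later
# dumps can say where each object was created.
ObjectSpace.trace_object_allocations_start

# Hypothetical helper: force a GC, then write a full heap dump
# (one JSON object per line) to `path`.
def dump_heap(path)
  GC.start
  File.open(path, 'w') { |io| ObjectSpace.dump_all(output: io) }
  path
end

dump_path = dump_heap(File.join(Dir.tmpdir, "heap-#{Process.pid}-#{Time.now.to_i}.json"))
first_record = JSON.parse(File.foreach(dump_path).first)
puts "wrote #{File.size(dump_path)} bytes; first record type: #{first_record['type']}"
```

Taking a dump shortly after startup and another some hours later, then comparing them, would show which objects survived between the two points, assuming the growth is in Ruby objects at all.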

dblock commented 6 years ago

I think https://stackoverflow.com/questions/20385767/finding-the-cause-of-a-memory-leak-in-ruby has pretty good information overall. I would aggressively force a GC (GC.start) somewhere in the code/library and start dumping what's allocated to see a pattern.
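
As a minimal sketch of that approach (the helper names `live_object_counts` and `growth`, and the `Leaky` class, are invented for this example): force a GC, count live objects per class, do it again later, and look at which classes grew.

```ruby
require 'objspace'

# Hypothetical helper: force a full GC, then count live objects per class.
def live_object_counts
  GC.start
  counts = Hash.new(0)
  ObjectSpace.each_object(Object) { |obj| counts[obj.class] += 1 }
  counts
end

# Hypothetical helper: classes whose live count grew between two snapshots,
# as [class, delta] pairs sorted with the largest growth first.
def growth(before, after, min: 1)
  diff = after.each_with_object({}) do |(klass, count), acc|
    delta = count - before.fetch(klass, 0)
    acc[klass] = delta if delta >= min
  end
  diff.sort_by { |_, delta| -delta }
end

# Demonstration with a deliberately retained set of objects.
Leaky = Class.new
before = live_object_counts
retained = Array.new(1_000) { Leaky.new } # kept alive by this local variable
after = live_object_counts

growth(before, after).first(5).each do |klass, delta|
  puts "#{klass}: +#{delta}"
end
```

In a running bot the two snapshots would be taken minutes or hours apart, for instance from a timer. If no class shows steady growth while process memory still climbs, the leak is probably not in retained Ruby objects.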