publiclab / plots2

a collaborative knowledge-exchange platform in Rails; we welcome first-time contributors! :balloon:
https://publiclab.org
GNU General Public License v3.0

Memory issues (leak?) investigation #5817

Closed jywarren closed 4 years ago

jywarren commented 5 years ago

We're seeing a persistent memory issue since one week ago on Saturday, and I'm compiling information about it here to investigate.

Wondering if it's related to this controller method for the dashboard.

https://www.skylight.io/app/applications/GZDPChmcfm1Q/1559320320/1d/endpoints/HomeController%23dashboard?responseType=html

jywarren commented 5 years ago

Noting @icarito's comment:

I wonder, @jywarren, because I had edited docker-compose-production.yml to use fewer processes (I didn't make a PR for it). So it could be that we just made it fit that way.

And this graph:

[graph: memory usage]

jywarren commented 5 years ago

We're seeing a lot of SMTP test errors too:

Link: | https://intelligence.rackspace.com/cloud/entities/en45StuOyk/checks/chXoX9GHhF/alarm/alycd3HZyu

icarito commented 5 years ago

Yes, load is very high too. From htop and especially iotop it appears mailman is quite active. It's the culprit for sure! Prior to May 22nd we ran it a few times a day - if we can run it every few minutes or so (not every second!) it would be fine!

I, [2019-05-07T23:56:44.702410 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-08T21:33:03.762360 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-09T07:47:27.518491 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-09T08:18:47.825703 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-10T08:14:53.010705 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-10T21:45:50.739207 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-11T17:38:51.647335 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-13T03:33:15.682877 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-14T05:51:40.603184 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-14T05:53:20.857041 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-14T05:55:00.356772 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-14T05:56:40.487219 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-15T01:43:42.908744 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-16T10:13:45.703985 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-18T12:57:16.194957 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:49:27.019569 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:49:55.827419 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:50:18.722700 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:50:41.709075 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:51:00.124271 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:51:17.146210 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:51:33.745494 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:51:51.387282 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:52:09.145006 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:52:31.266559 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:53:03.176998 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:53:26.991989 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:53:54.074275 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:54:13.905343 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:54:37.736641 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:54:57.357057 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:55:15.522535 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:55:34.343241 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:55:51.964241 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:56:10.016964 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:56:42.822692 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:56:59.826809 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:57:16.178517 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:57:35.871196 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:57:59.731422 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:58:16.353160 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:58:33.608591 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:58:50.037296 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:59:06.912680 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:59:32.287362 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:59:59.201948 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:00:18.739067 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:00:42.144910 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:01:03.495556 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:01:20.493712 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:01:37.089192 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:01:53.921571 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:02:14.509227 #1]  INFO -- : Mailman v0.7.0 started

[image]

icarito commented 5 years ago

The log is filled with cycles of these, no error:

I, [2019-06-02T02:35:26.270644 #1]  INFO -- : Mailman v0.7.0 started
I, [2019-06-02T02:35:26.270851 #1]  INFO -- : Rails root found in ., requiring environment...
I, [2019-06-02T02:35:56.930267 #1]  INFO -- : POP3 receiver enabled (notifications@publiclab.org@pop.gmail.com).
I, [2019-06-02T02:35:56.938850 #1]  INFO -- : Polling enabled. Checking every 5 seconds.
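
For context, the polling cadence above comes from the Mailman configuration in the mailman_server script. A minimal sketch of that kind of setup, assuming the standard Mailman 0.7 API (credentials and the handler body are placeholders, not the exact plots2 script):

require 'mailman'

Mailman.config.poll_interval = 5   # seconds between POP3 checks ("Checking every 5 seconds")
Mailman.config.pop3 = {
  server:   'pop.gmail.com',
  port:     995,
  ssl:      true,
  username: ENV['NOTIFICATION_EMAIL'],      # e.g. notifications@publiclab.org (placeholder env vars)
  password: ENV['NOTIFICATION_PASSWORD']
}

Mailman::Application.run do
  default do
    # hand the raw message off to the app, e.g. Comment.receive_mail(message)
  end
end
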
icarito commented 5 years ago

Looks like mailman is crashing and being immediately respawned!

icarito@rs-plots2:/srv/plots_container/plots2$ docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED      STATUS          PORTS                      NAMES
8d13c675568e        containers_mailman   "script/mailman_serv…"   4 days ago   Up 14 seconds                              containers_mailman_1
f423dec91ebe        containers_web       "/bin/bash -c 'sleep…"   4 days ago   Up 4 days       127.0.0.1:4001->4001/tcp   containers_web_1
24f7b43efebc        containers_sidekiq   "bundle exec sidekiq…"   4 days ago   Up 4 days                                  containers_sidekiq_1
070511ab43d1        redis:latest         "docker-entrypoint.s…"   4 days ago   Up 4 days       6379/tcp                   containers_redis_1
6ea8f0498b2c        mariadb:10.2         "docker-entrypoint.s…"   4 days ago   Up 3 days       3306/tcp                   containers_db_1

I've decided to stop this container for tonight in order to monitor the effect on performance.

jywarren commented 5 years ago

I think we should also look at which gem updates were merged in the days leading up to this code publication. Thanks!

jywarren commented 5 years ago

That's so weird about mailman, I will look at the config but I don't remember any changes to the rate.

jywarren commented 5 years ago

Oh you know what? We set it to retry 3 times. Maybe these are overlapping now? It could at least have increased the rate of attempts since it retries 3 times for every scheduled run.

https://github.com/publiclab/plots2/blob/faf66c0b15473add33c10c47d57a6e7cc46ea669/script/mailman_server#L32
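
For reference, the shape of that wrapper is roughly the following (a simplified sketch, not the exact linked script); the point is that each poll can fan out into several closely spaced attempts:

# Simplified sketch of a rescue/retry wrapper around each polled message.
# With 3 retries, one poll can become up to 4 attempts.
retries = 0
begin
  process_message(message)    # placeholder for the real handler
rescue StandardError => e
  retries += 1
  retry if retries <= 3
  raise
end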

jywarren commented 5 years ago

OK, modified it to 20 seconds, which should mean at most a retry every 5 seconds --

https://github.com/publiclab/plots2/commit/a40ea5650f2ce9ec80ee2324cea2d8c9bd98e382

That'll be the same rate as before we added retries.

jywarren commented 5 years ago

OK, now working on analysis after a few hours:

https://oss.skylight.io/app/applications/GZDPChmcfm1Q/1559574420/6h/endpoints

Screen Shot 2019-06-03 at 4 36 39 PM

Overall looks good. But, on closer look, it's ramping up in load time:

Screen Shot 2019-06-03 at 4 37 03 PM

Comparing the latter portion where it's starting to go back up:

Screen Shot 2019-06-03 at 4 37 41 PM

to the earlier period just after the reboot:

Screen Shot 2019-06-03 at 4 37 51 PM

And then to this from a couple weeks ago before all our trouble:

Screen Shot 2019-06-03 at 4 38 42 PM

Then finally just after we started seeing issues on the 22-23rd of May:

Screen Shot 2019-06-03 at 4 39 15 PM

Overall it's not conclusive.

Resources:

One of the tough things about this is that it's right around where these two commits happened:

  1. disabling caching on profiles (which we later reverted): https://github.com/publiclab/plots2/commit/794df370264a605935483aa8d0e4946bfd14c37c
  2. the container build process change: https://github.com/publiclab/plots2/commit/794df370264a605935483aa8d0e4946bfd14c37c

I'd like to think it relates to the addition of the retry 3 times code in https://github.com/publiclab/plots2/commit/2bc7b498ef3a05bc090ef26f316a30ec0104bcc6, which I tried tweaking today. But actually load times are still slowly growing.

This could mean that a) something else is driving it, or b) the "rescue/retry" cycle itself could be causing memory leak buildup?

jywarren commented 5 years ago

Shall I comment out the rescue/retry code entirely?

Maybe the hanging while waiting for MySQL to respond is actually tying up threads?

I'll try this. Site is almost unresponsive.

I removed the retry here: https://github.com/publiclab/plots2/commit/faa5a12e86bf7944dca43134f649947f03ca96a6

Deploying... it'll take a while.

jywarren commented 5 years ago

Hmm it really doesn't seem solved... https://oss.skylight.io/app/applications/GZDPChmcfm1Q/1559577660/8h13m/endpoints

jywarren commented 5 years ago

OK, I wonder if the container setup change affected the mailman container at all? Because at this point we've reverted all the likely suspects in the mailman script.

jywarren commented 5 years ago

OK, overnight it peaked and went back down a bit. But our problematic endpoints are still quite high, with peaks at about 20 seconds:

image

jywarren commented 5 years ago

The stats range calls are taking over 40 seconds!

jywarren commented 5 years ago

They're also taking forever on cache generation:

image

Could we be seeing an issue with the cache read/write?

jywarren commented 5 years ago

@icarito could there be an issue with read/write IO or something during cache generation? I'm just not sure why it would take this long to pack all the data into the cache.

jywarren commented 5 years ago

Leaky gems -- check off if we're OK

Non-leaky but memory issues in any case:

jywarren commented 5 years ago

I'm still seeing this massive cache generation time for stats_controller#range and wondering if we need to change where the cache is stored. It looks like the default is file storage (and I checked: we have cache files in /plots2/tmp/cache/). Would we be helped at all by switching to in-memory caching or memcached, both of which are apparently pretty simple changes?

https://guides.rubyonrails.org/v3.2/caching_with_rails.html#activesupport-cache-memorystore
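
For reference, switching the store would be a one-line change in the environment config. A sketch only (not the current plots2 configuration; the memcached host is a placeholder):

# config/environments/production.rb
# In-process memory store (fast, but not shared between Passenger workers):
config.cache_store = :memory_store, { size: 64.megabytes }
# Or memcached via the dalli gem (shared across processes/containers):
# config.cache_store = :mem_cache_store, 'memcached:11211'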

image

jywarren commented 5 years ago

Also looking at https://www.skylight.io/support/performance-tips

jywarren commented 5 years ago

I'll look at the email configuration now but if it doesn't yield anything I'll merge this, turning off the begin/rescue loop: #5840

jywarren commented 5 years ago

OK our next step for https://github.com/publiclab/plots2/pull/5841 is to develop a monitoring strategy for if mailman goes down.

jywarren commented 5 years ago

Deploying with the new email credentials, AND the begin/rescue removal. However, I think it's worth redeploying with the begin/rescue reinstated if the memory leak is solved, because it could have been the email credential issues.

jywarren commented 5 years ago

Latest error:

mailman_1  | /app/app/models/comment.rb:265:in `add_comment': undefined method `body' for nil:NilClass (NoMethodError)
mailman_1  |   from /app/app/models/comment.rb:218:in `receive_mail'
mailman_1  |   from script/mailman_server:31:in `block (2 levels) in <main>'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/router.rb:66:in `instance_exec'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/router.rb:66:in `route'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/message_processor.rb:23:in `block in process'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/middleware.rb:33:in `block in run'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/middleware.rb:38:in `run'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/message_processor.rb:22:in `process'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/receiver/pop3.rb:43:in `block in get_messages'
mailman_1  |   from /usr/local/lib/ruby/2.4.0/net/pop.rb:666:in `each'
mailman_1  |   from /usr/local/lib/ruby/2.4.0/net/pop.rb:666:in `each_mail'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/receiver/pop3.rb:42:in `get_messages'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:133:in `block in polling_loop'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:130:in `loop'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:130:in `polling_loop'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:83:in `run'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:11:in `run'
mailman_1  |   from script/mailman_server:22:in `<main>'

That's here:

https://github.com/publiclab/plots2/blob/e62bb49e30df79a9ddca5300579b80ff0903e3f4/app/models/comment.rb#L265
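
The failing call is .body on a receiver that can be nil; the general shape of the guard is something like this (an illustrative sketch only -- the real fix is in the commit linked in the next comment, and the names here are stand-ins):

# app/models/comment.rb (sketch) -- bail out before calling .body on nil
def add_comment(mail)
  return false if mail.nil? || mail.body.nil?
  # ... existing comment-creation logic continues here ...
end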

jywarren commented 5 years ago

Fixed that! https://github.com/publiclab/plots2/commit/c02b5600fa512108d44b9163f44764a9552bbb73

jywarren commented 5 years ago

Ugh, finally really publishing the comment.rb fix...

jywarren commented 5 years ago

OK, we're waiting to see if the email queue flushes out and we see some return to normalcy then...

jywarren commented 5 years ago

I left a comment on https://publiclab.org/notes/mimiss/06-04-2019/workshop-viii to test

icarito commented 5 years ago

Hi @jywarren I've been giving this a second look and have a theory.

First, here is a graph of RAM use for the past 3 months: [graph]

It is apparent from this graph that we have been growing in memory consumption for the past three months!

I went back a whole year: [graph]

Apparently, in 2019, our application has increased its memory requirements quite a bit.

The theory is that, following the memory-consumption trajectory we've been on, we have reached a threshold where available RAM is exhausted and we have begun relying on swap, which is slowing things down considerably.

The memory increase could well track the size of some of our tables (rusers, I'm looking at you). This may be related to #5524.

We will have to implement some optimizations, migrate the database to a different host, or add more RAM.

Pruning the database of spam users is also highly recommended.

icarito commented 5 years ago

I'm still leaning towards memory exhaustion due to app/site growth, which is causing high IO load due to swap memory "thrashing" to disk. I've checked our passenger-memory-stats from the web container and think that we can further reduce the process pool: [screenshot]

I will try this as a first move to remediate performance.

icarito commented 5 years ago

I found that in Feb 2018 we had calculated that we could run 11 processes (because our app took 500 MB to run). The formula is:

max_app_processes = (TOTAL_RAM * 0.75) / RAM_PER_PROCESS
                  = 6000 MB / 750 MB
                  = 8

but we are also running skylightd, plus a process for fetching tweet comments, plus Sidekiq, and we also want to run the mailman process. The majority of RAM use is in the web container: [screenshot]

From both of the above images I gather that we can still spare one process, especially if this will make responses quicker.

Moving to a pool size of 4 processes.

icarito commented 5 years ago

First optimization done. Promising first 30 minutes! [graph]

jywarren commented 5 years ago

Ooh!


jywarren commented 5 years ago

OK, so the mitigations list would be:

skilfullycurled commented 5 years ago

Hey @jywarren and @icarito,

First (and I say this without joking): this thread actually turned out to be quite a good read. It had all the elements: a mystery, a hunt, dead ends, close calls, etc.

Anyway.

Regarding the rusers table relative to #5450 and #5524, there is an enormous grouping in rusers that occurred between 4/26/13 and 1/1/14.

4/26/2013: 1366934400
1/1/2014: 1388534400
UID range: 59627 - 420114
Users: 360466

Do you want to try the first query of the test run you described in #5450 on a portion of that group?

users who have not posted any node, comment, like, or subscription and have not logged in ever

As you said, this would be an easy query since not logging in ever would cover all of the criteria that came before it.
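
A rough ActiveRecord sketch of that first query, restricted to the UID block above (the User/Node/Comment model names exist in plots2, but the author and "never logged in" column names here are assumptions and would need checking against the rusers schema before deleting anything):

# Sketch only -- verify column names against the actual schema first.
uid_range = 59627..420114                              # the 2013-2014 block above
candidates = User.where(id: uid_range)
                 .where(last_login: nil)               # assumed "never logged in" column
                 .where.not(id: Node.select(:uid))     # never posted a node (uid assumed)
                 .where.not(id: Comment.select(:uid))  # never commented (uid assumed)
puts candidates.count                                  # inspect the count before any deletion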

skilfullycurled commented 5 years ago

For reference on a portion equivalent in size to the last 6 months you proposed in the other email: in the last month we've marked ~250 first-time postings as spam. So let's say that in the last 6 months we had ~1,500 users banned for spam.

Oh, and I guess that brings up a good point. If you want to rid yourself of spam users, you can just find all of the content marked as spam and then delete the users who posted it.

As was briefly touched on in one of the issues, it might be good to have users whose first-time content is marked as spam immediately deleted from the database.

icarito commented 5 years ago

Hi @skilfullycurled, thank you for your input! So a majority of rusers is from 2013-2014. That tells me that while pruning it can help reduce RAM usage, our major tables are actually rsessions and impressions.

[image]

rsessions is over 30 GB. @jywarren and @skilfullycurled - it would be great to come up with a strategy to reduce this and/or optimize queries using this table!

icarito commented 5 years ago

Also, I think memcached isn't a good fit for this issue, as it would consume more RAM, not less.

icarito commented 5 years ago

That said, one can limit the memory use of memcached, so I'll still try it!

icarito commented 5 years ago

Nope, from the docs above:

If you’re running multiple Ruby on Rails server processes (which is the case if you’re using mongrel_cluster or Phusion Passenger), then your Rails server process instances won’t be able to share cache data with each other. This cache store is not appropriate for large application deployments, but can work well for small, low traffic sites with only a couple of server processes or for development and test environments.

icarito commented 5 years ago

Looks like it's not too hard to solve rsessions: https://stackoverflow.com/questions/10088619/how-to-clear-rails-sessions-table

@jywarren let's do this!

skilfullycurled commented 5 years ago

@icarito, I'm not sure this was ever done, but I had access to the database in 2016 and I notified everyone that the user sessions took up far more space than the rest of the database. I was told they'd be flushed, so either they were not, or the problem remains that the database just keeps accumulating sessions.

skilfullycurled commented 5 years ago

To give a sense of scale: as of 2016, the plots database compressed as bz2 was 1.9 GB (no time right now to decompress for the actual size); uncompressed, with the sessions removed, it was 518 MB.

icarito commented 5 years ago

Thanks @skilfullycurled!!! I think I remember your input from 2016 - I don't know how we missed flushing that - but our database dumps today are over 8 GB compressed, likely mostly sessions. I'll wait for confirmation from @jywarren; I'd like to run the following in production today, and then we can make it into a rake task or a cron job:

DELETE FROM sessions WHERE updated_at < DATE_SUB(NOW(), INTERVAL 1 DAY);

skilfullycurled commented 5 years ago

I got too curious: the uncompressed file is 6.8 GB, so minus the 518 MB that puts sessions at about 6.3 GB. 😆

skilfullycurled commented 5 years ago

The rsessions table is actually my favorite dataset that I have. It's completely use-less, but I just love that it's as large as, if not larger than, datasets I have that are use-ful! If anyone has ideas for what to do with it, let me know!

jywarren commented 5 years ago

icarito (@icarito:matrix.org): got it from here https://stackoverflow.com/questions/10088619/how-to-clear-rails-sessions-table
icarito (@icarito:matrix.org): it should log out every session that's not been active in the past day or week - we can tweak it

Just taking notes here. Sounds great.

icarito commented 5 years ago

Unstable (staging) seems to take a loong time... we can try

DELETE ... FROM ... WHERE ... LIMIT x

and execute it as many times as needed in production.
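
A sketch of how that could look as a rake task that runs the batched delete until no rows remain (the task name and batch size are made up; the SQL mirrors the statement above, and the table may actually be named rsessions as noted elsewhere in this thread):

# lib/tasks/sessions.rake -- sketch only, not an existing plots2 task.
namespace :sessions do
  desc 'Delete stale session rows in small batches'
  task prune: :environment do
    batch_size = 10_000
    loop do
      deleted = ActiveRecord::Base.connection.delete(
        "DELETE FROM sessions " \
        "WHERE updated_at < DATE_SUB(NOW(), INTERVAL 1 DAY) " \
        "LIMIT #{batch_size}"
      )
      puts "deleted #{deleted} rows"
      break if deleted.zero?
      sleep 1   # brief pause so the batches don't monopolize the database
    end
  end
end

Bounded batches keep each statement short and avoid the hours-long single delete seen on staging, though the fragmentation issue described in the next comment still applies.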

icarito commented 5 years ago

After 7 hours this is still ongoing in staging. Clearly we don't want to do this in production as one single batch. Another thing: after deletion, the rsessions table will be fragmented and its file size will not decrease. The table needs to be dumped and recreated in order to release server resources.

icarito commented 5 years ago

My plan for doing this is the following: