sudoroom / sudo-infrastructure

Tracking issues related to sudoroom's infrastructure (web servers, wiki, mailing lists, etc)

sudoroom.org keeps going offline #6

Closed rcsheets closed 6 years ago

rcsheets commented 6 years ago

Over the past few days/weeks, we have noticed that our website stops responding pretty often.

rcsheets commented 6 years ago

Today this happened again, and I was able to grab a screenshot of the console (image: sudoroom-vps-broken-1).

Not sure if this was the same kind of thing @yardenac was seeing earlier.

rcsheets commented 6 years ago

The console responds to keypresses, but I can't get a getty to show up and SSH also isn't working, so I'm rebooting the box.

rcsheets commented 6 years ago

This has happened a couple more times today, and I've been able to collect a little more data. I've installed netdata on sudoroom.org and configured it to send collected data to room.sudoroom.org.

Digital Ocean's monitoring agent showed no significant spike in resource utilization (e.g. memory) immediately before a crash. However, it appears that the spike is simply too rapid to be captured by the Digital Ocean monitoring system, which only collects data once a minute. Netdata shows that the system goes from OK to broken in just a few seconds.

The most recent crash happened today. It began at 2:06:50 PST and had rendered the system unusable by 2:06:56 PST.
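
To illustrate why a once-a-minute collector can miss this entirely, here is a rough sketch using the crash window above (2:06:50 to 2:06:56). The function name, the idea that the collector fires on fixed ticks counted from midnight, and the 1-second interval used for comparison are assumptions made for illustration, not details of either monitoring tool.

```python
import math

def samples_in_window(onset_s, unusable_s, interval_s, phase_s=0):
    """Return the collector ticks (seconds since midnight) that fall in
    [onset_s, unusable_s), for a collector firing every interval_s seconds
    with an offset of phase_s. Hypothetical model, not a real agent's API."""
    # Index of the first tick at or after the onset of the spike.
    k = math.ceil((onset_s - phase_s) / interval_s)
    tick = phase_s + k * interval_s
    ticks = []
    while tick < unusable_s:
        ticks.append(tick)
        tick += interval_s
    return ticks

# Crash window from the comment above, as seconds since midnight.
ONSET = 2 * 3600 + 6 * 60 + 50      # 2:06:50
UNUSABLE = 2 * 3600 + 6 * 60 + 56   # 2:06:56

per_minute = samples_in_window(ONSET, UNUSABLE, interval_s=60)
per_second = samples_in_window(ONSET, UNUSABLE, interval_s=1)
print(len(per_minute), len(per_second))  # prints "0 6"

# Out of the 60 possible second-offsets for a per-minute collector,
# count how many would land at least one sample inside the window.
lucky = sum(1 for p in range(60) if samples_in_window(ONSET, UNUSABLE, 60, p))
print(lucky)  # prints "6"
```

Under these assumptions, a per-minute collector would catch the spike in only 6 of 60 possible alignments, which is consistent with the spike not appearing in the once-a-minute data.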

I'm still trying to understand what exactly went wrong.

rcsheets commented 6 years ago

It seems there's still not quite enough data to isolate exactly what went wrong. I believe this is because Apache doesn't log requests until the response is complete, and whatever is triggering this issue is probably a request that doesn't complete. I've enabled forensic logging, which records requests before they are processed. Next time the system crashes, we should be able to see which requests were outstanding at the time of the crash, assuming the log file gets written successfully. Since it seems to take at least a few seconds for the problematic requests to crash the system, I'm expecting we'll get useful forensic logs.
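
A minimal sketch of how outstanding requests could be pulled out of a forensic log, assuming the mod_log_forensic layout in which each request is written as a `+id|request|headers` line before processing and as a matching `-id` line once it completes. The function name and the sample entries are hypothetical.

```python
def outstanding_requests(lines):
    """Return {forensic_id: request_text} for every '+' entry that has no
    matching '-' entry, in the order the requests were first logged."""
    pending = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("+"):
            # '+id|request line|header:value|...' is written before processing.
            forensic_id, _, rest = line[1:].partition("|")
            pending[forensic_id] = rest
        elif line.startswith("-"):
            # '-id' is written when the request has completed.
            pending.pop(line[1:], None)
    return pending

# Hypothetical sample data, not taken from the real log.
sample = [
    "+100:aaaa:1|GET /a HTTP/1.1|Host:example.org",
    "+100:aaaa:2|GET /b HTTP/1.1|Host:example.org",
    "-100:aaaa:1",
]
for forensic_id, request in outstanding_requests(sample).items():
    print(forensic_id, request)
# prints "100:aaaa:2 GET /b HTTP/1.1|Host:example.org"
```

Any entry still pending at the end of the file is a request that had started but not finished when logging stopped.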

rcsheets commented 6 years ago

Crashed again at around 10:36 PST. I was tail -Fing forensic.log at the time of the crash, and there are some entries there that didn't survive the reboot. I've saved these at /home/rcsheets/forensic-log-lost-entries-2018-01-02-10-36.log. If there is a problematic request, it may be in that log.

rcsheets commented 6 years ago

According to the logs we have, the following requests were outstanding at the time of the crash (log entries truncated here for ease of readability in the issue log):

+31039:5a4bd12b:3768|GET / HTTP/1.1|Host:sudoroom.org
+31039:5a4bd12c:3769|GET /mediawiki/index.php?title=Mesh/Blog&action=feed&feed=atom HTTP/1.1|Host:sudoroom.org
+31039:5a4bd12e:376a|GET /wiki/User_talk%3a178.154.243.79 HTTP/1.1|Host:sudoroom.org
+31039:5a4bd12f:376b|GET /wp-content/uploads/2015/08/IMG_3195.jpg HTTP/1.1|Host:sudoroom.org
+31107:5a4bd130:5452|GET /?title=GIT_Version_Control_for_Non_Coders&action=edit&mobileaction=toggle_view_mobile HTTP/1.1|Host:sudoroom.org
+31039:5a4bd131:376c|GET / HTTP/1.1|Host:sudoroom.org
+31107:5a4bd134:5453|GET /?em_ess=1&event_id=5500 HTTP/1.1|Host:bayareapublicschool.org
+31039:5a4bd135:376d|GET /mediawiki/index.php?title=Special%3aWhatLinksHere/File%3aSudoRoom.png&limit=100&hidelinks=1 HTTP/1.1|Host:sudoroom.org
+31039:5a4bd136:376e|GET /lists/listinfo/members HTTP/1.1|Host:sudoroom.org
+31039:5a4bd143:376f|GET /pipermail/mesh/2013-July/000279.html HTTP/1.1|Host:sudoroom.org
+31107:5a4bd14d:5454|GET /wp-content/uploads/2014/10/agua-viva-ND-lispector-tobler.pdf HTTP/1.1|Host:bayareapublicschool.org
+31039:5a4bd151:3770|GET / HTTP/1.1|Host:sudoroom.org
+31039:5a4bd15c:3771|GET /?title=Hackpack&action=edit&oldid=7062 HTTP/1.1|Host:sudoroom.org
+31107:5a4bd15e:5455|GET /lists/listinfo/controllers HTTP/1.1|Host:sudoroom.org
+31107:5a4bd162:5456|GET / HTTP/1.1|Host:sudoroom.org
+31107:5a4bd16d:5457|GET / HTTP/1.1|Host:sudoroom.org
+31107:5a4bd16e:5458|HEAD / HTTP/1.1|host:sudoroom.org
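
The IDs at the start of each entry can be decoded to put the requests on a timeline. This is a sketch that assumes the ID layout is `pid:hex-unix-time:hex-counter`, which I believe is what mod_log_forensic generates when no unique ID is supplied by another module. That layout is an assumption, though the decoded time does line up with the crash at around 10:36 PST. The helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset, matching the PST times quoted in this thread.
PST = timezone(timedelta(hours=-8), "PST")

def decode_forensic_id(forensic_id):
    """Split 'pid:hextime:hexcounter' into (pid, aware UTC datetime, counter).
    The field meanings are an assumption about the ID format."""
    pid, hex_time, hex_counter = forensic_id.split(":")
    when = datetime.fromtimestamp(int(hex_time, 16), tz=timezone.utc)
    return int(pid), when, int(hex_counter, 16)

pid, when, counter = decode_forensic_id("31039:5a4bd12b:3768")
print(pid, when.astimezone(PST).isoformat(), counter)
# prints "31039 2018-01-02T10:36:27-08:00 14184"
```

Read this way, the listed entries span roughly 67 seconds, from 10:36:27 to 10:37:34 PST, and come from two Apache processes (31039 and 31107).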

matthewstewart commented 6 years ago

At the bottom of the log, after a bunch of GET requests, there appears to be a HEAD request to the root route just before the shutdown. Does anyone know whether this type of request is valid on our server? Is there evidence in previous logs of this type of request causing a crash?

rcsheets commented 6 years ago

@matthewstewart we get HEAD requests frequently from Uptimebot, and they usually don't cause a problem.

It's possible this isn't even triggered by a web request. We host other services on this box, such as mailman. Resource utilization by the list user has spiked right around the time of the last two crashes.

rcsheets commented 6 years ago

It's down again. This time, according to tail -F, there were no outstanding HTTP requests at the time the system stopped responding. Huge spike in disk reads and memory usage by the list user. I'm going to stop investigating web requests and start looking at what mailman is doing. Unfortunately, I don't know much about mailman.
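
For a quick look at which account is holding memory, something like the following could be run on the box. It is a sketch, not what netdata does: it assumes a `ps` that accepts `-e -o user= -o rss=` (as procps on Linux does) and reports resident set size in kilobytes. The function names and the sample figures are hypothetical.

```python
import subprocess
from collections import defaultdict

def rss_by_user(ps_output):
    """Sum the RSS column per user from 'user rss' lines of ps output."""
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        user, rss = parts
        totals[user] += int(rss)
    return dict(totals)

def snapshot():
    """Run ps once and return total RSS per user (not called automatically)."""
    # Separate -o options are needed: everything after '=' is header text.
    out = subprocess.run(
        ["ps", "-e", "-o", "user=", "-o", "rss="],
        capture_output=True, text=True, check=True,
    ).stdout
    return rss_by_user(out)

# Hypothetical sample output, not real measurements from the server.
sample_ps = "list 100\nlist 50\nwww-data 30\n"
print(rss_by_user(sample_ps))  # prints "{'list': 150, 'www-data': 30}"
```

Calling `snapshot()` in a loop with a short sleep would give a crude per-user timeline, though a spike that takes only a few seconds may still outrun it.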

rcsheets commented 6 years ago

While looking at the mailman logs, I discovered and worked around #7.

rcsheets commented 6 years ago

Since the workaround for #7, we have not had any more system crashes. I do not understand why, but I'm closing this issue. Please reopen it if we experience any similar crashes in the future.