hetio / hetionet

Hetionet: an integrative network of disease
https://neo4j.het.io

neo4j website down #48

Closed jromanowska closed 2 years ago

jromanowska commented 2 years ago

Hi, I've been trying to reach Hetionet on the Neo4j platform (https://neo4j.het.io/browser/), but it doesn't load. I've tried several web browsers, on both Linux and Windows. It just spins and shows nothing.

dhimmel commented 2 years ago

Ah, thanks for the heads up. I thought we had created an uptime check in https://github.com/hetio/hetionet/issues/45#issuecomment-1164970271 that would restart the instance if it became unresponsive like this.

Tagging @falquaddoomi, who helped last time. I can restart the instance, but it might be good to leave it in this error state for now so we can make sure the uptime check detects it. (@falquaddoomi no rush, don't interrupt your weekend.)

falquaddoomi commented 2 years ago

Sorry for the trouble you've been having with the service, @jromanowska. Also, hey @dhimmel; we do have an uptime check set up for the Neo4j instance, but it only reports that the instance is inaccessible; it doesn't reboot it. It's also unfortunately very noisy, so it's hard to tell when a real outage is occurring versus a transient network issue on Google's side. Since no one had complained, I'd assumed these were just transient issues, but apparently not. I'll look into them as soon as they come up from now on.

After looking into the logs a bit today, it seems the Neo4j instance hits a series of out-of-memory exceptions that leave it unable to fully service requests. Oddly, it will still serve static resources, just with very high (30+ second) latency. I'm going to try bumping up the RAM on the instance, and I'll also add a daemon on the machine itself that checks whether https://neo4j.het.io/browser/ is responsive and reboots the Docker container if it isn't. I'll keep investigating why this is happening, since if there's a memory leak, what I proposed will only delay the outages, not eliminate them.

Perhaps we keep this issue open for a week or so to see whether the problem is resolved, and close it after that?

falquaddoomi commented 2 years ago

Just FYI, I've put in a monitoring script that will reboot the Neo4j container if https://neo4j.het.io/browser/ takes longer than 30 seconds to return, or if it returns a non-200 response. I've also increased the RAM on the instance from 8 GB to 12 GB, and I'll be watching the logs and the uptime check for "transient" issues as well. Here's hoping that these changes improve its stability, but do let me know if any of you have issues with it. 🤞
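A minimal sketch of what such a watchdog could look like, assuming a container named `neo4j`, a plain `docker restart`, the 5-minute polling interval mentioned later in this thread, and the third-party `requests` library (the actual script isn't shown here):

```python
#!/usr/bin/env python3
"""Sketch of a Neo4j browser watchdog: poll the endpoint and restart the
Docker container when it is slow or unhealthy. Names and intervals are
assumptions, not the deployed configuration."""
import subprocess
import time

import requests

URL = "https://neo4j.het.io/browser/"
CONTAINER = "neo4j"        # hypothetical container name
TIMEOUT_S = 30             # responses slower than this count as failures
POLL_INTERVAL_S = 5 * 60   # check every 5 minutes

def healthy() -> bool:
    """Return True if the browser endpoint answers 200 within the timeout."""
    try:
        return requests.get(URL, timeout=TIMEOUT_S).status_code == 200
    except requests.RequestException:
        # covers read timeouts, connection errors, TLS failures, etc.
        return False

def restart_container() -> None:
    """Restart the Neo4j Docker container, which also resets its memory use."""
    subprocess.run(["docker", "restart", CONTAINER], check=False)

if __name__ == "__main__":
    while True:
        if not healthy():
            restart_container()
        time.sleep(POLL_INTERVAL_S)
```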

dhimmel commented 2 years ago

Thanks a lot @falquaddoomi! Stoked that we're able to automate the restarts.

> I'll keep investigating why this is happening, since if there's a memory leak, what I proposed will only delay the outages, not eliminate them.

But the outages will be short-lived, and the reboot will reset the memory usage, right?

Since the instance is running a pretty old version of Neo4j, there's probably not a ton of value in spending much time diagnosing the memory leak. I played around with upgrading in https://github.com/hetio/hetionet/pull/33, but was hitting a bunch of problems.

So in summary, don't worry too much about digging into the memory leak unless you think that will create an actionable insight.

falquaddoomi commented 2 years ago

Right, the outages shouldn't last more than 5 minutes (that's the current polling interval), and if necessary the entire Neo4j container gets restarted, which would reset its memory usage. Fair point about it not being worth tracking down a memory leak in an older version of Neo4j. I'll take a look at #33 and see if I can make progress on it.

dhimmel commented 2 years ago

> I'll take a look at https://github.com/hetio/hetionet/pull/33 and see if I can make progress on it

Any help is appreciated, but a forewarning that there are several things that were breaking: guides, HTTPS, and more. So, I'm happy to video chat at any point and give you an overview of the hurdles if that'd be helpful.