mitodl / ol-infrastructure

Infrastructure automation code for use by MIT Open Learning
BSD 3-Clause "New" or "Revised" License

forum Healthcheck isn't reliable #2558

Open Ardiea opened 4 months ago

Ardiea commented 4 months ago

Expected Behavior

If forum can't talk to its mongodb or opensearch backends, the app should crash / stop outright, rather than enter a funky state where the ASG / LB healthcheck passes but the app itself isn't working.

Current Behavior

If forum can't find its mongodb or opensearch instances for 10 minutes, it just stops looking for them and enters a catatonic state: it is still 'running' well enough for the LB healthchecks to pass, but it isn't really working because it won't answer any requests, and the container is possibly stopped / not listening.

Possible Solution

Put traefik in front of the container to create a healthcheck endpoint that works? Alternatively, figure out the behavior of forum and adjust the healthcheck status matcher appropriately.
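For illustration only, here is a rough sketch of what a "deep" healthcheck endpoint could look like if it were served next to the container. It is not how forum or traefik work today. The hostnames, ports, the `/healthz` path and the `TARGETS` table are all hypothetical placeholders, and a TCP connect is only a stand-in for a real query against each backend.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical addresses (assumptions, not taken from the real deployment).
TARGETS = {
    "forum": ("127.0.0.1", 4567),
    "mongodb": ("mongodb.example.internal", 27017),
    "opensearch": ("opensearch.example.internal", 9200),
}


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_targets(targets, probe=tcp_reachable):
    """Map each target name to True/False depending on whether it is reachable."""
    return {name: probe(host, port) for name, (host, port) in targets.items()}


def health_status(results):
    """Return 200 only if every target is reachable, otherwise 503."""
    return 200 if results and all(results.values()) else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        results = check_targets(TARGETS)
        lines = [f"{name}: {'ok' if ok else 'unreachable'}"
                 for name, ok in sorted(results.items())]
        body = "\n".join(lines).encode()
        # A 503 here is what lets the LB / ASG mark the instance unhealthy.
        self.send_response(health_status(results))
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8080 is an arbitrary choice for the sketch.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

With something like this, the LB status matcher could stay a plain 200 check. A TCP connect only shows that a port is open, so it would not by itself catch the case where forum is listening but has given up on its backends. That would still need a real request against forum.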

Additional Details

Discussion starting here and going to about 4pm that day. https://mitodl.slack.com/archives/C02QLTAE05S/p1721329113019089

pdpinch commented 1 month ago

Do you think we can work on this and have it deployed with the Sumac updates to xPRO and Residential MITx?

cc @blarghmatey and @feoh