pulibrary / ops-catchall

Operations Catch All

Check_MK: localhost (lae) #91

Open kayiwa opened 2 months ago

kayiwa commented 2 months ago
Service PROBLEM notification
Host: localhost (IP: 127.0.0.1)
Service: HTTPS lae
State: CRITICAL
Additional Info
CRITICAL - Socket timeout after 10 seconds

We have an alert that is almost certainly misconfigured to check for content on an endpoint.

acozine commented 2 months ago

The check is currently green, although it regularly goes Critical/red overnight. It's checking lae.princeton.edu for the text "Digital Archive of Latin America and Caribbean Ephemera", which is definitely present on the site home page. Maybe the machine reboots? Or the site is unresponsive for some reason?
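
To tell "page did not respond" apart from "page responded but the text is missing", the check described above can be reproduced by hand. This is only a rough sketch, not the actual Check_MK configuration: the `https://` scheme and `/` path are assumptions, the 10 second timeout is taken from the alert text, and `classify` and `fetch_home` are hypothetical helper names.

```python
import urllib.request

# Text the check looks for, as quoted in this thread.
EXPECTED = "Digital Archive of Latin America and Caribbean Ephemera"


def classify(fetch, expected=EXPECTED):
    """Run one probe. `fetch` is any callable that returns the page body as text."""
    try:
        body = fetch()
    except OSError as exc:
        # URLError, socket timeouts and connection resets are all OSError subclasses.
        return "CRITICAL", f"request failed: {exc!r}"
    if expected in body:
        return "OK", "expected text found"
    return "CRITICAL", "expected text missing"


def fetch_home(url="https://lae.princeton.edu/", timeout=10):
    """Fetch the home page. The URL is an assumption; the timeout mirrors the alert."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(classify(fetch_home))
```

Running this from cron overnight and keeping the output would show whether failures are timeouts or missing content.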

acozine commented 2 months ago

I checked uptime on lae-prod1 and lae-prod2. Both have been up for 8 days, so the problem is not that the servers are rebooting overnight.

acozine commented 1 month ago

We saw multiple alerts and recoveries on this check over the weekend. Both VMs have plenty of disk space. In the Rails logs I see a couple of entries like this:

W, [2024-09-30T00:28:36.394569 #142462]  WARN -- honeybadger: ** [Honeybadger] Error report failed: an unknown error occurred. code=error error="HTTP Error: Net::OpenTimeout" level=2 pid=142462

and a lot of entries like this:

E, [2024-09-30T00:28:38.832051 #142483] ERROR -- : [dd.env=production dd.service=dpul dd.trace_id=85276041913562946 dd.span_id=1297836789973117643 ddsource=ruby] Health check failed with: execution expired
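
To see whether these failures cluster at the times the alert fires, the log entries can be tallied per hour. A minimal sketch, assuming the logs use the standard Ruby `Logger` line format shown above; `failures_per_hour` is a hypothetical helper and the log file path would need to be filled in.

```python
import re
from collections import Counter

# Matches lines like: "E, [2024-09-30T00:28:38.832051 #142483] ERROR -- : ..."
# Group 1 is the timestamp truncated to the hour, group 2 is the severity.
LINE_RE = re.compile(
    r"^[A-Z], \[(\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}\.\d+ #\d+\]\s+(\w+) -- "
)


def failures_per_hour(lines, needle="Health check failed"):
    """Count log lines containing `needle`, grouped by hour."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and needle in line:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py path/to/production.log (path is a placeholder)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for hour, count in sorted(failures_per_hour(handle).items()):
            print(hour, count)
```

Comparing the busiest hours against the alert and recovery times in Check_MK should show whether the two line up.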