mozilla / addons

☂ Umbrella repository for Mozilla Addons ✨

The versioncheck and services endpoints return 500 for health check #1935

Open bqbn opened 1 year ago

bqbn commented 1 year ago

Describe the problem and steps to reproduce it:

We get a 500 error when hitting /__heartbeat__ on the services and versioncheck endpoints.

For example, on one of the services_web instances in the stage environment:

$ curl -i -s -H "host: services.addons.allizom.org" http://0.0.0.0:4000/__heartbeat__
HTTP/1.1 500 Internal Server Error
Content-Type: application/json
Expires: Fri, 26 May 2023 20:07:03 GMT
Cache-Control: max-age=0, no-cache, no-store, must-revalidate, private
X-AMO-Request-ID: 9291dc0d82854fd692e89bb5065d2f29
Content-Length: 260
Content-Security-Policy: img-src 'self' blob: data: https://addons.mozilla.org/static-server/ https://addons.mozilla.org/user-media/ https://addons.allizom.org/user-media/ https://addons.allizom.org/static-server/; connect-src 'self' https://*.google-analytics.com; script-src https://www.google-analytics.com/analytics.js https://www.googletagmanager.com/gtag/js https://www.recaptcha.net/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.gstatic.cn/recaptcha/ https://addons.mozilla.org/static-server/ https://addons.allizom.org/static-server/; object-src 'none'; frame-src https://www.recaptcha.net/recaptcha/; child-src https://www.recaptcha.net/recaptcha/; style-src 'unsafe-inline' https://addons.mozilla.org/static-server/ https://addons.allizom.org/static-server/; media-src https://videos.cdn.mozilla.net; form-action 'self'; default-src 'none'; font-src 'self' https://addons.mozilla.org/static-server/ https://addons.allizom.org/static-server/; report-uri /__cspreport__
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
Referrer-Policy: same-origin
Cross-Origin-Opener-Policy: same-origin
Vary: Accept-Encoding

{"memcache": {"state": true, "status": ""}, "libraries": {"state": true, "status": ""}, "elastic": {"state": true, "status": ""}, "path": {"state": false, "status": "check main status page for broken perms / values"}, "database": {"state": true, "status": ""}}
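The JSON body above identifies which individual check failed. As a minimal sketch (the helper name `failing_checks` is hypothetical and not part of addons-server), the failing entries can be pulled out of the response like this:

```python
import json

# Response body copied verbatim from the curl output above.
BODY = (
    '{"memcache": {"state": true, "status": ""}, '
    '"libraries": {"state": true, "status": ""}, '
    '"elastic": {"state": true, "status": ""}, '
    '"path": {"state": false, "status": "check main status page for broken perms / values"}, '
    '"database": {"state": true, "status": ""}}'
)


def failing_checks(body):
    """Return {check_name: status_message} for every check whose state is false."""
    results = json.loads(body)
    return {name: r["status"] for name, r in results.items() if not r["state"]}


print(failing_checks(BODY))
# -> {'path': 'check main status page for broken perms / values'}
```

Only the "path" check reports a false state here; the other four pass.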

What happened?

When we move AMO to the GKE platform, Kubernetes will check __heartbeat__ to determine pod readiness. We therefore need these two endpoints to return 200 when they are ready to serve traffic.

What did you expect to happen?

For the services and versioncheck endpoints to return 200 when they're ready to serve traffic.

Anything else we should know?

n/a


diox commented 1 year ago

Is EFS mounted on services and versioncheck? The check that is failing verifies permissions on various paths.

bqbn commented 1 year ago

Oh, it wasn't. After mounting the NFS share, __heartbeat__ works.

Is it possible for these two components to pass the health check without mounting the NFS share? They don't really need the share, and in production (AWS) we currently don't mount it.

This issue doesn't block the GCP migration, though. We'll mount the share for now.

diox commented 1 year ago

Is there an env variable, or something other than the request URL, that I can use to detect that we're on a services or versioncheck instance?

bqbn commented 1 year ago

We can pass an env variable, such as AMO_COMPONENT or ADDONS_SERVER_COMPONENT, to the app container to help it identify itself.

diox commented 1 year ago

Yes, that would be helpful to fix this.
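One way the proposal in this thread could look, as a rough sketch only and not the actual addons-server code: the heartbeat skips the path check when the container identifies itself as a component that does not need the share. The variable name ADDONS_SERVER_COMPONENT comes from the suggestion above; the values "services" and "versioncheck", and the functions `should_check_paths`, `path_check` and `heartbeat`, are assumptions made for illustration.

```python
import os

# Assumption: the container sets ADDONS_SERVER_COMPONENT to one of these
# (hypothetical) values on instances that do not mount the NFS share.
COMPONENTS_WITHOUT_SHARE = {"services", "versioncheck"}


def should_check_paths(environ=os.environ):
    """Run the path check unless the component is known not to need the share."""
    component = environ.get("ADDONS_SERVER_COMPONENT", "")
    return component not in COMPONENTS_WITHOUT_SHARE


def path_check(paths):
    """Hypothetical stand-in for the real check: verify read/write access."""
    broken = [p for p in paths if not os.access(p, os.R_OK | os.W_OK)]
    status = "" if not broken else "check main status page for broken perms / values"
    return {"state": not broken, "status": status}


def heartbeat(paths, environ=os.environ):
    """Return (HTTP status code, per-check results) for the heartbeat."""
    checks = {}
    if should_check_paths(environ):
        checks["path"] = path_check(paths)
    # 200 only if every check that actually ran passed.
    status_code = 200 if all(c["state"] for c in checks.values()) else 500
    return status_code, checks
```

With this shape, a services or versioncheck pod would pass its readiness probe without the share mounted, while other instances would keep the existing behaviour because the path check still runs when the variable is unset.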

KevinMind commented 2 months ago

Old Jira Ticket: https://mozilla-hub.atlassian.net/browse/ADDSRV-376