ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform
MIT License

Random 504 GATEWAY_TIMEOUT errors in pods #693

Closed · zheileman closed this issue 5 years ago

zheileman commented 5 years ago

We've been observing tiny downtimes across all of our services deployed to k8s (staging and production envs across 3 different namespaces). Some requests appear to hit a bad proxy, load balancer or IP; I don't know how to diagnose this.

I can observe this myself by going to a service URL and repeatedly reloading the page until, eventually, the 504 error happens.

Some URLs that I've been probing:

https://family-mediators-api-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json

https://fj-cait-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json
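
To reproduce this without manual reloading, a loop along these lines should surface the failure rate. This is just a sketch: the URL and the request count of 100 are illustrative, and any of the affected endpoints would do.

```bash
#!/usr/bin/env bash
# Probe loop sketch: request the endpoint repeatedly and count 504 responses.
URL="https://fj-cait-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json"
TOTAL=100
FAILURES=0

for i in $(seq 1 "$TOTAL"); do
  code=$(curl -s -o /dev/null -w "%{http_code}" "$URL")
  if [ "$code" = "504" ]; then
    FAILURES=$((FAILURES + 1))
    echo "$(date -u +%FT%TZ) request $i returned 504"
  fi
done

echo "$FAILURES/$TOTAL requests returned 504"
```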

Pingdom report showing the probe failures:

[Screenshots: Pingdom probe failures, 2019-03-05 11:48 and 11:55]

It seems the problems started around 26th or 27th February:

[Screenshot: Pingdom uptime history, 2019-03-05 11:57]

Contact person

Jesus @ slack

digitalronin commented 5 years ago

I've confirmed this on

https://fj-cait-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json

I got a 504 (very) roughly once every 50 requests.

I didn't get any 504s on

https://family-mediators-api-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json

But, just because it didn't happen in the 100 or so requests I sent, doesn't mean it's not happening.
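
To capture more detail when one of these 504s does occur, something along these lines keeps requesting until the first non-200 response and saves the headers and body, which should show whether the 504 page comes from the ingress or from a load balancer in front of it. A sketch only; the output file names are illustrative.

```bash
#!/usr/bin/env bash
# Diagnostic sketch: poll until a non-200 response, then dump it for inspection.
URL="https://fj-cait-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json"

while true; do
  code=$(curl -s -D headers.txt -o body.txt -w "%{http_code}" "$URL")
  if [ "$code" != "200" ]; then
    echo "Got HTTP $code at $(date -u +%FT%TZ)"
    cat headers.txt body.txt
    break
  fi
done
```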

zheileman commented 5 years ago

Quick update: 24 hours after the bump in CPU/RAM allocations, we've not had any outage/downtime in any of our pods across 3 different services.

Looks good so far 👍

digitalronin commented 5 years ago

@zheileman How is the service looking? Any more 504s?

zheileman commented 5 years ago

Sorry to report we've had a few more blips recently in a couple of environments.

Same 504 error. But I couldn't trigger any 504 myself, so perhaps it has resolved itself again?

This is Family Mediators; after a couple of days without 504 errors, we got two:

[Screenshot: Pingdom outages for Family Mediators, 2019-03-08 09:09]

And this is CAIT, only one 504 so far:

[Screenshot: Pingdom outage for CAIT, 2019-03-08 09:11]

pwyborn commented 5 years ago

One of the instances on the load balancer is "out of service". Docker containers not running (looking into it): Family Mediators/FM -0a84b5b061431548c

pwyborn commented 5 years ago

Initial reboot of the instance. Just realised that is staging (template deploy).

zheileman commented 5 years ago

Hi @pwyborn

Is that template-deploy? This issue ticket was for k8s really. But if there is any issue with template deploy that can be solved quickly, then that's fine with me 👍

The 504 errors seem to have disappeared now, but we'll continue monitoring.

pwyborn commented 5 years ago

Yeah @zheileman - got my wires crossed there. Yes that was template deploy, and there was a problem with one of the instances - it is ok now.

Seemed OK when I checked - admittedly only over 5 minutes on each:

$ watch curl -s -o /dev/null -w "%{http_code}" https://family-mediators-api-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json

$ watch curl -s -o /dev/null -w "%{http_code}" https://fj-cait-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json

pwyborn commented 5 years ago

@zheileman - let me know if you are still getting problems. Difficult to analyse as I do not think the application logs are going to fluentd/kibana.

pwyborn commented 5 years ago

@zheileman - have brought up 2 fresh pods on each, just in case this frees up anything (don't think it will though). Can you please continue to monitor?
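
For reference, one hypothetical way to cycle pods like this (not necessarily how it was done here) is to delete them and let the Deployment controller bring up fresh replacements; the namespace and pod names below are placeholders.

```bash
# Hypothetical commands; <namespace> and <pod-name> are placeholders.
kubectl -n <namespace> get pods
kubectl -n <namespace> delete pod <pod-name>   # the Deployment controller starts a fresh pod
kubectl -n <namespace> get pods -w             # watch the replacement come up
```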

zheileman commented 5 years ago

This continues happening although less frequently than before.

These are the micro downtimes for CAIT staging (k8s):

[Screenshot: Pingdom micro downtimes for CAIT staging, 2019-03-14 10:38]

And these are the micro downtimes for Family Mediators API staging (k8s):

[Screenshot: Pingdom micro downtimes for Family Mediators API staging, 2019-03-14 10:41]

We are not currently monitoring the production environments (Kubernetes) because they are not in use (we are still using the template deploy production envs instead), but I anticipate this is also happening in the production envs.

Please let me know if you need other information.

razvan-moj-zz commented 5 years ago

Hi @zheileman, there have been quite a few upgrades in the meantime; how does the report look for you?

zheileman commented 5 years ago

Hi @razvan-moj

We continue to observe micro-downtimes with the 504 error. It seems to be particularly bad in Mediators API production (it is soft production, not public; we still use template deploy for the "real" production).

As far as I can see, this is the only service where we've had 504s in the last 24 hours.

[Screenshots: Pingdom outages for Mediators API production, 2019-03-28 11:31]

razvan-moj-zz commented 5 years ago

@zheileman ingress configuration was changed today, removing ALBs and keeping NLBs, which means all errors at the HTTP level will reach your logs; please check again in e.g. 24 hrs whether you still get the sporadic 504s.

zheileman commented 5 years ago

@razvan-moj we've not observed any more 504s for the last few days. One caveat: we've started migrating some of these services to live-1 and have stopped/removed the live-0 ones, so I'm not sure if that's the reason.

In any case, looking good so far 😃