Closed: zheileman closed this issue 5 years ago
I've confirmed this on
https://fj-cait-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json
I got a 504 (very) roughly once every 50 requests.
I didn't get any 504s on
https://family-mediators-api-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json
But just because it didn't happen in the 100 or so requests I sent doesn't mean it's not happening.
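For reference, a rough way to quantify the rate is to fire a batch of requests and tally the status codes; something along these lines (the 200-request count, 1-second pause and 10-second timeout are arbitrary choices, not anything we agreed on):

# Probe the ping endpoint repeatedly and tally the HTTP status codes returned.
URL=https://fj-cait-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json
for i in $(seq 1 200); do
  curl -s -o /dev/null -w "%{http_code}\n" --max-time 10 "$URL"
  sleep 1
done | sort | uniq -c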
Quick update: 24 hours after the bump in CPU/RAM allocations, we've not had any outage/downtime in any of our pods across 3 different services.
Looks good so far 👍
@zheileman How is the service looking? Any more 504s?
Sorry to report we've had a few more blips recently in a couple environments.
Same 504 error. But I couldn't trigger any 504s myself, so perhaps it has resolved itself again?
This is Family Mediators; after a couple of days without 504 errors, we got 2:
And this is CAIT, only one 504 so far:
One of the instances on the load balancer is "out of service". Docker containers not running (looking into it). Family Mediators/FM -0a84b5b061431548c
Initial reboot of the instance. Just realised that is staging (template deploy).
Hi @pwyborn
Is that template deploy? This issue ticket was really for k8s. But if there is any issue with template deploy that can be solved quickly, then that's fine with me 👍
The 504 errors seem to have disappeared now, but we will continue monitoring.
Yeah @zheileman - got my wires crossed there. Yes that was template deploy, and there was a problem with one of the instances - it is ok now.
Seemed OK when I checked - admittedly only over 5 minutes on each:
$ watch curl -s -o /dev/null -w "%{http_code}" https://family-mediators-api-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json
$ watch curl -s -o /dev/null -w "%{http_code}" https://fj-cait-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json
@zheileman - let me know if you are still getting problems. Difficult to analyse, as I do not think the application logs are going to fluentd/kibana.
@zheileman - have brought up 2 fresh pods on each, just in case this frees anything up (I don't think it will though). Can you please continue to monitor?
This continues to happen, although less frequently than before.
These are the micro downtimes for CAIT staging (k8s):
And these are the micro downtimes for Family Mediators API staging (k8s):
We are not currently monitoring the production environments (Kubernetes) because they are not in use (we are still using the template deploy production envs instead), but I anticipate this is also happening in the production envs.
Please let me know if you need other information.
Hi @zheileman, there have been quite a few upgrades in the meantime; how does the report look for you?
Hi @razvan-moj
We continue to observe micro-downtimes with 504 errors. It seems to be particularly bad in the Mediators API production environment (it is soft production, not public; we still use template deploy for the "real" production).
As far as I can see this is the only service where we've had 504 in the last 24 hours.
@zheileman the ingress configuration was changed today, removing ALBs and keeping NLBs, which means all errors at the HTTP level will reach your logs; please check again in e.g. 24 hrs whether you still get the sporadic 504s.
@razvan-moj we've not observed any more 504s over the last few days. One caveat: we've started migrating some of these services to live-1 and stopped/removed the live-0 ones, so I'm not sure if that's the reason.
In any case, looking good so far 😃
We've been observing tiny downtimes in all of our services deployed to k8s (staging and production envs across 3 different namespaces), caused by some requests hitting what seems to be a bad proxy, load balancer or IP; I don't know how to diagnose this.
I can observe this myself by going to a service URL and repeatedly reloading the page until, eventually, the 504 error happens.
Some URLs that I've been probing:
https://family-mediators-api-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json
https://fj-cait-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json
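To reproduce this from the command line rather than the browser, a small loop along these lines should work against either of the URLs above; it is just a sketch that stops at the first 504 and reports how many attempts it took (the 1-second pause is arbitrary):

# Keep hitting the endpoint until the first 504, then report how many requests it took.
URL=https://fj-cait-staging.apps.cloud-platform-live-0.k8s.integration.dsd.io/ping.json
n=0
while true; do
  n=$((n + 1))
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$URL")
  if [ "$code" = "504" ]; then
    echo "504 after $n requests"
    break
  fi
  sleep 1
done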
Pingdom report showing the probe failures:
It seems the problems started around the 26th or 27th of February:
Contact person: Jesus @ Slack