Need to create system status page of the che.openshift.io

ibuziuk commented 5 years ago

Currently there is no status page for che.openshift.io that provides information about the state of the platform. Many online services provide this kind of information about their own platforms.

It was decided to use an account on https://www.statuspage.io/ instead of creating a custom dsaas service.

sub-tasks:

[x] Create che.openshift.io account on statuspage.io and contribute info about github state https://github.com/redhat-developer/rh-che/issues/1240
[x] Investigate how Prometheus metrics could be used on statuspage.io https://github.com/redhat-developer/rh-che/issues/1237
[ ] Need to contribute 2 system metrics to statuspage https://github.com/redhat-developer/rh-che/issues/1286

Related openshift.io user-story - https://github.com/openshiftio/openshift.io/issues/4730

slemeur commented 5 years ago

Should we have the root epic openshiftio/openshift.io#4730 under this repository?

ibuziuk commented 5 years ago

@slemeur having user-story under openshift.io works just fine IMO (I personally do not think we should have it under rh-che since the status service would be a separate repo)

fche commented 5 years ago

Does this service allow you to feed metrics or programmatic configuration changes to it? e.g. can you tell it to start monitoring a given route url that doesn't exist yet, and measure time until it does?

ibuziuk commented 5 years ago

can you tell it to start monitoring a given route url that doesn't exist yet, and measure time until it does ?

@fche hmm.. why would you like to start monitoring non-existing route ? The main question we are currently having is if statuspage.io can support Prometheus format properly - https://github.com/redhat-developer/rh-che/issues/1237

fche commented 5 years ago

why would you like to start monitoring non-existing route ?

Related to the other need to track openshift route-creation times. Notify service at oc api call start time, let it determine time taken for route to be actually accessible.
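A minimal sketch of that measurement, assuming the caller starts it right after the oc api call returns. The names `time_until_accessible` and `http_probe`, the timeout values, and the "any 2xx/3xx response counts as accessible" rule are all illustrative assumptions, not an existing service:

```python
import time
import urllib.request
from typing import Callable, Optional


def time_until_accessible(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it returns True.

    Returns the elapsed seconds, or None if `timeout_s` passes first.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


def http_probe(url: str) -> Callable[[], bool]:
    """Build a probe that treats any 2xx/3xx response as 'accessible' (an assumption)."""
    def probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return 200 <= resp.status < 400
        except OSError:
            # Covers DNS failures, refused connections, and HTTP error statuses.
            return False
    return probe
```

Usage would be something like `time_until_accessible(http_probe(route_url))`, where `route_url` is the route that was just requested. Note that this measures accessibility from wherever the poller runs, which may differ from what an external user sees.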

ibuziuk commented 5 years ago

@fche AFAIK, it is planned to be done on che-server side and exposed via a Prometheus metric - https://github.com/eclipse/che/issues/12699
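For illustration only, this is roughly what such a measurement looks like in the Prometheus text exposition format. The metric name `che_route_ready_seconds` and the `workspace` label are hypothetical; the actual names would be defined in the che-server issue above:

```python
from typing import Mapping, Optional


def _escape_label(value: str) -> str:
    # Label values escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(
    name: str,
    help_text: str,
    value: float,
    labels: Optional[Mapping[str, str]] = None,
) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{_escape_label(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    # HELP text escapes backslash and newline only.
    help_escaped = help_text.replace("\\", "\\\\").replace("\n", "\\n")
    return (
        f"# HELP {name} {help_escaped}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


# Hypothetical metric name and label, for illustration.
print(render_gauge("che_route_ready_seconds", "Time until route is reachable", 1.5, {"workspace": "ws1"}))
```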

fche commented 5 years ago

cc: @gorkem In other than the short term, does this sound like the sort of tool we should provide for ourselves, as opposed to outsourcing it?

fche commented 5 years ago

it is planned to be done on che-server side

OK, assuming it is in a position to reliably tell whether the routes are externally accessible. BTW, submitted this RFE for openshift to consider supplying this info itself: https://github.com/openshift/origin/issues/22107

ibuziuk commented 5 years ago

In other than the short term, does this sound like the sort of tool we should provide for ourselves, as opposed to outsourcing it?

@fche if we opt for a custom dsaas service the major question is, who will be the primary owner / maintainer ?

fche commented 5 years ago

who will be the primary owner / maintainer

aye, there is the rub

But independent of that question, one can work out in greater detail just what info you'd like to see there.

ibuziuk commented 5 years ago

@fche I believe most of the details are covered in the following user-story - https://github.com/openshiftio/openshift.io/issues/4730

fche commented 5 years ago

What do you think the chances are that many or all of the datasets you are talking about could be rendered entirely as grafana (or perhaps pcp) dashboards? So, assume there is a queriable metric database nearby the rhche server. Assume it's been gathering the status/health metrics being discussed over at openshiftio/openshift.io#4730. Does the "system status" have to be anything other than a preconfigured dashboard - with some combination of graphical or textual forms we can generate?

ibuziuk commented 5 years ago

What do you think the chances are that many or all of the datasets you are talking about could be rendered entirely as grafana (or perhaps pcp) dashboards?

I believe everything could be rendered entirely via Grafana, but the goal of statuspage is to be user-friendly: easy to update, easy to notify users, easy to create an incident, easy to schedule maintenance, etc. So Grafana and a status page are two different beasts.

fche commented 5 years ago

Could we think about it as the public status-page being downstream of our internal status dashboards & machinery? i.e., not tightly coupled to che, but rather to a hypothetical dev-console health dashboard?

ibuziuk commented 5 years ago

IMO, che.openshift.io is a very special case, not ~~tightly~~ related to the SaaS, and deserves its own status page

fche commented 5 years ago

Understood, just trying to minimize number of bits of machinery and maximize reusability. Maybe think of it more like - a running copy of che should have its own health display for benefit of each of its users. Can the public dashboard be another consumer of that same data & maybe even some of the same renderings?

ibuziuk commented 5 years ago

well, potentially it could, but ideally the status page should be deployed separately from the monitored service - if the service is down, the status page should still be up with the reported incident (if the status page is part of the service itself, it would be down together with the service during an incident / scheduled maintenance)

fche commented 5 years ago

Yup, kind of like a reliable mirror.

fche commented 5 years ago

As a prototype, before we do a full proper operator / openshift4 / prometheus flavoured thing, we could perhaps layer a small piece of new code on top of the existing osd-monitor-poc pcp-based infrastructure, to relay metric threshold crossing events to statuspage.io. We'd need to know a sample metric name and threshold predicate, and statuspage.io api credentials.
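A rough sketch of the relay logic, independent of pcp and of the statuspage.io API. Everything here is an assumption for illustration: `Crossing`, `threshold_crossings`, and `relay` are made-up names, the samples are plain `(timestamp, value)` pairs, and `publish` is a placeholder for whatever call would actually notify statuspage.io with the credentials:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Crossing:
    timestamp: float
    value: float
    breached: bool  # True when the predicate starts holding, False when it stops


def threshold_crossings(
    samples: Iterable[Tuple[float, float]],
    predicate: Callable[[float], bool],
) -> Iterator[Crossing]:
    """Yield an event only when the predicate's result changes between samples.

    Assumes the metric starts in the healthy (predicate False) state.
    """
    previous = False
    for timestamp, value in samples:
        current = predicate(value)
        if current != previous:
            yield Crossing(timestamp, value, current)
        previous = current


def relay(
    samples: Iterable[Tuple[float, float]],
    predicate: Callable[[float], bool],
    publish: Callable[[Crossing], None],
) -> int:
    """Forward each crossing to `publish` (hypothetical sink); return how many were sent."""
    count = 0
    for crossing in threshold_crossings(samples, predicate):
        publish(crossing)
        count += 1
    return count


# Example: a made-up latency metric with a "greater than 5 seconds" predicate.
events = []
relay([(0, 1.0), (1, 6.0), (2, 7.0), (3, 2.0)], lambda v: v > 5, events.append)
```

Emitting only on state changes, rather than on every sample above the threshold, keeps the number of outbound calls small, which matters if the status service rate-limits its API.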

ibuziuk commented 5 years ago

@fche will you be able to give a hand with implementing the push part in the next sprint? (First we need to figure out which metrics we are going to push: the hobby plan offers only 2 system metrics, so we need to be picky.)

fche commented 5 years ago

Can indeed help with a quick prototype, presuming building on the present osd-monitor-poc machinery, not major new stuff. It's about as complicated as adding a new outbound zabbix relay.

ibuziuk commented 5 years ago

Sounds good, I will reach out once I have more details about the params for the statuspage API

ibuziuk commented 5 years ago

Closing this epic since https://che.statuspage.io/ is set up and we have a separate issue for contributing system metrics to statuspage (which is currently not a priority) - #1286

redhat-developer / rh-che

Need to create system status page of the che.openshift.io #1224