openshiftio / openshift.io

Red Hat OpenShift.io is an end-to-end development environment for planning, building and deploying modern applications.
https://openshift.io

Collect metrics around OSIO build pipelines #2778

Open krishnapaparaju opened 6 years ago

krishnapaparaju commented 6 years ago

We would need to collect these metrics and feed them to Zabbix. These metrics would help us understand the operational side of things, with clear visibility into the quality of the OSIO user experience.

lordofthejars commented 6 years ago

The idea of this comment is to enumerate the metrics, add my comments on them, and ask questions about when and how this should be done. Note that my concerns are purely from a QA point of view, so I may be missing part of why these metrics are important, but I prefer to ask and question them rather than say nothing.

a) Manual initiated from OSIO pipelines screens

For this metric, the only use I can see is understanding user preferences, for example whether they rerun builds manually instead of via webhook; in terms of QA I don't currently see the benefit.

I am not sure if Jenkins provides this data.

b) Number of times, 'view log' been clicked

What is the value of knowing whether a user has clicked to see the build log? Most of the time you view the log not because there is a failure but because you want to know how the build is going and what it is doing / which stage it is in.

I am not sure if Jenkins provides this data.

c) Able to determine if a build log shows success / failure. If Failure, store the contents of the failures

You can get this from the build result; it is not necessary to check anything in the log. This seems like a good metric, but in the end a build can fail either because the Build team introduced a regression or because the project itself contains a flaky test. So what we end up with is something like a flaky metric: we see that something is failing constantly, look at what is happening, and find it is unrelated to the Build team, which means spending time analyzing each and every failure the metric flags.

I think this data can be retrieved using the Jenkins REST API; see the sketch after this list.

d) When an instance is supposed to be idled, is it actually getting idled?

This is something that might be interesting, since I have found some delays here. What I don't know is whether this is provided by OSO or fabric8, or is something developed by us, so I will need some background on it.

e) When an instance is supposed to be un-idled, is it actually getting un-idled?

Same as before

f) How many times web hooks are being received, store the source for these webhooks

Exactly the same as point a): it can be used for some statistics, but in terms of QA I don't see much benefit.
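
As a rough illustration of metric c, here is a minimal sketch that reads the result of the last build from the Jenkins JSON API instead of parsing the console log. The Jenkins URL and job name are placeholders, and authentication is omitted:

```go
// Minimal sketch for metric c: read the result of the last build from the
// Jenkins JSON API instead of parsing the console log. The Jenkins URL and
// job name are placeholders; authentication and error details are omitted.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type buildInfo struct {
	Number int    `json:"number"`
	Result string `json:"result"` // "SUCCESS", "FAILURE", "UNSTABLE", "ABORTED", or empty while running
}

func lastBuildResult(jenkinsURL, job string) (buildInfo, error) {
	var info buildInfo
	resp, err := http.Get(fmt.Sprintf("%s/job/%s/lastBuild/api/json", jenkinsURL, job))
	if err != nil {
		return info, err
	}
	defer resp.Body.Close()
	err = json.NewDecoder(resp.Body).Decode(&info)
	return info, err
}

func main() {
	info, err := lastBuildResult("https://jenkins.example.com", "my-pipeline")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("build #%d finished with result %q\n", info.Number, info.Result)
}
```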

krishnapaparaju commented 6 years ago

@lordofthejars +1. Not all of these metrics would come from Jenkins; we will need to figure out ways / add new components as required to collect these important metrics (this is less about collecting numbers and more about flagging failures).

pradeepto commented 6 years ago

Duplicate of https://github.com/openshiftio/openshift.io/issues/2245

cc @krishnapaparaju @lordofthejars

lordofthejars commented 6 years ago

I had a quick call with @jaseemabid, and our first question was whether we should provide support for publishing data to Zabbix or to Prometheus. The question matters because Zabbix uses a push model, where we send data to it, whereas Prometheus uses a pull model, where we need to provide an HTTP endpoint that Prometheus can scrape.
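
To make the pull model concrete, here is a deliberately minimal sketch of what "provide an HTTP endpoint" means in practice: the service keeps its own counters and only serves their current values when Prometheus scrapes them. The metric name and port are made up for the example; a real service would use the Prometheus client library rather than hand-writing the format (see the later comments).

```go
// Deliberately minimal illustration of the Prometheus pull model: the service
// never pushes anything, it only serves current values on /metrics and
// Prometheus scrapes them on its own schedule. Metric name and port are made up.
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

var buildsStarted uint64 // incremented wherever a pipeline build is triggered

func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	// Hand-written Prometheus text exposition format, just to show the idea.
	fmt.Fprintln(w, "# TYPE osio_builds_started_total counter")
	fmt.Fprintf(w, "osio_builds_started_total %d\n", atomic.LoadUint64(&buildsStarted))
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	http.ListenAndServe(":8080", nil)
}
```

The Zabbix push model, by contrast, would mean the service itself sending values to the Zabbix server (for example with zabbix_sender) on its own schedule.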

Another thing worth considering: https://wiki.jenkins.io/display/JENKINS/Metrics+Plugin does not cover the metrics requested here, but it does provide some metrics about masters and slaves, so it is something we can also take into account.

https://github.com/fabric8io/fabric8-build-team/issues/24

joshuawilson commented 6 years ago

@aslakknutsen how much of the metrics work you have been doing can help here?

lordofthejars commented 6 years ago

I have been talking with Aslak about monitoring, since he has worked a lot in this area, and we have arrived at some conclusions that affect how to implement this issue.

First of all, we need to talk to Prometheus, not Zabbix, because OSIO already provides an integration with it; we only need to take care of providing the required endpoints, and the rest is handled by the infrastructure.

As for the best way to proceed, we agreed that for now the best approach is to forget about Jenkins itself and focus on the easier pieces, which are the Jenkins Idler (https://github.com/fabric8-services/fabric8-jenkins-idler/issues/168) and the Jenkins Proxy.

So the initial task might be: "enable a Prometheus endpoint in your service, collect whatever metrics we need, and enable PCP in the service".

"Enabling PCP alone will give you all the standard metrics (CPU/memory/network, etc.). Expose a Prometheus endpoint, e.g. https://github.com/fabric8-services/fabric8-wit/blob/master/main.go#L421, to collect basic Go runtime data like heap/memory/GC. Then, using the same client lib, you can track your own metrics wherever you see fit in the code, e.g. https://github.com/fabric8-services/fabric8-wit/tree/master/metric"

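A small sketch of the pattern described in the quote above, using the Prometheus Go client library: expose /metrics via promhttp (the default registry already reports Go runtime data such as heap, GC, and goroutines) and register a custom metric wherever it fits, here a counter for webhook deliveries (metric f). The metric name, label, and port are illustrative and not taken from fabric8-wit:

```go
// Sketch of the client-library pattern referenced above: expose /metrics with
// promhttp (the default registry already includes Go runtime collectors for
// heap/GC/goroutines) and register custom metrics wherever they fit.
// The metric name, label, and port are illustrative, not taken from fabric8-wit.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var webhooksReceived = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "osio_webhooks_received_total",
		Help: "Webhook deliveries received, by source.",
	},
	[]string{"source"},
)

func init() {
	prometheus.MustRegister(webhooksReceived)
}

func webhookHandler(w http.ResponseWriter, r *http.Request) {
	// Metric f: count webhook deliveries and record their source as a label.
	// "github" is hardcoded here; a real handler would derive it from the request.
	webhooksReceived.WithLabelValues("github").Inc()
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/webhook", webhookHandler)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

A scrape of /metrics then returns the Go runtime metrics plus this counter in the Prometheus text format, which is all the infrastructure needs.
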
lordofthejars commented 6 years ago

So I will split this task into the following subtasks:

aslakknutsen commented 6 years ago

First of all, we need to sync to Prometheus

Technically you don't sync to anything. You expose a Prometheus-format endpoint; the rest is taken care of.

aslakknutsen commented 6 years ago

d, e, and f you can get easily via the Jenkins Idler/Proxy and the normal OSD PCP monitoring route.

a and b might already be exposed via Woopra telemetry; check with @qodfathr for access, and with @joshuawilson for tracking the events if they are not.

c from Jenkins/OSO is an unknown route at the moment, but in progress. Previous attempts were not possible due to resource issues on the clusters, but it might be possible now that the Idler is active. Check with @fche on the 'real Prometheus OSO' route for tenants (last I checked it was not ready; only cluster-level data was tracked), and check with @kbsingh and @fche whether the 'old idea' would be a viable option until the 'real' solution is ready. Alternatively, you can avoid all of that and track those metrics via the Jenkins Idler as well, since it should be getting all build events from the cluster.
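
For the last alternative mentioned (tracking metric c via the Jenkins Idler), here is a hypothetical sketch of what that could look like, assuming build events reach the service through some callback; BuildEvent, onBuildEvent, and the metric name are all made up for illustration:

```go
// Hypothetical sketch: count completed pipeline builds by result inside the
// Idler (or any service that sees build events from the cluster).
// BuildEvent, onBuildEvent, and the metric name are made up for illustration.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var buildsCompleted = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "osio_pipeline_builds_completed_total",
		Help: "Completed pipeline builds, by result.",
	},
	[]string{"result"},
)

func init() {
	prometheus.MustRegister(buildsCompleted)
}

// BuildEvent is a stand-in for whatever build notification the service receives.
type BuildEvent struct {
	Namespace string
	Result    string // e.g. "success" or "failure"
}

// onBuildEvent would be called from wherever build events are processed.
func onBuildEvent(e BuildEvent) {
	buildsCompleted.WithLabelValues(e.Result).Inc()
}
```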

lordofthejars commented 6 years ago

Currently all of the original metrics are exposed (proxy and idler metrics). Then there is https://github.com/fabric8-jenkins/jenkins-openshift-base/issues/23, about adding the Prometheus plugin to the Jenkins tenant, where I don't know exactly how to proceed. So maybe this issue could be closed, depending on what we decide in https://github.com/fabric8-jenkins/jenkins-openshift-base/issues/23.

joshuawilson commented 6 years ago

If you want to track user telemetry via Woopra, you should talk to @rahulm0101.