jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

Monitor builds on our private instances (trusted.ci.jenkins.io / infra.ci.jenkins.io / release.ci.jenkins.io) #2843

Open dduportal opened 2 years ago

dduportal commented 2 years ago

The (private) Jenkins controller trusted.ci.jenkins.io has some jobs that regularly execute important tasks on some of the Jenkins or Jenkins Infra repositories, such as:

There have been numerous times when these builds failed to execute, or even failed to be scheduled, which led to outdated artefacts or websites, slowing down (or blocking) users.

We need monitoring for these important jobs so that the team is alerted quickly enough and can notify users proactively, as with any major production incident.

The challenge is that this Jenkins controller is a private one, so we need to control what information is exported (e.g. "simple notifications or GitHub status checks" are risky; read #2834 if you do not agree :) ).

dduportal commented 2 years ago

Proposal by @daniel-beck after we asked him for help and advice:

If you do not want to provide credentials to monitoring, create a new job on trusted.ci that periodically publishes a JSON file to reports.jenkins.io with information about the other jobs on that instance. Monitoring:

  • Check the file timestamp. If older than 2x build interval, watchdog is dead (if you can't, put the current UTC time during generation into the file as a field and parse that).
  • Check contents for each of the jobs, in whatever format you want to provide information on them.

That's the 5 minute hack solution, there are probably better ones depending on what monitoring tools we use and what they can do. Here I assume Last-Modified support (if possible) and JSON parsing.

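A minimal sketch of the monitoring side of this proposal, assuming the fallback where the generation time is written into the file. The field names (`generated_at`, `jobs`, `id`, `status`) and the function name are hypothetical; nothing in this thread fixes the report format.

```python
import json
from datetime import datetime, timedelta, timezone


def check_report(raw_json, build_interval, now=None):
    """Return a list of problems found in the published report.

    Assumed (hypothetical) report layout:
      {"generated_at": "<ISO 8601 timestamp with UTC offset>",
       "jobs": [{"id": "<public id>", "status": "<latest result>"}]}
    """
    now = now or datetime.now(timezone.utc)
    report = json.loads(raw_json)
    problems = []

    # Watchdog check: a report older than 2x the build interval means the
    # reporting job itself is no longer running.
    # The timestamp must carry an offset (e.g. "+00:00") so it can be
    # compared with the timezone-aware "now".
    generated_at = datetime.fromisoformat(report["generated_at"])
    if now - generated_at > 2 * build_interval:
        problems.append("watchdog is dead: report generated at " + report["generated_at"])

    # Content check: flag every job whose latest build did not succeed.
    for job in report["jobs"]:
        if job["status"] != "SUCCESS":
            problems.append(f"{job['id']}: last build is {job['status']}")

    return problems
```

If the monitoring tool can read the `Last-Modified` header instead, the `generated_at` comparison could be replaced by the same 2x interval test on that header value.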
dduportal commented 2 years ago

Another idea (not mutually exclusive): a notification on an IRC channel with the controller hostname, job name and status. It would at least cover build failures (but not the "unable to schedule builds" case).

MarkEWaite commented 2 years ago

Would it be any easier or more portable to replicate the RSS feeds from trusted.ci.jenkins.io to a publicly visible location?

I've been using the RSS feeds from specific jobs on ci.jenkins.io as a low cost monitoring system for jobs that I select. It is visible as a small icon on my Google Chrome browser.

I use https://feeder.co/ to monitor the job failures RSS feeds like https://ci.jenkins.io/job/Infra/job/acceptance-tests/job/check-agent-availability/rssFailed . That shows a small number on my Google Chrome web browser bar when there is a failure. When I have time and when I notice the failure count, I click the RSS feed and it opens the page with the failure.
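For illustration, a rough sketch of what a feed reader does with such a feed, assuming it is Atom-formatted. The sample document, its URL and the function name are hypothetical and only stand in for a real `rssFailed` response.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Atom feeds.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample feed, not copied from a real Jenkins instance.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example failed builds</title>
  <entry>
    <title>example-job #12 (broken)</title>
    <link rel="alternate" type="text/html" href="https://ci.example.org/job/example-job/12/"/>
  </entry>
</feed>"""


def failed_builds(feed_xml):
    """Return (title, link) pairs, one per entry in the feed."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        link = entry.find(ATOM + "link")
        href = link.get("href") if link is not None else None
        results.append((title, href))
    return results


# The number of entries is what a browser extension would show as a badge.
print(len(failed_builds(SAMPLE)))
```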

I think this may still be more complicated than Daniel's idea of a job on trusted.ci.jenkins.io that exports failures to a public location.

As an angle on Daniel's idea, I have a separate Python script that I use today with my Jenkins test instance to report if a job associated with a resolved Jenkins Jira issue is failing. I may try an experiment to convert that script into a Jenkins job that might be reusable as the type of "inside Jenkins" job monitor that Daniel has described.

daniel-beck commented 2 years ago

Would it be any easier or more portable to replicate the RSS feeds from trusted.ci.jenkins.io to a publicly visible location?

An additional translation layer we control would be useful IMO. E.g. renaming a job on trusted CI should not break monitoring. Exposing history beyond the latest build is likely also unnecessary.
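One possible shape for such a translation layer, as a sketch only: the reporting job keeps an explicit mapping from stable public identifiers to internal job names and exports nothing but the latest result for mapped jobs. All names below are hypothetical.

```python
def build_public_entries(latest_builds, public_to_internal):
    """Translate internal job results into public report entries.

    latest_builds: internal job name -> result of its latest build.
    public_to_internal: stable public id -> internal job name.
    Jobs absent from the mapping are never exported, and only the latest
    result is included (no build history).
    """
    entries = []
    for public_id, internal_name in sorted(public_to_internal.items()):
        result = latest_builds.get(internal_name)
        entries.append({
            "id": public_id,
            # A mapped job with no known build is reported explicitly,
            # so a rename that was not reflected in the mapping is visible.
            "status": result if result is not None else "UNKNOWN",
        })
    return entries
```

With this layout, renaming a job on the controller only requires updating one value in the mapping; the public identifier that monitoring relies on stays the same.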

lemeurherve commented 8 months ago

@daniel-beck I opened https://github.com/jenkins-infra/infra-reports/pull/62, could you give me your opinion about it when you have some time please?