kostmo / circleci-failure-tracker

A log analyzer for CircleCI. Note that this project is now hosted at pytorch/dr-ci
https://github.com/pytorch/dr-ci

A log analyzer for CircleCI

Intro

An organization would like to determine the most common causes of intermittent build failures and flaky tests in a repository, so that effort to fix them can be prioritized.

Outputs

The Dr. CI project entails two distinct user-facing outputs:

The latter has several distinct utilities:

Codebase

See docs/CODEBASE-OVERVIEW.md.

Repository assumptions

Dr. CI assumes a linear history of the master branch. This can be enforced on GitHub via the following setting under the "Branches" -> "Branch protection rule" section for master:

GitHub setting
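As a rough illustration of what "linear history" means here, the following sketch checks that no commit reachable from the branch head has more than one parent. The `parents_by_commit` mapping and the function name are hypothetical and are not part of Dr. CI; in practice the parent data would come from Git or the GitHub API.

```python
def is_linear_history(parents_by_commit, head):
    """Return True if every commit reachable from `head` has at most one parent.

    `parents_by_commit` is a hypothetical mapping of commit SHA -> list of
    parent SHAs, used here only for illustration.
    """
    current = head
    while current is not None:
        parents = parents_by_commit.get(current, [])
        if len(parents) > 1:
            # A merge commit: the history forks here, so it is not linear.
            return False
        current = parents[0] if parents else None
    return True


# A straight chain a <- b <- c is linear.
linear = {"c": ["b"], "b": ["a"], "a": []}
# A merge commit "m" with two parents is not.
merged = {"m": ["b", "x"], "b": ["a"], "x": ["a"], "a": []}
```

On a local clone, `git rev-list --merges master` lists merge commits; an empty result corresponds to a linear history.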

Functionality

This tool obtains a list of CircleCI builds run against the master branch of a GitHub repository, downloads their logs (stripped of ANSI escape codes) from AWS, and scans the logs for a predefined list of labeled patterns (regular expressions).

These patterns are curated by an operator. The frequency of occurrence of each pattern is tracked and presented in a web UI.
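The scanning step described above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual implementation: the pattern labels, the regular expressions, and all function names are assumptions made up for the example.

```python
import re
from collections import Counter

# Matches CSI-style ANSI escape sequences such as "\x1b[31m" (colors, cursor moves).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")

# Hypothetical labeled patterns; the real list is curated by an operator.
PATTERNS = {
    "out-of-memory": re.compile(r"std::bad_alloc|MemoryError"),
    "network-timeout": re.compile(r"Connection timed out"),
}


def strip_ansi(text):
    """Remove ANSI escape sequences from a log."""
    return ANSI_ESCAPE.sub("", text)


def scan_log(log_text, patterns=PATTERNS):
    """Return (label, line_number, line) for every pattern match in one log."""
    matches = []
    for line_number, line in enumerate(strip_ansi(log_text).splitlines(), start=1):
        for label, regex in patterns.items():
            if regex.search(line):
                matches.append((label, line_number, line))
    return matches


def pattern_frequencies(logs, patterns=PATTERNS):
    """Count, per label, how many build logs contain at least one match.

    `logs` maps a build id to its raw log text.
    """
    counts = Counter()
    for log_text in logs.values():
        # Use a set so that a label is counted once per build.
        counts.update({label for label, _, _ in scan_log(log_text, patterns)})
    return counts
```

Counting once per build rather than once per matching line is a design choice of this sketch; either tally could back a frequency view in a UI.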

The database tracks which builds have been already scanned for a given pattern, so that scanning may be performed incrementally or resumed after abort.
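One way such bookkeeping could work is sketched below with an in-memory SQLite table. The table name, column names, and functions are hypothetical and do not reflect the project's real schema; the point is only that recording each (build, pattern) pair as it completes makes a rerun skip finished work.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical bookkeeping table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scanned ("
        " build_id INTEGER, pattern_id INTEGER,"
        " PRIMARY KEY (build_id, pattern_id))"
    )
    return conn


def unscanned_builds(conn, build_ids, pattern_id):
    """Return the builds that have not yet been scanned for this pattern."""
    rows = conn.execute(
        "SELECT build_id FROM scanned WHERE pattern_id = ?", (pattern_id,)
    )
    done = {row[0] for row in rows}
    return [b for b in build_ids if b not in done]


def scan_incrementally(conn, build_ids, pattern_id, scan_one):
    """Scan only the outstanding builds, recording each one as it finishes."""
    scanned_now = []
    for build_id in unscanned_builds(conn, build_ids, pattern_id):
        scan_one(build_id, pattern_id)
        conn.execute(
            "INSERT INTO scanned (build_id, pattern_id) VALUES (?, ?)",
            (build_id, pattern_id),
        )
        # Commit per build so an abort loses at most the build in progress.
        conn.commit()
        scanned_now.append(build_id)
    return scanned_now
```

Calling `scan_incrementally` a second time with a longer build list scans only the new builds, and adding a new pattern id causes every build to be scanned once for that pattern.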

Tool workflow

Known Problem reporting

Requiring that failures in the master branch be annotated will facilitate tracking of the frequency of "brokenness" of master over time, and allow measurement of whether this metric is improving.

It is also possible to mark only specific jobs of a commit as "known broken", e.g. the Travis CI Lint job.

Log scanning data flow diagram

flow diagram

Deployment

Development Environment Setup

See: docs/development-environment

AWS dependencies and deployment

See: docs/aws

Ingestion overview

  1. A small webservice (named gh-notification-ingest-env in Elastic Beanstalk, and hosted at domain github-notifications-ingest.pytorch.org) receives GitHub webhook notifications and stores them (synchronously) in a database.
  2. A periodic (3-minute interval) AWS Lambda task EnqueSQSBuildScansFunction queries for unprocessed notifications in the database, and enqueues an SQS message for each of them.
  3. Finally, an Elastic Beanstalk Worker-tier server named log-scanning-worker processes the SQS messages as capacity allows.

We want a cool-off period during which multiple builds for a given commit can be aggregated into one task for that commit. This is accomplished via an SQS deduplicating queue, where multiple instances of the same commit are consolidated while in the queue.
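The consolidation behavior can be modeled locally as below. This is an in-memory stand-in written for illustration only: it does not call SQS, and the class and method names are hypothetical. It assumes the commit SHA serves as the deduplication key.

```python
from collections import deque


class DedupQueue:
    """In-memory model of a queue that holds at most one task per commit."""

    def __init__(self):
        self._order = deque()   # commit SHAs in arrival order
        self._pending = set()   # SHAs currently waiting in the queue

    def enqueue(self, commit_sha):
        """Queue a scan task; return False if the commit is already pending."""
        if commit_sha in self._pending:
            return False
        self._pending.add(commit_sha)
        self._order.append(commit_sha)
        return True

    def dequeue(self):
        """Pop the oldest pending commit, or return None if the queue is empty."""
        if not self._order:
            return None
        commit_sha = self._order.popleft()
        self._pending.discard(commit_sha)
        return commit_sha

    def __len__(self):
        return len(self._order)
```

In this model, several build notifications for the same commit that arrive before the worker picks up the task produce a single queue entry, and the worker then handles all builds for that commit in one pass.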

Optimizations

Other Features

Source attribution

The Aho-Corasick implementation is taken from alfred-margaret: https://github.com/channable/alfred-margaret