openactive / data-model-validator

The OpenActive data model validator library
MIT License

feat: Data quality measures #420

Open nickevansuk opened 1 year ago

nickevansuk commented 1 year ago

This PR introduces data quality measures, based on validator results.

Measures are defined by “exclusions”, which are references to specific types of validator errors. When a measure is calculated, an item is counted towards the measure’s total unless it is “excluded” by one of the validator errors referenced by the measure’s “exclusions”.

For example:

{
  name: 'Has a name',
  description: 'The name of the opportunity is essential for a participant to understand what the activity is',
  exclusions: [
    {
      errorType: [
        ValidationErrorType.MISSING_REQUIRED_FIELD,
      ],
      targetFields: {
        Event: ['name'],
        FacilityUse: ['name'],
        IndividualFacilityUse: ['name'],
        CourseInstance: ['name'],
        EventSeries: ['name'],
        HeadlineEvent: ['name'],
        SessionSeries: ['name'],
        Course: ['name'],
      },
    },
  ],
},

In the above measure, an item will not be counted towards the total and percentage if the “MISSING_REQUIRED_FIELD” validator error is present for the “name” field on any of the listed types.
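To make the calculation concrete, here is a minimal sketch (not the PR's implementation) of how a measure's count and percentage could be derived from per-item validator results. The helper names and the error shape used here ({ type, rootType, field }) are assumptions for illustration only:

// Hypothetical helper: true if any validator error raised against the item matches
// one of the measure's exclusions, by error type and by target field for the item's type.
function isExcluded(measure, validatorErrors) {
  return measure.exclusions.some((exclusion) =>
    validatorErrors.some((error) =>
      exclusion.errorType.includes(error.type)
      && (exclusion.targetFields[error.rootType] || []).includes(error.field)
    )
  );
}

// Hypothetical helper: counts the items not excluded by the measure, and expresses
// that count as a percentage of all items seen.
function calculateMeasure(measure, items) {
  // Each item is assumed to be of the form { data, validatorErrors }, where
  // validatorErrors are the errors the validator raised against that item.
  const passing = items.filter((item) => !isExcluded(measure, item.validatorErrors));
  return {
    name: measure.name,
    count: passing.length,
    total: items.length,
    percentage: items.length === 0 ? null : Math.round((passing.length / items.length) * 100),
  };
}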

The advantages of this approach are that the complex inheritance rules respected by the validator are implicitly taken into account, and that more complex validation rules, such as activity list matching, are easily included without duplicating any logic. Tests can also easily be written for complex rules, as the validator already provides a framework for this.
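For example, a measure could be unit-tested in the describe/it style the validator's existing specs use. The test below exercises the hypothetical calculateMeasure helper sketched above against the “Has a name” measure (referred to here as hasANameMeasure), using the same assumed error shape:

describe('Has a name measure', () => {
  it('excludes items with a MISSING_REQUIRED_FIELD error on name', () => {
    const items = [
      // Valid item: no relevant errors, so it counts towards the measure
      { validatorErrors: [] },
      // Invalid item: excluded via the measure's exclusion on the Event 'name' field
      { validatorErrors: [{ type: ValidationErrorType.MISSING_REQUIRED_FIELD, rootType: 'Event', field: 'name' }] },
    ];
    const result = calculateMeasure(hasANameMeasure, items);
    expect(result.count).toBe(1);
    expect(result.total).toBe(2);
    expect(result.percentage).toBe(50);
  });
});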

This increases maintainability, flexibility, and consistency of results across tools. The approach is also extensible, and encourages the creation of new data quality rules in the validator as data quality measures become more in-depth: this has the advantage of surfacing errors at a more detailed level within the various OA tools, as well as providing a high-level summary.

Measures are defined within “profiles”, which allow distinct subsets of measures to be defined for different use cases (e.g. accessibility).
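As a sketch of what that grouping might look like (the property names below are assumptions, not the PR's actual schema), a profile could simply be a named list of measures:

// Illustrative only: each profile names a use case and lists the measures that belong to it.
// hasANameMeasure refers to the example measure defined earlier in this description.
const profiles = {
  core: {
    name: 'Core data quality',
    measures: [
      hasANameMeasure,
      // ...further measures, e.g. 'Has a description', 'Has a valid postcode'
    ],
  },
  accessibility: {
    name: 'Accessibility',
    measures: [
      // ...measures specific to accessibility data
    ],
  },
};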

Measures are defined within this repository, so that they can be used within both the Validator GUI and the Test Suite, and be maintained alongside the validation rules on which they depend.
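As a rough consumption sketch, a tool such as the Validator GUI or Test Suite could combine the validator with these measures roughly as below. The validate export is the library's existing entry point, but the promise-returning call shown here is an assumption, and calculateMeasure and profiles are the illustrative helpers above (mapping the real validator error objects to the assumed { type, rootType, field } shape is omitted):

const { validate } = require('@openactive/data-model-validator');

// Hypothetical summary function: run the validator over each feed item, then
// apply every measure in the chosen profile to the validated results.
async function summariseFeed(items, profile) {
  const validated = await Promise.all(items.map(async (item) => ({
    data: item,
    validatorErrors: await validate(item),
  })));
  return profile.measures.map((measure) => calculateMeasure(measure, validated));
}

// e.g. summariseFeed(feedItems, profiles.core) would resolve to one
// { name, count, total, percentage } summary per measure.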

(Note that this PR is in draft, and requires some refactoring and tidying up before merging)

Screenshot of unstyled results below:

[Screenshot 2023-04-05 at 09 52 21]

Open questions:

howaskew commented 1 year ago

Here's an example output from my work via the visualiser...

[Screenshot 2023-04-05 at 12 38 37]

The idea is a simple, intuitive, visual summary of the smaller set of DQ metrics discussed at W3C. It's a stepping stone into the detail in the validator report.

nickevansuk commented 1 year ago

@howaskew looks great! Postcode validation is a great example of a rule that would be helpful in the validator too (centralising logic etc).

It's cool having it visible in the visualiser, as data users might be browsing feeds there. I'm thinking about whether setting the validator up to build as a lightweight client-side library might give us the best of both worlds: centralising logic while still having the view in the visualiser...

Or, even easier, we could just store nightly DQ reports, embed them in a tab on the visualiser, and reference them on the status page. That might be even better: one pre-cached source of truth.