openactive-archive / conformance-services

Harvests and normalises OpenActive Opportunity feeds to a common representation
MIT License

After validating normalised data, how should we produce aggregate stats for the status site? #14

Open odscjames opened 4 years ago

odscjames commented 4 years ago

We put normalised data through https://github.com/openactive/data-model-validator and store the results in the database.

Then, how should we produce aggregate stats for the status site for each publisher?

In https://github.com/openactive/data-model-validator/issues/349 I noted there can be different values of "severity" for instance - should we filter some of those out?

Ultimately, what does the user want to see on the status page when considering validation stats?

Thanks

thill-odi commented 4 years ago

We'll go for four categories: 'Conformant', 'Core', 'Accessibility', and 'Social Prescribing'. The profiles for each of these consist essentially of a list of attributes; testing for these will involve

(i) establishing whether an attribute is populated, and (ii) whether its value has the correct datatype (or matches a particular regex, if we're feeling fancy).

We can check this using JSON-LD.
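As a rough sketch of the profile test described above: treat a profile as a list of attributes, each with a datatype check, and a record satisfies the profile only if every attribute is populated with a value of the right type. The attribute names and checks below are illustrative, not the actual profile definitions.

```javascript
// Hypothetical profile: each entry names an attribute and a datatype check.
const exampleProfile = [
  { attribute: 'name', check: (v) => typeof v === 'string' },
  { attribute: 'accessibilitySupport', check: (v) => Array.isArray(v) },
];

// A record satisfies a profile only if every listed attribute is populated
// with a value passing its datatype check.
function satisfiesProfile(record, profile) {
  return profile.every(({ attribute, check }) => {
    const value = record[attribute];
    return value !== undefined && value !== null && check(value);
  });
}

const record = {
  name: 'Tuesday Badminton',
  accessibilitySupport: [{ '@type': 'Concept' }],
};
const satisfied = satisfiesProfile(record, exampleProfile); // true
```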

The end output will be a percentage value for the number of records that satisfy these conditions. In other words, we'll want to display four columns on the status page. If there are 100 records, and:

... then we should end up with a series of columns next to the dataset link with '100', '80', '60', '3'

I don't think there's any value in weighting particular attributes or counting partial satisfaction of the profiles. That way lies madness.

thill-odi commented 4 years ago

Sorry - have just realised, after discussion with @nickevansuk, that this really only deals with items after normalisation. Stats should ideally also be kept of items failing validation prior to normalisation - again, expressed as a percentage, and with warnings left out of the count.

In an ideal world, a list of the individual items failing validation would also be kept and linked from the validation page as an aid to data users.

nickevansuk commented 4 years ago

To add some further detail to this, you'll want to filter on something like severity === "failure" so that only validation errors are counted (and warnings are ignored)

See validator integration from test suite for more info: https://github.com/openactive/openactive-test-suite/blob/master/packages/openactive-integration-tests/test/shared-behaviours/validation.js#L94
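That filter can be sketched as below; the result objects follow the shape the data-model-validator produces (each result carries a `severity` field), though the messages here are made up.

```javascript
// Example validator output for one record (messages are illustrative).
const validationResults = [
  { severity: 'failure', message: 'Required field is missing' },
  { severity: 'warning', message: 'Recommended field is missing' },
  { severity: 'suggestion', message: 'Consider adding a field' },
];

// Keep only genuine errors; warnings, notices and suggestions are ignored
// for the purposes of the stats.
const errorsOnly = validationResults.filter((r) => r.severity === 'failure');
// errorsOnly.length === 1
```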

nickevansuk commented 4 years ago

To provide the "list of the individual items failing validation ... linked from the validation page as an aid to data users" that @thill-odi mentions above, one option is to construct a link to the validator that includes the specific item in the feed, as follows: https://validator.openactive.io/?url={url}&rpdeId={rpdeId}

For example:

Note that the validator only validates the first 10 non-deleted items in any RPDE page that's provided, so the rpdeId parameter is required to ensure the item in question is validated by the online validator.
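The link template above can be sketched as a small helper; the feed page URL and item id below are made-up values, and the query parameters need URL-encoding so that the feed URL survives being embedded in the link.

```javascript
// Build a deep link to the online validator for one RPDE item.
function validatorLink(feedPageUrl, rpdeId) {
  return (
    'https://validator.openactive.io/?url=' +
    encodeURIComponent(feedPageUrl) +
    '&rpdeId=' +
    encodeURIComponent(rpdeId)
  );
}

const link = validatorLink('https://example.com/feeds/slots', 'abc123');
// → https://validator.openactive.io/?url=https%3A%2F%2Fexample.com%2Ffeeds%2Fslots&rpdeId=abc123
```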

When you're using the validator programmatically, RPDE items should be validated individually (i.e. the data of the item should be validated, rather than the whole RPDE page validated), to ensure that all items are validated.
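A minimal sketch of that per-item approach, assuming the RPDE page shape (an `items` array where each item has `id`, `state`, and `data`): skip deleted items, and validate each item's `data` on its own. `validateItem` is a stub standing in for the data-model-validator call, which in the real service would return the validator's result array for one item.

```javascript
// Stub for the validator call: flags items whose data lacks an @type.
// In the real service this would be the data-model-validator.
const validateItem = (data) =>
  data && data['@type'] ? [] : [{ severity: 'failure', message: 'Missing @type' }];

// Validate each non-deleted item's data individually, rather than
// validating the whole RPDE page in one go.
function validatePage(rpdePage) {
  return rpdePage.items
    .filter((item) => item.state !== 'deleted')
    .map((item) => ({ id: item.id, results: validateItem(item.data) }));
}

const page = {
  items: [
    { id: '1', state: 'updated', data: { '@type': 'SessionSeries' } },
    { id: '2', state: 'deleted' },
    { id: '3', state: 'updated', data: {} },
  ],
};
const perItem = validatePage(page); // two entries: items '1' and '3'
```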

odscjames commented 4 years ago

Thanks for the many replies - this touches on a lot and is interesting. I'm going to move a number of things out to other issues, though, and be strict about keeping this on track with the original question. Hope that's ok.

We'll go for four categories: 'Conformant', 'Core', 'Accessibility', and 'Social Prescribing'. The profiles for each of these consist essentially of a list of attributes; testing for these will involve

So 'Conformant' is the result from the validation library, and the other three are data profiles?

Because these come from different mechanisms, I'd like to deal with them differently - I'll deal with data profiles in another ticket soon.

In an ideal world, a list of the individual items failing validation would also be kept and linked from the validation page as an aid to data users.

one option is that a specific link to the validator which includes the item in the feed can be constructed as follows: https://validator.openactive.io/?url={url}&rpdeId={rpdeId}

Moved to https://github.com/openactive/conformance-status-page/issues/4

Stats should ideally also be kept of items failing validation prior to normalisation - again, expressed as a percentage.

To be clear:

We should be running the validation library against the raw data we download, the un-normalised data? And calculating stats for that.

So, we take the results and filter ...

Can you be clear which one it is?

Then, on the status page, show a % of how many records pass - i.e. have no validation library results against them after filtering.

When calculating the % and counting the total records, should it be total all records or just total of records that aren't deletes? Probably the latter I assume.

nickevansuk commented 4 years ago

Filtering on severity === "failure" is the one. By removing "warnings" Tim meant removing warning, notice, and suggestion (which are all classed as "warnings" in the OpenActive Test Suite).

nickevansuk commented 4 years ago

Also on the other points:

We should be running the validation library against the raw data we download, the un-normalised data? And calculating stats for that.

Yes, so that publishers can fix issues

When calculating the % and counting the total records, should it be total all records or just total of records that aren't deletes? Probably the latter I assume.

Suggest ignoring deleted records
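Putting the answers in this thread together, the stat could be computed as sketched below: a record passes if it has no severity === "failure" results after filtering, and the percentage is taken over non-deleted records only. The record shape here is assumed for illustration.

```javascript
// Percentage of non-deleted records with zero "failure" validation results.
function conformancePercentage(records) {
  const live = records.filter((r) => r.state !== 'deleted');
  if (live.length === 0) return 100; // nothing to fail against
  const passing = live.filter(
    (r) => !r.validationResults.some((v) => v.severity === 'failure')
  );
  return Math.round((passing.length / live.length) * 100);
}

const pct = conformancePercentage([
  { state: 'updated', validationResults: [] },
  { state: 'updated', validationResults: [{ severity: 'warning' }] }, // still passes
  { state: 'updated', validationResults: [{ severity: 'failure' }] },
  { state: 'deleted', validationResults: [] }, // excluded from the total
]);
// pct === 67  (2 of 3 live records pass)
```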

odscjames commented 4 years ago

Q: In the case where a publisher has multiple feeds (e.g. https://onlinebooking.1610.org.uk/OpenActive/ Slot, FacilityUse, ...), should we calculate the stat per publisher, or per feed for that publisher?

nickevansuk commented 4 years ago

Not sure how it's presented - I guess it depends on the UI. "% of data published that is conformant" would work per-publisher, but as in https://github.com/openactive/conformance-status-page/issues/4 they need to get to the detail of which feeds have errors (so a % conformance per-feed could be useful?), plus example pages/items within the feeds that exhibit the errors.

Ideally we want the headline of every publisher being 100% conformant (though this is unlikely to be the case on day 1 of this tool going live)

robredpath commented 4 years ago

Validate normalised data against profiles; expose results in API; display on status page