tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq

How to map BDQ tests to CMS #195

Open · debpaul opened this issue 2 years ago

debpaul commented 2 years ago

Greetings BDQ folks. I've been watching all of you hard at work on these tests and assertions.

I'm thinking it would be great if we could capture this collectively, in some sort of standard form from the get-go, so that it's easy both for software developers and for those using the software (or thinking about using it) to evaluate how data fitness is built in, or not.

I'm imagining a google sheet with the

chicoreus commented 2 years ago

@debpaul

> If I (or anyone) want to map the tests (both hard and soft validation) that our CMS does to the BDQ tests and assertions, do you have guidance for that?

I expect that will develop as we develop text for a standard. The formal description of the tests using the language of the framework is intended to be implementation neutral. @tucotuco implemented earlier versions of many of the tests in Python and in SQL; I'm updating Java implementations that are closely linked to the framework. The tests are intended to be applied anywhere in data pipelines, in either quality control or quality assurance roles, so they can go into early data capture tools, into ingest tools for collections management systems, within collections management systems, between collections management systems and aggregators, and can be used by aggregators, or by consumers of data to limit records to those fit for CORE purposes.

Expect a couple of general pieces of guidance. One is that an effective way to use the tests is to run all of the Validations and Measures on a data set, then run the Amendments, then run the Validations and Measures again with all of the changes proposed by the Amendments applied to the data; the differences provide a measure of how much accepting the Amendments may improve the data for CORE purposes. This isn't the sole way to compose the tests, though: they stand independently and can be composed as needed for quality assurance and quality control purposes.
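
To make that loop concrete, here is a minimal, self-contained Java sketch of the pre/post-Amendment comparison. The `Validation` and `Amendment` interfaces and the toy `dwc:eventDate` test are illustrative stand-ins, not types from any BDQ library:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class AmendmentComparison {

    /** One Validation: true when the record is COMPLIANT. */
    interface Validation {
        boolean isCompliant(Map<String, String> record);
    }

    /** One Amendment: proposes changed field values, if any. */
    interface Amendment {
        Optional<Map<String, String>> propose(Map<String, String> record);
    }

    static long countCompliant(List<Validation> validations, Map<String, String> record) {
        return validations.stream().filter(v -> v.isCompliant(record)).count();
    }

    public static void main(String[] args) {
        // Toy stand-ins for real BDQ tests.
        Validation eventDateNotEmpty = r ->
                r.get("dwc:eventDate") != null && !r.get("dwc:eventDate").isEmpty();
        Amendment eventDateFromYear = r ->
                (r.get("dwc:eventDate") == null || r.get("dwc:eventDate").isEmpty())
                        && r.get("dwc:year") != null
                        ? Optional.of(Map.of("dwc:eventDate", r.get("dwc:year")))
                        : Optional.empty();

        Map<String, String> record = new HashMap<>(Map.of("dwc:year", "1880"));

        long before = countCompliant(List.of(eventDateNotEmpty), record);

        // Apply every change proposed by the Amendments to a copy of the record.
        Map<String, String> amended = new HashMap<>(record);
        eventDateFromYear.propose(amended).ifPresent(amended::putAll);

        long after = countCompliant(List.of(eventDateNotEmpty), amended);
        System.out.printf("Compliant Validations: %d before, %d after Amendments%n", before, after);
    }
}
```

The same pattern scales to a whole data set by summing the compliant counts per record before and after the Amendments are applied.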

Another general piece of guidance to expect is that the response from a particular test will consist of three parts: a status indicating whether or not the test could be run, and, where it could not, whether the failure was internal to the data or external (e.g., connectivity); a result containing the structured result of the test; and a comment with non-normative content providing human-readable guidance on why the test returned the result it did.
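
A rough sketch of that shape in Java (the type names here are hypothetical, not classes from the BDQ libraries, though the status values mirror those used in the test descriptions):

```java
// Illustrative model of the three-part response described above.
enum ResponseStatus {
    RUN_HAS_RESULT,                 // the test ran and produced a result
    INTERNAL_PREREQUISITES_NOT_MET, // failure internal to the data (e.g., a needed field is empty)
    EXTERNAL_PREREQUISITES_NOT_MET  // failure external to the data (e.g., a remote service was unreachable)
}

record TestResponse<V>(
        ResponseStatus status, // could the test run, and if not, which kind of failure
        V result,              // structured result, e.g. COMPLIANT/NOT_COMPLIANT for a Validation
        String comment         // non-normative, human-readable explanation of the outcome
) { }
```

A Validation would instantiate `V` with a compliance value, while an Amendment would instantiate it with a set of proposed field changes.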

Development of the tests is occurring in the tdwg/bdq issues, with the comments on the issues being used for rationale management as we refine the test descriptions. Descriptions of the tests expressed in terms of the framework are exported to https://github.com/tdwg/bdq/blob/master/tg2/core/TG2_tests.csv, and a proof-of-concept RDF representation is generated from this at https://github.com/tdwg/bdq/blob/master/tg2/core/TG2_tests.xml. Data to validate implementations of the tests is being worked on in a spreadsheet @Tasilee is maintaining, which is periodically exported to https://github.com/tdwg/bdq/blob/master/tg2/core/TG2_test_validation_data.csv (we are currently in a phase of refining the test definitions and this validation data set, incorporating feedback from running developing implementations of the tests against it).
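
As one hedged example of consuming that export, the sketch below fetches the CSV via the corresponding raw.githubusercontent.com URL and reports its header row and row count. Since the columns are still being refined it deliberately avoids assuming any particular column names, and, using only the JDK, it assumes no embedded newlines inside quoted fields:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FetchTG2Tests {
    public static void main(String[] args) throws Exception {
        // Raw view of the CSV linked above.
        URL csv = new URL("https://raw.githubusercontent.com/tdwg/bdq/master/tg2/core/TG2_tests.csv");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(csv.openStream(), StandardCharsets.UTF_8))) {
            String header = in.readLine();  // column names, subject to change
            long rows = in.lines().count(); // remaining data rows
            System.out.println("Columns: " + header);
            System.out.println("Test rows (approximate): " + rows);
        }
    }
}
```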

I'm actively updating Java implementations of the tests (and cross-checking them against the developing test data set) in the event_date_qc, geo_ref_qc, and sci_name_qc libraries: https://github.com/FilteredPush/event_date_qc, https://github.com/FilteredPush/geo_ref_qc, https://github.com/FilteredPush/sci_name_qc. None of these are ready for production use with the current test specifications, but event_date_qc is getting close. Discussion and refinement of the test descriptions and of the validation data set, along with updating the implementations and running them on the validation data, make this very much a moving target right now.
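
For a flavor of what calling one of these libraries looks like from a CMS, here is a heavily hedged sketch against event_date_qc; `DwCEventDQ` and `DQResponse` exist in that stack, but the exact method name and accessors shown are assumptions to be checked against the javadoc of the library version in use:

```java
// Hypothetical sketch of invoking a single Validation from event_date_qc.
// The method name and accessors below are assumptions; consult the
// library's javadoc before relying on them.
import org.datakurator.ffdq.api.DQResponse;
import org.datakurator.ffdq.api.result.ComplianceValue;
import org.filteredpush.qc.date.DwCEventDQ;

public class EventDateCheck {
    public static void main(String[] args) {
        // Validation-style check that dwc:eventDate is not empty.
        DQResponse<ComplianceValue> response =
                DwCEventDQ.validationEventdateNotEmpty("1880-05-08"); // assumed method name
        System.out.println(response.getResultState()); // e.g. RUN_HAS_RESULT
        System.out.println(response.getValue());       // e.g. COMPLIANT
        System.out.println(response.getComment());     // human-readable rationale
    }
}
```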

> Do you have such CMS use cases lined up (TaxonWorks, Arctos, Specify, Brahms, Symbiota, VertNet, homegrown, etc.)?

I'm actively integrating Java implementations of the tests in MCZbase; see https://github.com/MCZbase/MCZbase/blob/master/dataquality/component/functions.cfc. This works with the event_date_qc, sci_name_qc, and geo_ref_qc Java libraries installed as libraries for ColdFusion. This implementation should be very straightforward to add to Arctos.

> If you do have these lined up or plan to do in the future, please point me to documentation and put TW name on the list.

TG2 had commitments from GBIF, iDigBio, and ALA to implement and embed the tests in their systems. @ArthurChapman as convener of the BDQ IG can comment on those.

debpaul commented 2 years ago

Thanks @chicoreus - I appreciate the detailed response. I'd like to see if any (all?) CMS developers would be interested in implementing these tests "closer to home." This would automatically ensure greater fitness from the get-go at the aggregator level. And it would provide a transparent metric and target for software developers to implement locally.

At some point in the future, I'd also appreciate guidance with mapping (in other words, who will help me if I have questions?). I intend to map what we already do in TW to these BDQ tests.