inbo / whip

✅ Human and machine-readable syntax to express specifications for data
MIT License

whip vs DQ IG Tests and assertions #13

Open peterdesmet opened 7 years ago

peterdesmet commented 7 years ago

Lee Belbin sent an email on March 3 to the TDWG Biodiversity Data Quality IG regarding the work of WG2: Data Quality Tests and Assertions:

A select group of TDWGians (highlighted on the Members worksheet on the link below) have invested time to produce what I am calling a core suite of standard tests and associated assertions that can be applied to occurrence records. These tests are to help identify potential occurrence record issues.

Why are we doing this? Largely to try to better align Data Publishers/Data Aggregators/Biodiversity Research Infrastructures/Data Custodians, and hopefully anyone who generates occurrence data. Users would appreciate consistency. A practical example: Merging records from say GBIF and the ALA etc would be greatly facilitated if they both applied the same set of tests.

As a start and to keep it simple, these tests are based on one or more Darwin Core terms. We realize that tests could be applied to all Darwin Core terms, but we wanted a core set that would cover the significant terms that could be implemented relatively easily by all.

The spreadsheet can be found at https://tinyurl.com/h49zwof. Note that this spreadsheet contains a series of worksheets. Please start by reviewing the Principles as they provide a context for what has been learnt during the process.

I'm not familiar with this output, but @cgendreau @tucotuco you're both highlighted as contributing members for this. Would you care to explain the scope of these tests/assertions vs whip? How are the approaches different and what is the chance we're duplicating efforts?

timrobertson100 commented 7 years ago

From what I know of both, they are certainly working in similar spaces, so there are parallel thought processes going on. The TDWG work is targeting a vocabulary and a set of tests that apply specifically to DwC terms (started around 3 years ago), with the hope that it will be picked up by GBIF, VertNet and others and applied consistently. The TDWG work will likely not go as far as defining an implementation approach, leaving that as an exercise for the reader. Whip goes a lot further and strictly defines a structure for implementation, but it is not tied to DwC in any way, nor does it define tests itself. My understanding is that an implementation of the TDWG TG outputs could be produced in Whip or in another tool, e.g. EBay Griffin.
For info: GBIF will be trying to work primarily with ALA and hopefully others to standardise interpretation of data in the ingestion routines and will likely use something yet to be decided but compatible with Apache Spark.
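
To make the distinction above concrete, here is a minimal Python sketch of the split between a declarative specification and the engine that applies it. It is an illustration only: the rule names (`allowed`, `min`, `max`), the `SPEC` layout and the `validate` helper are assumptions made for this example and are not claimed to match whip's actual syntax or the TDWG test definitions. The only things taken from the discussion are the idea that the tests target Darwin Core terms and that the structure for applying them is a separate concern.

```python
from typing import Any

# Hypothetical rule checkers: each takes the raw field value and the
# rule argument from the specification, and returns True when it passes.
def check_allowed(value: str, allowed: list) -> bool:
    return value in allowed

def check_min(value: str, minimum: float) -> bool:
    try:
        return float(value) >= minimum
    except ValueError:
        return False  # non-numeric or empty values fail numeric rules

def check_max(value: str, maximum: float) -> bool:
    try:
        return float(value) <= maximum
    except ValueError:
        return False

# The generic "structure" part: rule names mapped to implementations.
RULES = {"allowed": check_allowed, "min": check_min, "max": check_max}

# The "tests" part: a declarative specification keyed by Darwin Core terms.
# The values chosen here are examples, not an agreed standard.
SPEC = {
    "basisOfRecord": {"allowed": ["HumanObservation", "PreservedSpecimen"]},
    "decimalLatitude": {"min": -90, "max": 90},
}

def validate(record: dict, spec: dict) -> list:
    """Return a list of (term, rule_name) pairs for every failed rule."""
    failures = []
    for term, rules in spec.items():
        value = record.get(term, "")  # a missing term is treated as empty
        for rule_name, arg in rules.items():
            if not RULES[rule_name](value, arg):
                failures.append((term, rule_name))
    return failures

good = {"basisOfRecord": "HumanObservation", "decimalLatitude": "51.2"}
bad = {"basisOfRecord": "Fossil", "decimalLatitude": "123"}

print(validate(good, SPEC))
print(validate(bad, SPEC))
```

In this sketch, a shared test suite would correspond to agreeing on the contents of `SPEC`, while the choice of engine (`RULES` and `validate` here) stays open, which is the separation described in the comment above.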