tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq

TG2-VALIDATION_MAXELEVATION_INRANGE #112

Open tucotuco opened 6 years ago

tucotuco commented 6 years ago
| TestField | Value |
| --- | --- |
| GUID | c971fe3f-84c1-4636-9f44-b1ec31fd63c7 |
| Label | VALIDATION_MAXELEVATION_INRANGE |
| Description | Is the value of dwc:maximumElevationInMeters of a single record within a valid range? |
| TestType | Validation |
| Darwin Core Class | dcterms:Location |
| Information Elements ActedUpon | dwc:maximumElevationInMeters |
| Information Elements Consulted | |
| Expected Response | INTERNAL_PREREQUISITES_NOT_MET if dwc:maximumElevationInMeters is bdq:Empty or the value cannot be interpreted as a number; COMPLIANT if the value of dwc:maximumElevationInMeters is within the range of bdq:minimumValidElevationInMeters to bdq:maximumValidElevationInMeters inclusive; otherwise NOT_COMPLIANT |
| Data Quality Dimension | Conformance |
| Term-Actions | MAXELEVATION_INRANGE |
| Parameter(s) | bdq:minimumValidElevationInMeters; bdq:maximumValidElevationInMeters |
| Source Authority | bdq:minimumValidElevationInMeters default = "-430"; bdq:maximumValidElevationInMeters default = "8850" |
| Specification Last Updated | 2023-09-18 |
| Examples | [dwc:maximumElevationInMeters="0": Response.status=RUN_HAS_RESULT, Response.result=COMPLIANT, Response.comment="dwc:maximumElevation is in range"]; [dwc:maximumElevationInMeters="-500": Response.status=RUN_HAS_RESULT, Response.result=NOT_COMPLIANT, Response.comment="dwc:maximumElevation is not in range, i.e. is <-430"] |
| Source | ALA, GBIF |
| References | |
| Example Implementations (Mechanisms) | |
| Link to Specification Source Code | |
| Notes | We have rounded up the Parameter values. We are aware of sub-ice elevations in Antarctica to -3,500m and possible sampling in the atmosphere above the elevation of the top of Mt Everest that would fail this test, but we support the odd false positive. |

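
As a rough illustration of the Expected Response above, here is a minimal Python sketch. It is not the reference implementation: the `Response` class and the function name are hypothetical, and three choices are assumptions rather than anything stated in the specification: whitespace-only strings are treated as bdq:Empty, non-finite values such as "NaN" are treated as not interpretable as a number, and INTERNAL_PREREQUISITES_NOT_MET is returned as the status with no result. The defaults are the ones listed under Source Authority.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Response:
    """Hypothetical container mirroring Response.status/result/comment."""
    status: str
    result: Optional[str]
    comment: str


def validation_maxelevation_inrange(
    maximum_elevation_in_meters: Optional[str],
    minimum_valid_elevation_in_meters: float = -430.0,
    maximum_valid_elevation_in_meters: float = 8850.0,
) -> Response:
    # Assumption: None and whitespace-only strings both count as bdq:Empty.
    text = (maximum_elevation_in_meters or "").strip()
    if not text:
        return Response("INTERNAL_PREREQUISITES_NOT_MET", None,
                        "dwc:maximumElevationInMeters is empty")
    try:
        elevation = float(text)
    except ValueError:
        elevation = math.nan
    # Assumption: NaN and infinities are not "interpretable as a number".
    if not math.isfinite(elevation):
        return Response("INTERNAL_PREREQUISITES_NOT_MET", None,
                        "dwc:maximumElevationInMeters is not a number")
    # The range check is inclusive at both ends, as the specification states.
    if minimum_valid_elevation_in_meters <= elevation <= maximum_valid_elevation_in_meters:
        return Response("RUN_HAS_RESULT", "COMPLIANT",
                        "dwc:maximumElevationInMeters is in range")
    return Response("RUN_HAS_RESULT", "NOT_COMPLIANT",
                    "dwc:maximumElevationInMeters is not in range")


print(validation_maxelevation_inrange("0").result)     # COMPLIANT
print(validation_maxelevation_inrange("-500").result)  # NOT_COMPLIANT
```

The two calls at the end reproduce the two rows under Examples.
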
ArthurChapman commented 6 years ago

Likeliness in Data Quality Dimension altered to Likelihood

tucotuco commented 2 years ago

Added Description "A test to determine if the value of dwc:maximumElevationInMeters is within a valid range." as a strawman pattern to describe a simple spatial validation test.

ArthurChapman commented 2 years ago

This comes back to the earlier strawmen #163, #162. It was asked whether we use, for example, dwc:maximumElevationInMeters, or convert to simple English, for example:

2). "A test to determine if the value of the Maximum Elevation is within a valid range"

Do we wish to say what type of test e.g.

3). "A validation test to determine if the value of the Maximum Elevation is within a valid range"

@chicoreus suggests we should add "of a single record" - I am not quite convinced that this is necessary as all our tests are single record except for the MEASUREs.

4). "A validation test to determine if the value of the Maximum Elevation of a single record is within a valid range"

chicoreus commented 2 years ago

@ArthurChapman yes, even though the core tests only cover single records, it is important that we include that in the description, as the description is of the Criterion in Context for Validations (ContextualizedCriterion in the RDF), which has three components: the Criterion, the InformationElement, and the Resource type. Similarly for ContextualizedEnhancement, ContextualizedDimension, and ContextualizedIssue.

We don't need to include the test type, but it may be clearer if we do, though I would suggest in a verb form for clarity, e.g.

5) Validate that the maximum elevation of a single record is within a valid range.

Perhaps patterns like these:

Validate that the {information element} of a single record is { criterion }

Propose an amendment for the {information element } of a single record { enhancement } if { conditions to amend }

Measure the { dimension } of the {information element } of a single record.

Check for an issue in the { information element } of a single record where { criterion }
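
The four patterns above can be read as fill-in templates. A small Python sketch, purely illustrative (the dictionary keys, placeholder names and the `describe` helper are hypothetical, not part of the framework):

```python
# Templates transcribed from the four patterns; placeholder names are assumptions.
TEMPLATES = {
    "Validation": "Validate that the {information_element} of a single record is {criterion}.",
    "Amendment": ("Propose an amendment for the {information_element} of a single record "
                  "{enhancement} if {conditions}."),
    "Measure": "Measure the {dimension} of the {information_element} of a single record.",
    "Issue": ("Check for an issue in the {information_element} of a single record "
              "where {criterion}."),
}


def describe(test_type: str, **parts: str) -> str:
    """Fill the template for the given test type with the supplied parts."""
    return TEMPLATES[test_type].format(**parts)


print(describe("Validation",
               information_element="maximum elevation",
               criterion="within a valid range"))
```

With those two arguments the Validation template yields example 5) word for word.
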

tucotuco commented 2 years ago

1) To me, it seems important to state what the "thing" is, not just what it does. That's why I put "A test".

2) I haven't assessed all the validations, but in this and other cases the inclusion of "single record" is unnecessarily limiting at best. By this I mean that there is no need for a "record" for this test. It can be applied against a value for a term independent of context. Indeed, it seems a great danger to rely on the concept of a "record". What is a "record" in highly structured data? If we go this route, we will be limiting the tests to Simple Darwin Core, which seems sad.

3) It also seems less than rigorous to refer to the subject of the test as something undefined (e.g., "Maximum Elevation"). This test can be rigorously applied specifically to dwc:maximumElevationInMeters. Say so.

To me, the description I provided gets everything mentioned across succinctly in unambiguous human readable language. I am not in favor of any alternatives provided so far (for this term, at least).

chicoreus commented 2 years ago

@tucotuco for a validation, under the framework, the thing for which the description we are trying to frame is for the criterion in context, that is a very specific concept in the quality needs level, and consists of a criterion, an information element, and a resource type.

A validation which operates on a multi record is a different test from the validations which we have described. Similar validations that operate on multi records are easy to define, but they aren't the tests that we are defining. This gets to the core of the mathematical formalization of the framework. The place where an information element is composed with a criterion is in the criterion in context, where it is also composed with the resource type (single record or multi record). There is nowhere in the framework to put the generalization that you want where the information element and the criterion are composed without the resource type. This is a good generalization, and of use in informally framing some family of related tests, but it isn't one that can be expressed as a label on an element in the framework.

Remember that our shorthand "Test" refers to multiple distinct things defined in the framework, and this generalization is apt to get us, as it is here, in trouble, by departing from carefully thinking about what things are defined in what ways and what places in the framework.

(1) Yes, we should explicitly reference the thing, but as a validation, measure, amendment, or issue, not as a test. "Test" is too general and, as in this discussion, gets us, and will get implementors and consumers of results, into trouble.

(2) As above, there simply isn't anywhere in the framework to put the thing that composes a criterion and an information element. We are, however, thinking in RDF, thus open world, so we could frame an additional term that composes a criterion and an information element, and link it into the criterion in context, and attach this description there, but we'd still need to compose it with single record or multi record in the label for the criterion in context (and be very clear that this description isn't for the test as we have defined it, but for some larger set of generalizations the rest of which are undefined), and the label we'd present on the criterion in context would have to have the resource type in it.

(3) I concur for cases where there is a single information element. When we refer to, say, all of the taxon terms, we probably want to use the composite information elements (for which we already have names and definitions available).

tucotuco commented 2 years ago

@chicoreus That is all fine, but isn't a simpler solution that gets no one in trouble simply another resource type in addition to single record and multi-record, which have specific uses? Would it not be saner to just add and define a resource type such as "property"? The added power seems well worth the simple addition. If not, the framework seems sadly lacking.

ArthurChapman commented 2 years ago

What are we trying to do with this Description? We haven't had it up till now, and it was suggested we needed a simple description to explain in simple English what the test did without having to go to the Expected response etc. That is where we started, but now we are making it more complicated and I believe getting beyond what we wanted - a simple description of the test in plain English so someone could quickly see what the test did.

If we want a full explanation, tying it to the Framework - then I think we need two "descriptions" a simple one that just says what the test does (as we started to do) and a more complicated one that fully covers every aspect of the Framework.

Currently, we don't mention "single record" in each of the Expected Responses, etc. and neither should we.

Let us not lose sight of why we want this description. Let's keep it simple. If we need a fuller description of each test - then let's discuss that as a separate issue.

chicoreus commented 2 years ago

@tucotuco We'd have to get @allankv to chime in on whether that is possible in the framework. Single record is likely large enough in scope to accommodate a set of related records in a relational database anchored off of, say, an occurrenceID (as this fits the CORE needs), or a similar graph in RDF starting off an occurrenceID. It isn't limited to a single flat record, but is a thing that you would want to include or exclude in an analysis depending on whether it had quality for your needs.

I can see an obvious workflow that we've talked about before and you've implemented in the proof of concept sql implementation of the tests, that takes an input data set, takes a term, finds unique values, and then applies the tests to those unique values. With the current validation definitions that we have and their expression in the data quality needs and the data quality report level in the framework, I don't think there's a way to assert the results on that aggregate level. To report on the validations as we have them defined, the distinct values need to be unpacked and then an assertion made at the single record level.
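
A rough Python sketch of that workflow, under stated assumptions: records are plain dicts keyed by term name, each carries an `occurrenceID`, and `in_range` is a simplified stand-in for the validation that returns only a result label. None of these names come from the proof of concept SQL implementation.

```python
from typing import Callable, Dict, Iterable, List, Mapping, Optional, Tuple

Record = Mapping[str, str]


def in_range(value: Optional[str], low: float = -430.0, high: float = 8850.0) -> str:
    """Simplified stand-in for the validation; returns only a result label."""
    try:
        number = float(value or "")
    except ValueError:
        return "INTERNAL_PREREQUISITES_NOT_MET"
    return "COMPLIANT" if low <= number <= high else "NOT_COMPLIANT"


def validate_by_distinct_values(
    records: Iterable[Record],
    term: str,
    validate: Callable[[Optional[str]], str],
) -> List[Tuple[str, str]]:
    rows = list(records)
    # Step 1: find the distinct values of the term and test each one once.
    by_value: Dict[Optional[str], str] = {
        value: validate(value) for value in {row.get(term) for row in rows}
    }
    # Step 2: unpack, so that one assertion is made per single record.
    return [(row["occurrenceID"], by_value[row.get(term)]) for row in rows]
```

The validation runs once per distinct value, but the report still carries one assertion per record, which is the unpacking step described above.
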

There is very likely a good generalization that can be added. We know that each validation we define will be coupled with at least one, and probably two, measures (one numeric percent compliant, one complete/not complete) that fall into the same test family as the validation, except that they apply to a multi record and assess the validations. They are simple to generate, so we haven't discussed them for some years. There is also likely a good generalization that expresses a criterion and an information element and which can be composed with a single record, or with distinct values in a multi record. A validation that makes an assertion about distinct values in a multi record might say something like "none of the distinct values in your data for this term conform to the expected vocabulary", and be coupled with an amendment that might assert "you can make your data compliant by making this set of changes to the distinct values found in this term in your data, and this might be accomplished by changing the mapping of the source data onto the term". Both of these fit only quality assurance, not quality control, while the validations as framed can apply to both: a consumer of a data quality report knows that NOT_COMPLIANT on a validation for a single record means the single record is not fit for their purpose and can be filtered out (until the measure for that validation is COMPLETE, where all records in the filtered multi record are COMPLIANT, or a numeric measure hits 100%).
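
To make the two measures concrete, here is a minimal Python sketch over a list of per-record validation results. The function names are hypothetical, and how INTERNAL_PREREQUISITES_NOT_MET results should be counted is not stated in this thread; the sketch assumes they stay in the denominator and count as not compliant.

```python
from typing import List, Sequence, Tuple


def measure_compliance(results: Sequence[str]) -> Tuple[float, str]:
    """Return (percent compliant, COMPLETE/NOT_COMPLETE) for a multi record."""
    if not results:
        raise ValueError("a multi record needs at least one validation result")
    compliant = sum(1 for result in results if result == "COMPLIANT")
    percent = 100.0 * compliant / len(results)
    status = "COMPLETE" if compliant == len(results) else "NOT_COMPLETE"
    return percent, status


def filter_compliant(results: Sequence[str]) -> List[str]:
    """Quality control step: keep only the records that passed the validation."""
    return [result for result in results if result == "COMPLIANT"]


results = ["COMPLIANT", "NOT_COMPLIANT", "COMPLIANT", "COMPLIANT"]
print(measure_compliance(results))                    # (75.0, 'NOT_COMPLETE')
print(measure_compliance(filter_compliant(results)))  # (100.0, 'COMPLETE')
```

Filtering out the NOT_COMPLIANT record moves the numeric measure to 100% and the other measure to COMPLETE, which is the quality control use described above.
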

chicoreus commented 2 years ago

@ArthurChapman "Currently, we don't mention "single record" in each of the Expected Responses, etc. and neither should we." We don't because it is another element in the test definition, it is the Resource Type, not the Specification. We've been lax with the framework, but we are coming back to a point where we have to get back to tightly conforming to it. We pulled the resource type out of the markdown tables in the github issues because it was the same for every test that we were defining, but in the export that is conformed to the framework, and in the draft rdf, it is added back in.

Remember that our Validations are descriptions of Criterion in Context at the data quality needs level, Specification at the data quality mechanisms level, and Validation at the data quality report level. They aren't simple test descriptions.

chicoreus commented 2 years ago

@ArthurChapman "What are we trying to do with this Description?" I see that we need (1) an rdfs:label that can be applied to a ContextualizedCriterion, and (2) this label can be presented to human consumers of descriptions of the validation in a description of the data quality needs met by the core tests, or as metadata about the validation in a data quality report from a mechanism that ran the core tests to fit a data quality need.

The Specification (the "Expected Response" in the markdown tables in the github issues), is the description of the validation for implementors, the Description is the parallel metadata about the ContextualizedCriterion+Specification+Validation for end users.

We could do something else with the description, but when we compose the formal definitions of the validations, we won't have anywhere to put the description.

Similarly for measures, and amendments, and issues, except they hang off of parallel elements in the framework, rather than the criterion in context and validation.

chicoreus commented 2 years ago

@tucotuco "property" as a resource type could probably work for quality assurance, but I can't see how it would work formally with quality control. For a definition of a validation for quality assurance, the resource type of single record is probably broad enough in scope to accommodate a validation that tests whether the value in a single Darwin Core term is compliant with expectations or not, and an amendment which, if not, could assert actions to take on that property. For quality control, I don't see how filtering to include or exclude a property from an analysis would work. Again, @allankv is the expert on the framework.

chicoreus commented 2 years ago

@ArthurChapman In short, we need a concise description of what each validation, amendment, measure, issue does, intended for end users. For a validation, this needs to include that it is a validation, that it operates on a single record, what information elements it examines, and what the criteria for COMPLIANT are.

ArthurChapman commented 2 years ago

@chicoreus I can still see value in a simple description - perhaps we just need to call it something else, "Summary description", .... ? What you are suggesting is nearly a formulaic description (more along the lines of your spreadsheet with some English words in between) - but I am sure the average user would still be confused as to "what does this test do in simple terms" without having to know the details of how the Framework works, etc. We don't want people (those making decisions at the curator level without necessarily being the person with the technical expertise who implements it) turned off from using the tests to improve their data.

ArthurChapman commented 2 years ago

So @chicoreus - you are saying as a minimum - for this test we need - the four key elements bolded

| Description | A validation test to determine if the value of dwc:maximumElevationInMeters of a single record is within a valid range |

I could live with that

ArthurChapman commented 2 years ago

Or - as this is parameterized

| Description | A validation test to determine if the value of dwc:maximumElevationInMeters of a single record is within a valid specified range |

Tasilee commented 2 years ago

Why not be explicit?

| Description | A validation test to determine if the value of dwc:maximumElevationInMeters is in the range bdq:minimumValidElevationInMeters to bdq:maximumValidElevationInMeters. |

or even larger than ... and less than?

ArthurChapman commented 2 years ago

I'd prefer to not add new terms that we have to define - I think "in the specified range" is OK or you could say

| Description | A validation test to determine if the value of a single record of dwc:maximumElevationInMeters is within a specified parameter range |

to make it more prescriptive and fit closely to the Expected Response - could say

| Description | A validation test to determine if the value of a single record of dwc:maximumElevationInMeters is a number within a specified parameter range |

chicoreus commented 2 years ago

Looking at what I'm doing in labeling the results in MCZbase, I'm thinking @tucotuco is right, there is a great deal of value in a short composition of the information element with the criterion. This makes sense to me in two ways, one in his sense of seeking to generalize, and second as a short form label when the context of validation and single record are known.

In the display of results from event_date_qc tests in MCZbase, I'm using the following (currently hard coded, so it would be nice to be able to look these up) values to label a row in a table of validation results for a single record (e.g. https://github.com/MCZbase/MCZbase/blob/f009cd9aec5901ae19d9a8ea9174bff892865ef5/dataquality/component/functions.cfc#L504 ) as, for the TIME tests:

| Description | A validation test to determine if the value of dwc:maximumElevationInMeters is in the range bdq:minimumValidElevationInMeters to bdq:maximumValidElevationInMeters. |
| Brief | dwc:maximumElevationInMeters is in range |

Description could give us the label for criterion in context, and Brief could give us either/both the generalization that @tucotuco is looking for and a short label for consumers of validation results (where Validation and Single Record are evident from the context).

ArthurChapman commented 2 years ago

Looks reasonable @chicoreus - My previous comment about adding and having to define new terms such as bdq:minimumValidElevationInMeters and bdq:maximumValidElevationInMeters doesn't apply - as I just checked #152 and we do have these defined there.

Note, however your first example "dwc:eventDate precision in seconds" should be "duration in seconds"

chicoreus commented 2 years ago

On Mon, 21 Mar 2022 18:53:52 -0700 Arthur Chapman @.***> wrote:

> Note, however your first example "dwc:eventDate precision in seconds" should be "duration in seconds"

Exactly the trouble with hard coded things. I haven't caught the MCZbase code up with the tests, if there is a property of the validation that can be looked up, the labels don't need to be hard coded in anyone's implementations....
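
One way to picture the lookup, as a hedged Python sketch: labels live in a single registry keyed by the test GUID, and display code asks the registry instead of hard coding strings. The registry, the `ValidationLabels` type and `row_label` are hypothetical; only the GUID, Label, Description and Brief text are taken from this issue.

```python
from typing import Dict, NamedTuple


class ValidationLabels(NamedTuple):
    label: str
    description: str
    brief: str


# Hypothetical registry; in practice this would be loaded from the published
# test definitions rather than typed into each implementation.
LABELS_BY_GUID: Dict[str, ValidationLabels] = {
    "c971fe3f-84c1-4636-9f44-b1ec31fd63c7": ValidationLabels(
        label="VALIDATION_MAXELEVATION_INRANGE",
        description=(
            "A validation test to determine if the value of "
            "dwc:maximumElevationInMeters is in the range "
            "bdq:minimumValidElevationInMeters to bdq:maximumValidElevationInMeters."
        ),
        brief="dwc:maximumElevationInMeters is in range",
    ),
}


def row_label(guid: str, short: bool = True) -> str:
    """Return the Brief (or Description) for a test, falling back to the GUID."""
    labels = LABELS_BY_GUID.get(guid)
    if labels is None:
        return guid
    return labels.brief if short else labels.description


print(row_label("c971fe3f-84c1-4636-9f44-b1ec31fd63c7"))
```

A wording change such as "precision" to "duration" would then be made once, in the registry, rather than in every implementation.
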

ArthurChapman commented 2 years ago

@chicoreus A pity we didn't start a new issue for this discussion

Was one of your last posts suggesting two fields a Description and a Brief? I see @tucotuco responded with a heart!

So in summary: #112

| Description | A validation test to determine if the value of dwc:maximumElevationInMeters is in the range bdq:minimumValidElevationInMeters to bdq:maximumValidElevationInMeters. |

| Brief | dwc:maximumElevationInMeters is in range |

So questions to @chicoreus

  1. Do we label one "Description" and one "Brief" or should Brief be called something else?
  2. I take it from the above that you have now dropped the requirement for "single record"?
  3. For an amendment (#163)

| Description | A test that amends the value of dwc:taxonRank to unambiguously conform to the corresponding value provided from a specified bdq:sourceAuthority. |

| Brief | dwc:taxonRank amended to standard value |

chicoreus commented 2 years ago

@ArthurChapman that's what happens too late at night.... Yes, I'm suggesting Description + Brief. No, Description must include the resource type (single record).

(1) I'm agnostic about what they are called. "Description" is good for the rdfs:label (or skos:label) for the criterion in context. We can probably find a better word than "Brief", but it does carry the intent.

(2) No, Description must have the 4 elements (Validation, information element, resource type, and criterion) (amendments, issues, and measures having parallel terms to criterion in context). Thus:

| Description | A validation test to determine if the value of dwc:maximumElevationInMeters of a single record is within a valid range |

| Brief | dwc:maximumElevationInMeters is within range |

(3) The parallel term to criterion in context for amendments is Enhancement in Context (ContextualizedEnhancement). It composes an enhancement (parallel to criterion) with an information element and a resource type, so yes, the description would be parallel. However, we will want to be careful to use language that asserts that a change is proposed rather than language that asserts that data has been changed:

| Description | A test that proposes to amend the value of dwc:taxonRank in a single record to unambiguously conform to the corresponding value provided from a specified bdq:sourceAuthority. |

| Brief | amendment proposed for dwc:taxonRank to standard value |

For end users, we probably want to maintain the blurring of Enhancement (the test described at the data quality needs level) with Amendment (the response at the data quality reports level) by using the informal "proposes to amend" rather than "An Enhancement to the value of dwc:taxonRank in a single record by proposing to conform it to...", though the latter language could work.

ArthurChapman commented 2 years ago

On consistency: we have different Expected Response wording between this test and the equivalent Depth test. In the Depth test we have the extra wording "of bdq:minimumValidDepthInMeters to bdq:maximumValidDepthInMeters inclusive".

112

INTERNAL_PREREQUISITES_NOT_MET if dwc:maximumElevationInMeters is EMPTY or the value cannot be interpreted as a number; COMPLIANT if the value of dwc:maximumElevationInMeters is within the Parameter range; otherwise NOT_COMPLIANT

187

INTERNAL_PREREQUISITES_NOT_MET if dwc:maximumDepthInMeters is EMPTY or is not interpretable as a number; COMPLIANT if the value of dwc:maximumDepthInMeters is within the Parameter range of bdq:minimumValidDepthInMeters to bdq:maximumValidDepthInMeters inclusive; otherwise NOT_COMPLIANT

Tasilee commented 2 years ago

I tend to prefer #112 as its usage is consistent with referencing other specifications rather than repeating them. The latter is 'spelled out'; more explicit. I see a case for either.

chicoreus commented 2 years ago

As guidance for implementors, the language "within the range of bdq:minimumValidDepthInMeters to bdq:maximumValidDepthInMeters inclusive;" feels clearer than "within the Parameter range;"

ArthurChapman commented 2 years ago

I agree with @chicoreus on this one - prefer the wording of #187 with "Parameter" removed, so

112 would become (also with "inclusive" included)

INTERNAL_PREREQUISITES_NOT_MET if dwc:maximumElevationInMeters is EMPTY or is not interpretable as a number; COMPLIANT if the value of dwc:maximumElevationInMeters is within the range of bdq:minimumValidElevationInMeters to bdq:maximumValidElevationInMeters inclusive; otherwise NOT_COMPLIANT

and #187 (Parameter removed) INTERNAL_PREREQUISITES_NOT_MET if dwc:maximumDepthInMeters is EMPTY or is not interpretable as a number; COMPLIANT if the value of dwc:maximumDepthInMeters is within the range of bdq:minimumValidDepthInMeters to bdq:maximumValidDepthInMeters inclusive; otherwise NOT_COMPLIANT

ArthurChapman commented 2 years ago

NB This also applies to #107 and #39. All 4 should be formatted the same.

ArthurChapman commented 2 years ago

Added "of bdq:minimumValidElevationInMeters to bdq:maximumValidElevationInMeters inclusive" to end of COMPLIANT for consistency with other, related tests.

Tasilee commented 2 years ago

Decision from Zoom: Changed Warning Type from "Unlikely" to "Invalid". Ditto #187

chicoreus commented 1 year ago

I like that.

On Mon, 21 Mar 2022 17:22:00 -0700 Arthur Chapman @.***> wrote:

> So @chicoreus - you are saying as a minimum - for this test we need - the four key elements bolded
>
> | Description | A validation test to determine if the value of dwc:maximumElevationInMeters of a single record is within a valid range |
>
> I could live with that

Tasilee commented 1 year ago

Restructured Parameter(s) and Source authority

Tasilee commented 12 months ago

Splitting bdqffdq:Information Elements into "Information Elements ActedUpon" and "Information Elements Consulted".

Also changed "Field" to "TestField", "Output Type" to "TestType" and updated "Specification Last Updated"