tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq

TG2-VALIDATION_TYPESTATUS_STANDARD #285

Open ArthurChapman opened 7 months ago

ArthurChapman commented 7 months ago
Field: Value
GUID: 4833a522-12eb-4fe0-b4cf-7f7a337a6048
Label: VALIDATION_TYPESTATUS_STANDARD
Description: Does the value of dwc:typeStatus occur in bdq:sourceAuthority?
TestType: Validation
Darwin Core Class: dwc:Occurrence
Information Elements ActedUpon: dwc:typeStatus
Information Elements Consulted:
Expected Response: EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:typeStatus is bdq:Empty; COMPLIANT if the value of the first word in each | delimited portion of dwc:typeStatus is in the bdq:sourceAuthority; otherwise NOT_COMPLIANT.
Data Quality Dimension: Conformance
Term-Actions: TYPESTATUS_STANDARD
Parameter(s): bdq:sourceAuthority
Source Authority: bdq:sourceAuthority default = "Darwin Core typeStatus" {[https://dwc.tdwg.org/list/#dwc_typeStatus]} {dwc:typeStatus vocabulary API [https://gbif.github.io/parsers/apidocs/org/gbif/api/vocabulary/TypeStatus.html]}
Specification Last Updated: 2024-08-03
Examples:
  [dwc:typeStatus="Holotype": Response.status=RUN_HAS_RESULT, Response.result=COMPLIANT, Response.comment="dwc:typeStatus found in the bdq:sourceAuthority"]
  [dwc:typeStatus="cleptotype": Response.status=RUN_HAS_RESULT, Response.result=NOT_COMPLIANT, Response.comment="dwc:typeStatus not found in the bdq:sourceAuthority"]
Source: ALA, GBIF
References:
Example Implementations (Mechanisms):
Link to Specification Source Code:
Notes: This test must return NOT_COMPLIANT if there is leading or trailing whitespace or there are leading or trailing non-printing characters.

ArthurChapman commented 7 months ago

@chicoreus @tucotuco I'm not sure whether the link I have for the API (https://gbif.github.io/parsers/apidocs/org/gbif/api/vocabulary/TypeStatus.html) points to an actual API, or whether one exists at all, hence the NEEDS WORK label.

chicoreus commented 6 months ago

This is a case where "INTERNAL_PREREQUISITES_NOT_MET if dwc:typeStatus is EMPTY" is very likely an unhelpful part of the response. Data have quality for use if the response result is COMPLIANT; any other value implies that the data are not fit for purpose. Where data are sparse, as with type status on occurrence data, the absence of a value does not indicate a data quality problem. This test, and others like it where values are very likely to be correctly sparse, should return COMPLIANT when the information element acted upon is empty.

chicoreus commented 6 months ago

In general, we need to review the use of INTERNAL_PREREQUISITES_NOT_MET for empty values across all of the test definitions to date. When data should have a value most of the time, we should be using INTERNAL_PREREQUISITES_NOT_MET for empty values of the information element acted upon. This holds even when the expectation is aspirational and much data in the wild lacks a value, as with some of the georeference metadata (though even then we need to make sure that there is a georeference before asserting that the absence of georeference metadata is a problem). But when data are expected to be correctly sparse, as here, COMPLIANT is a much more appropriate response result for empty values.

Tasilee commented 6 months ago

@chicoreus - To me, an EMPTY value triggering INTERNAL_PREREQUISITES_NOT_MET states that the test was not run (further). We have no RUN_HAS_RESULT. This makes no judgement about 'quality' or the lack of it. As I understood it, the purpose of INTERNAL_PREREQUISITES_NOT_MET was to make the statement "We are unable to comment on the Information Element Acted Upon". This seems appropriate.

The question remains: is this test Supplementary or CORE (aspirational)?

Tasilee commented 6 months ago

The issue seems to be the triplicate

...NOTEMPTY ...STANDARD ...STANDARDIZED

To me, the anomaly is the NOTEMPTY tests that have RUN_HAS_RESULT=NOT_COMPLIANT. The point @chicoreus raised about an EMPTY value potentially not detracting from 'quality' can apply. Not so for the STANDARD and STANDARDIZED tests.

The point being those NOTEMPTY CORE tests would (I think) be aspirational in that a NOT_COMPLIANT response is something we feel needs to be flagged, and values encouraged.

With tests like #289, they would indeed need to be Supplementary given the lack of records with values.

ArthurChapman commented 6 months ago

The number of specimens with a value in dwc:typeStatus would be very low, and no observations would have one. Thus the test for NOTEMPTY (#246) would definitely be Supplementary as "likely to return a high percentage of either .... bdq:NOT_COMPLIANT results". If we followed a workflow that ran the ...NOTEMPTY test first and then passed those records where the result was COMPLIANT through the test for STANDARD ... etc. - it would be a different matter.

BUT - we don't follow a workflow. That is, each test is standalone, so running this test has great value. 99.9% would have a result of INTERNAL_PREREQUISITES_NOT_MET - OK - not a problem, because this is not unexpected: most specimens and all Observations rightly have no Type Status. Knowing, however, whether the other 0.1% that have something in dwc:typeStatus follow the Standard is important and says a lot about the data quality.

For that reason, I am tempted to suggest that this test could be CORE, but should include a note that most results are expected to be INTERNAL_PREREQUISITES_NOT_MET.

chicoreus commented 6 months ago

@Tasilee In the Framework, under quality control, NOT_COMPLIANT values point out aspects of the data that need improvement for the data to fit the needs of the UseCase, so INTERNAL_PREREQUISITES_NOT_MET can be ignored (or made the responsibility of another test). The thing we need to think through is QualityAssurance, where the data are filtered down so that all records are COMPLIANT on all Validations. There, validations that return INTERNAL_PREREQUISITES_NOT_MET mean that the data lack quality for the use. When data are expected to be densely populated, and a VALIDATION_X_STANDARD is coupled with a VALIDATION_X_NOTEMPTY, this isn't important, as empty values will come up as NOT_COMPLIANT on the paired VALIDATION_X_NOTEMPTY test. But when data are expected to be sparsely populated, and the UseCase isn't pairing a VALIDATION_X_STANDARD with a VALIDATION_X_NOTEMPTY, so that the VALIDATION_X_STANDARD stands alone, any data which correctly have no value will be excluded under the filtering for COMPLIANT-only records under QualityAssurance. So, for sparse data without a paired NOTEMPTY test, we need to allow the VALIDATION_X_STANDARD to treat empty values as compliant, or assess from other terms whether a value should be present (as in some of the metadata tests).
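
To make the QualityAssurance concern concrete, here is a minimal Python sketch; it is not part of the BDQ specification. The three sample records, the `SAMPLE_TERMS` set standing in for bdq:sourceAuthority, and the `empty_is_compliant` switch are all assumptions made for illustration, and matching is simplified to a case-insensitive check on the first word of the value.

```python
# Hypothetical stand-in for bdq:sourceAuthority (not the real vocabulary).
SAMPLE_TERMS = {"holotype", "paratype", "syntype"}

# Made-up records: "b" is an ordinary specimen that correctly has no type status.
RECORDS = [
    {"id": "a", "typeStatus": "holotype of Pinus abies"},
    {"id": "b", "typeStatus": ""},
    {"id": "c", "typeStatus": "cleptotype"},
]


def standard_result(value, empty_is_compliant):
    """Simplified VALIDATION_X_STANDARD: checks only the first word of the value."""
    if not value.strip():
        # The policy under discussion: what should an empty value return?
        return "COMPLIANT" if empty_is_compliant else "INTERNAL_PREREQUISITES_NOT_MET"
    first_word = value.split()[0].lower()
    return "COMPLIANT" if first_word in SAMPLE_TERMS else "NOT_COMPLIANT"


def quality_assurance_filter(records, empty_is_compliant):
    """Return the ids of records whose validation result is COMPLIANT."""
    return [
        r["id"]
        for r in records
        if standard_result(r["typeStatus"], empty_is_compliant) == "COMPLIANT"
    ]


strict = quality_assurance_filter(RECORDS, empty_is_compliant=False)
lenient = quality_assurance_filter(RECORDS, empty_is_compliant=True)
print(strict, lenient)
```

With empty values mapped to INTERNAL_PREREQUISITES_NOT_MET, the record that correctly has no type status drops out of the filtered set together with the non-standard one; with empty values treated as COMPLIANT, only the non-standard value is excluded.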

Currently tagged as CORE, but the note still has the "is not regarded as CORE" text.

ArthurChapman commented 6 months ago

See comment at https://github.com/tdwg/bdq/issues/284#issuecomment-1959199917 where a Vocabulary is being developed by GBIF. Perhaps, pending that, this should be Immature/Incomplete?

chicoreus commented 6 months ago

Where there is an existing vocabulary, we shouldn't mark the test as Immature/Incomplete, even if an improved vocabulary is being developed. The test can be described and implemented with the currently available vocabulary as a source authority. When a new vocabulary comes out, the source authority can be updated. This may not apply to the STANDARDIZED test, as an existing vocabulary may or may not support standardization in the way desired by the test.

Tasilee commented 6 months ago

GBIF has dwc:typeStatus as not EMPTY for ~0.7% of records and ALA 0.5%. If this is truly aspirational, fine, CORE, otherwise Supplementary on the basis of proportion of likely INTERNAL_PREREQUISITES_NOT_MET.

Decision please?

chicoreus commented 1 month ago

Similar concern to #286. Not sure that this is tractable. The expectation for values in dwc:typeStatus is a pipe delimited list of {type status term of taxon name {publication}}. The definition explicitly includes the taxon name as part of the expected value: "A list (concatenated and separated) of nomenclatural types (type status, typified scientific name, publication) applied to the subject."

One example includes citation information, the other just type status term and taxon name.

For just type status terms and taxon names, we could probably manage with two source authorities, one for the type status term and one for the taxon name, but with publication citations included, that will not be tractable.

We might get away with testing the first word of each pipe delimited block to a type status term vocabulary.

Examples in Darwin Core are:

 holotype of Ctenomys sociabilis. Pearson O. P., and M. I. Christie. 1985. Historia Natural, 5(37):388

 holotype of Pinus abies | holotype of Picea abies
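
As a rough illustration of the "first word of each pipe delimited block" idea, the following Python sketch applies it to the two Darwin Core examples above. The function name `first_words` is hypothetical, and returning an empty string for a blank portion is an assumption, not something the thread specifies.

```python
def first_words(type_status):
    """Return the first whitespace-separated word of each | delimited portion."""
    words = []
    for portion in type_status.split("|"):
        tokens = portion.split()
        # A blank portion has no first word; represent it as an empty string.
        words.append(tokens[0] if tokens else "")
    return words


with_citation = (
    "holotype of Ctenomys sociabilis. Pearson O. P., and M. I. Christie. "
    "1985. Historia Natural, 5(37):388"
)
two_names = "holotype of Pinus abies | holotype of Picea abies"

print(first_words(with_citation))
print(first_words(two_names))
```

Only the leading type status term is extracted, so the taxon name and any publication citation never need to be matched against a vocabulary.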

chicoreus commented 1 month ago

See discussion in #286

Short of marking as immature/incomplete, an alternative phrasing might be:

EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:typeStatus is EMPTY; COMPLIANT if the value of the first word in each | delimited portion of dwc:typeStatus is in the bdq:sourceAuthority; otherwise NOT_COMPLIANT.

Tasilee commented 1 month ago

Thanks @chicoreus. Your rendering of the Expected Response seems appropriate given the comments in https://github.com/tdwg/dwc/issues/28, but I would defer to others in the team. Whatever we do here will also apply to #286, though I presume it would be harder to implement there?

chicoreus commented 1 month ago

Slightly harder, but not intractable.

Tasilee commented 1 month ago

No further comments on @chicoreus's reasonable suggestion, so I'm changing the Expected Response from

EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:typeStatus is EMPTY; COMPLIANT if the value of dwc:typeStatus is in the bdq:sourceAuthority; otherwise NOT_COMPLIANT.

to

EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:typeStatus is EMPTY; COMPLIANT if the value of the first word in each | delimited portion of dwc:typeStatus is in the bdq:sourceAuthority; otherwise NOT_COMPLIANT.

Also updated Specification Last Updated.
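
For readers who want to see how the revised Expected Response might be evaluated, here is a minimal Python sketch; it is not a reference implementation. Several points are assumptions not settled in this thread: the `SAMPLE_TYPE_STATUS_TERMS` set is a small stand-in for bdq:sourceAuthority, matching is case-insensitive (suggested by the "Holotype" example, but not stated), a whitespace-only value counts as EMPTY, and a blank portion between pipes is NOT_COMPLIANT. The `Response` class and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for bdq:sourceAuthority; a real implementation
# would load its terms from the configured vocabulary.
SAMPLE_TYPE_STATUS_TERMS = frozenset(
    {"holotype", "paratype", "syntype", "lectotype", "neotype"}
)


@dataclass
class Response:
    status: str
    result: Optional[str]
    comment: str


def validation_typestatus_standard(type_status, source_authority=SAMPLE_TYPE_STATUS_TERMS):
    if source_authority is None:
        return Response("EXTERNAL_PREREQUISITES_NOT_MET", None,
                        "bdq:sourceAuthority is not available")
    # Assumption: None or a whitespace-only value counts as EMPTY.
    if type_status is None or type_status.strip() == "":
        return Response("INTERNAL_PREREQUISITES_NOT_MET", None,
                        "dwc:typeStatus is EMPTY")
    # Per the Notes: leading or trailing whitespace or non-printing
    # characters make the value NOT_COMPLIANT.
    if (type_status != type_status.strip()
            or not type_status[0].isprintable()
            or not type_status[-1].isprintable()):
        return Response("RUN_HAS_RESULT", "NOT_COMPLIANT",
                        "dwc:typeStatus has leading or trailing whitespace "
                        "or non-printing characters")
    # Check the first word of each | delimited portion.
    for portion in type_status.split("|"):
        tokens = portion.split()
        if not tokens or tokens[0].lower() not in source_authority:
            return Response("RUN_HAS_RESULT", "NOT_COMPLIANT",
                            "dwc:typeStatus not found in the bdq:sourceAuthority")
    return Response("RUN_HAS_RESULT", "COMPLIANT",
                    "dwc:typeStatus found in the bdq:sourceAuthority")


for value in ["Holotype", "cleptotype",
              "holotype of Pinus abies | holotype of Picea abies", ""]:
    print(repr(value), validation_typestatus_standard(value))
```

Because only the first word of each portion is consulted, values that carry a taxon name or a publication citation after the type status term can still be assessed against a single vocabulary.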