tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq

TG2-MEASURE_DWC_COMPLETENESS #85

Closed iDigBioBot closed 4 years ago

iDigBioBot commented 6 years ago
| TestField | Value |
|---|---|
| GUID | 28a42e6e-1a79-4728-ab91-aa7f191818de |
| Label | MEASURE_DWC_COMPLETENESS |
| Description | How many Darwin Core terms have a value in them. |
| TestType | Measure |
| Darwin Core Class | All |
| Information Elements ActedUpon | AllDarwinCoreTerms |
| Information Elements Consulted | |
| Expected Response | The number of Darwin Core terms that are bdq:NotEmpty in the record |
| Data Quality Dimension | Completeness |
| Term-Actions | DWC_COMPLETENESS |
| Specification Last Updated | 2024-02-20 |
| Examples | [dwc:eventDate="1881-12-15", dwc:scientificNameID="urn:lsid:marinespecies.org:taxname:208134", dwc:decimalLatitude="21.45", dwc:decimalLongitude="": Response.status=RUN_HAS_RESULT, Response.result=3, Response.comment="Three bdq:NotEmpty Darwin Core terms"] |
| Source | @Tasilee |
| References | |
| Example Implementations (Mechanisms) | |
| Link to Specification Source Code | |
| Notes | The maximum value this test may return can vary based on a number of factors, including the structure of the data set: flat Darwin Core, star schema, and RDF representations are likely to contain different numbers of Darwin Core terms. MultiRecord measures of the minimum, mean, mode, and maximum values of this SingleRecord measure across a data set may be informative for some uses. |

iDigBioBot commented 6 years ago

Comment by Lee Belbin (@Tasilee) migrated from spreadsheet: I would prefer the complement (number of fields in record) as it would usually be less than the number absent.
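For readers who want to see the measure concretely, here is a minimal Python sketch of the count described in the specification, plus the MultiRecord summary mentioned in its Notes. It is not an official implementation. The function names are hypothetical, bdq:NotEmpty is approximated as "present and not only whitespace", and Darwin Core terms are assumed to be dictionary keys carrying a `dwc:` prefix.

```python
from statistics import mean, multimode


def is_not_empty(value):
    """Rough stand-in for bdq:NotEmpty (assumption): present and not just whitespace."""
    return value is not None and str(value).strip() != ""


def measure_dwc_completeness(record):
    """Count the Darwin Core terms in a flat record that hold a value.

    Assumption: Darwin Core terms are the keys that start with 'dwc:'.
    """
    result = sum(
        1 for term, value in record.items()
        if term.startswith("dwc:") and is_not_empty(value)
    )
    return {
        "status": "RUN_HAS_RESULT",
        "result": result,
        "comment": f"{result} bdq:NotEmpty Darwin Core terms",
    }


def summarize_completeness(records):
    """MultiRecord summary (min, mean, mode, max) of the SingleRecord measure."""
    counts = [measure_dwc_completeness(r)["result"] for r in records]
    return {
        "min": min(counts),
        "mean": mean(counts),
        "mode": multimode(counts),  # list, because several values can tie
        "max": max(counts),
    }


# The record from the specification's Examples row.
example = {
    "dwc:eventDate": "1881-12-15",
    "dwc:scientificNameID": "urn:lsid:marinespecies.org:taxname:208134",
    "dwc:decimalLatitude": "21.45",
    "dwc:decimalLongitude": "",
}
response = measure_dwc_completeness(example)
print(response["result"])  # 3, matching Response.result in the specification

# A second, made-up record with a single whitespace-only value.
summary = summarize_completeness([example, {"dwc:eventDate": " "}])
```

Counting absent terms instead would need an agreed list of all Darwin Core terms to subtract from, which is one reason the count of filled terms is the simpler quantity to compute.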

Tasilee commented 4 years ago

I'm writing a paper on the ALA and this aspect of occurrence records has popped up. I was the original proposer of this MEASURE as I thought it would help contribute to any estimate of the overall 'utility' of the record.

The issue was closed but there was no explanation as to why, so I've re-opened it for comment.

ArthurChapman commented 4 years ago

This looks like one we discussed at Gainesville and decided not to make CORE at that stage. I don't have notes on the discussion, so it must have been close to unanimous. I wonder what value this test has for someone running the tests, as it is only a measure: what are they going to do with the result? Early on, we had lots of measures and cut them down. From memory, I think we questioned its worth if you didn't restrict the terms checked, as in many databases some terms may not be relevant. As I see it, the only value would be for aggregators, and there is nothing stopping them from running a separate exercise that may be of more value to them.

Tasilee commented 4 years ago

Thanks @ArthurChapman. I agree that there is a before/after aspect to this MEASURE/S. I just want to make sure we capture some idea of why this would not be a useful measure. It would be useful when comparing with other records. For example, when records are listed, it could be a parameter that may help identify some aspect of 'quality'. The negative is that there is no indication of WHICH Darwin Core terms are filled in. As we all agree, it is the triplet of NAME-SPACE-TIME terms that is fundamental. We have this covered with the three separate tests.

I've sent the worksheet where this test was raised, but there is no indication of why it was dropped. The comments were:

Inferences should be a distinct measure, with PLENTY of metadata about how the inferences were made (e.g., Darwin Cloud correspondence).

I would prefer the complement (number of fields in record) as it would usually be less than the number absent.

Given the recent script run across all records in the ALA listed counts of 'not EMPTY' across all Darwin Core terms, it is not as easy as first thought. Basic issue is that any count of not EMPTY DwC terms needs to be performed before any amendments. The ALA's classic of filling in all not EMPTY “identification_qualifier” with values: “not provided” is cute and not terribly useful.
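The point about counting before amendments can be illustrated with a small Python sketch. Everything here is hypothetical: the helper names, the sample record, and the assumption that the placeholder "not provided" is written into a field that had no supplied value.

```python
def count_not_empty(record):
    """Count fields holding a value (present and not only whitespace)."""
    return sum(
        1 for value in record.values()
        if value is not None and str(value).strip() != ""
    )


def fill_placeholder(record, term, placeholder="not provided"):
    """Hypothetical amendment: write a placeholder into a term that has no value."""
    amended = dict(record)  # copy, so the original record is left untouched
    if str(amended.get(term) or "").strip() == "":
        amended[term] = placeholder
    return amended


# Made-up record: one real value, one empty term.
original = {
    "dwc:scientificName": "Puma concolor",
    "dwc:identificationQualifier": "",
}
amended = fill_placeholder(original, "dwc:identificationQualifier")

before = count_not_empty(original)  # 1: only the name was supplied
after = count_not_empty(amended)    # 2: the placeholder now counts as a value
print(before, after)
```

Run after the amendment, the count rises without any new information having been supplied, which is why the measure would have to be taken on the record as provided.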
ArthurChapman commented 4 years ago

There are also a lot of tests where it matters whether the field is empty (including absent), either as INTERNAL_PREREQUISITES_NOT_MET or as a requirement before an amendment can be made.

Tasilee commented 4 years ago

Thanks @ArthurChapman. I agree and have raised this with the ALA.

No other comments?

Tasilee commented 4 years ago

On the basis of discussions today, this MEASURE is not sufficiently discriminatory. While it could be used as a basis for record comparison or multi-record summaries, there is no discrimination between the significance of the Darwin Core terms. Two records could produce a similar value with very different information quality. This MEASURE does, however, look into the future, where some estimate of a record score could be assessed for particular applications. For example, for an SDM study, records with accepted and correct names to species level, coordinates with a spatial uncertainty of less than 100 m and a date to day level (among other Darwin Core terms) would have high value.
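A rough Python sketch of that argument: two made-up records with the same completeness count, only one of which meets the SDM criteria listed above. The function names, the set of ranks treated as "species level", and the ISO day-level date pattern are all assumptions, and checking that a name is accepted and correct is left out because it would need an external lookup.

```python
import re


def count_not_empty(record):
    """Count fields holding a value (present and not only whitespace)."""
    return sum(
        1 for value in record.values()
        if value is not None and str(value).strip() != ""
    )


def fit_for_sdm(record, max_uncertainty_m=100.0):
    """Hypothetical fitness check built from the three criteria in the comment."""
    rank = str(record.get("dwc:taxonRank") or "").strip().lower()
    rank_ok = rank in {"species", "subspecies"}  # assumed meaning of "species level"

    try:
        uncertainty = float(record.get("dwc:coordinateUncertaintyInMeters") or "")
        uncertainty_ok = uncertainty < max_uncertainty_m
    except ValueError:
        uncertainty_ok = False  # missing or non-numeric uncertainty

    event_date = str(record.get("dwc:eventDate") or "").strip()
    date_ok = re.fullmatch(r"\d{4}-\d{2}-\d{2}", event_date) is not None

    return rank_ok and uncertainty_ok and date_ok


precise = {
    "dwc:scientificName": "Puma concolor",
    "dwc:taxonRank": "species",
    "dwc:eventDate": "1881-12-15",
    "dwc:coordinateUncertaintyInMeters": "50",
}
coarse = {
    "dwc:scientificName": "Puma",
    "dwc:taxonRank": "genus",
    "dwc:eventDate": "1881",
    "dwc:coordinateUncertaintyInMeters": "5000",
}

print(count_not_empty(precise), count_not_empty(coarse))  # 4 4
print(fit_for_sdm(precise), fit_for_sdm(coarse))          # True False
```

Both records return 4 from the completeness count, yet only the first would pass this kind of application-specific check.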

chicoreus commented 7 months ago

Brought markdown table closer to current expectations. Added cautionary note and pointer to potential MultiRecord tests to accompany this one.

ArthurChapman commented 7 months ago

Added "Description"

Tasilee commented 7 months ago

Tweaked ER, removed dependency

Tasilee commented 7 months ago

Changed Test to TestField