tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq
43 stars 7 forks source link

TG2-VALIDATION_DAY_STANDARD #147

Open ArthurChapman opened 6 years ago

ArthurChapman commented 6 years ago
TestField Value
GUID 47ff73ba-0028-4f79-9ce1-ee7008d66498
Label VALIDATION_DAY_STANDARD
Description Is the value of dwc:day an integer between 1 and 31 inclusive?
TestType Validation
Darwin Core Class dwc:Event
Information Elements ActedUpon dwc:day
Information Elements Consulted
Expected Response INTERNAL_PREREQUISITES_NOT_MET if dwc:day is bdq:Empty; COMPLIANT if the value of the field dwc:day is an integer between 1 and 31 inclusive; otherwise NOT_COMPLIANT.
Data Quality Dimension Conformance
Term-Actions DAY_STANDARD
Parameter(s)
Source Authority
Specification Last Updated 2023-09-18
Examples [dwc:day="15": Response.status=RUN_HAS_RESULT, Response.result=COMPLIANT, Response.comment="dwc:day is in range"]
[dwc:day="32": Response.status=RUN_HAS_RESULT, Response.result=NOT_COMPLIANT, Response.comment="dwc:day is not in range"]
Source TDWG2018 DQIG Meeting; TG2-Gainesville
References
Example Implementations (Mechanisms) FilteredPush/Kurator:event_date_qc 10.5281/zenodo.596795.
Link to Specification Source Code event_date_qc DwCEventDQ.validationDayStandard()
Notes This is part of a group of similar tests (VALIDATION_DAY_INRANGE (8d787cb5-73e2-4c39-9cd1-67c7361dc02e), VALIDATION_STARTDAYOFYEAR_INRANGE (85803c7e-2a5a-42e1-b8d3-299a44cafc46), VALIDATION_ENDDAYOFYEAR_INRANGE (9a39d88c-7eee-46df-b32a-c109f9f81fb8)).
ArthurChapman commented 6 years ago

We have changed the description of this from "is not an unambiguous value" to "is not an integer between 1 and 31 inclusive.

chicoreus commented 6 years ago

Isn't this a duplicate of #125 ?

chicoreus commented 6 years ago

Added guid.

chicoreus commented 5 years ago

Field not present as a prerequisite is a significant barrier for implemenations. An implementation of an atomic test as a method which takes values as parameters e.g. validateDayNotStandard(day), is very likely to be isolated from a larger framework which understands whether or not a term is present in an input data set or not. Bringing "is present" into the test definition conflates areas of concern.

This gets back to the need for a definition of isEmpty - if we define isEmpty to include null values, empty strings, and cases where the term isn't present in the data set then we can consistently define prerequisites as being not empty.

I thus propose that we change the element of the definition "INTERNAL_PREREQUISITES_NOT_MET if the field dwc:day is not present; " to "INTERNAL_PREREQUISITES_NOT_MET if the field dwc:day is EMPTY;", reference a standard definition of EMPTY, and propagate this change to all other test definitions.

tucotuco commented 5 years ago

@chicoreus I see what you mean about being able to abstract the test in code be independent of the record structure that it might act on. Defining EMPTY to include when a field isn't present seems an elegant simplification that I don't anticipate would cause any issues. I mean, once it comes to the test implementation, it doesn't much matter of there was a source value that was null or if there was no source field. Making the change would likely simplify all the descriptions that reference 'not present'. Downside: it means a review and rewrite of a bunch of tests - it looks like there might be 67 tests affected. I would not wish that on Lee or Arthur.

Tasilee commented 5 years ago

If setting up a definition of EMPTY will consolidate intent, so be it. I am happy to work through the tests and change the Expected response wording.

But how best to do it? My initial reaction is to add a new Issue for EMPTY that can therefore be referenced, otherwise, in relation to this work, it will be hanging?

What would the definition be?

EMPTY: Field not present Field contains a string of one or more null characters Field contains a string of one or more blank characters ...

tucotuco commented 5 years ago

Hmm. Not sure a label is necessary since we can find them in a string search using 'EMPTY'.

As for a definition, how about:

EMPTY: A field that is needed as input is not present, or, the input field is present but there is no value in the field, or the field is present and the value of the field consists entirely of non-printing characters.

On Wed, Mar 27, 2019 at 4:51 PM Lee Belbin notifications@github.com wrote:

If setting up a definition of EMPTY will consolidate intent, so be it. I am happy to work through the tests and change the Expected response wording.

But how best to do it? My initial reaction is to add a new Issue for EMPTY that can therefore be referenced, otherwise, in relation to this work, it will be hanging?

What would the definition be?

EMPTY: Field not present Field contains a string of one or more null characters Field contains a string of one or more blank characters ...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tdwg/bdq/issues/147#issuecomment-477322154, or mute the thread https://github.com/notifications/unsubscribe-auth/AAcP6w4RqyQm7kSnAAbfFyJg23Q-eX39ks5va8vGgaJpZM4WMu13 .

Tasilee commented 5 years ago

Thanks @tucotuco. I like the definition but where do we put it so that it is at least as referenceable as the tests and comments?

ArthurChapman commented 5 years ago

It goes into the Vocabulary #152

Tasilee commented 5 years ago

Thanks @ArthurChapman. Duh, I forgot about that :|. Will do it now.

ArthurChapman commented 5 years ago

But note we already have a definition given by @chicoreus. Perhaps we need to discuss more - see previous discussion in #111

Tasilee commented 5 years ago

I like @tucotuco 's :) as I can easily understand it, and added it to #111 and the vocab #152. Lets see what @chicoreus says. I'm sure he will find an issue? Ha.

ArthurChapman commented 5 years ago

Problem is, @tucotuco doesn't cover everything - like 9999, etc. and some other things that were in @chicoreus version. That is why we need to discuss more fully. @tucotuco and I will look at it more tomorrow.

chicoreus commented 5 years ago

I think a general definition of EMPTY on the line of @tucotuco's would work, if accompanied by notes about implementations in different languages (empty string, null (if the language supports it e.g in java, javascript, sql), undefined (e.g. in javascript), a char array of size 0 (e.g. in C)). I'm not quite sure, however, about non-printing characters (if using either an ascii or unicode definition), as those are characters and constitute data, and could represent mistyped values (e.g. something that's supposed to be a small integer that is being represented as a low ascii character). I'd be inclined to treat a non-printing character as non-empty and then let tests downstream fail. Based on the data I've seen in the wild (and @tucotuco has seen plenty more), I'd propose including several strings that are typical string serializations of a database null ("\N", "null", "NULL"). I also put on the table, but am not particularly happy with other common serializations of empty value markers ("**", "9999", etc.). The trouble with these is that some of them might be valid data values for some fields, so they don't seem to fit a generalized definition of empty.

tucotuco commented 5 years ago

I agree, non-printing characters are not best included for prerequisites not being met. It would not allow us to report that the content (and there is content) is not compliant, and that is what we really want the recipient to know. I think the same applies for all the string serializations of a database null, now that I think more about it. That is something being shared that is not compliant also.

On Thu, Mar 28, 2019 at 12:56 AM Paul J. Morris notifications@github.com wrote:

I think a general definition of EMPTY on the line of @tucotuco https://github.com/tucotuco's would work, if accompanied by notes about implementations in different languages (empty string, null (if the language supports it e.g in java, javascript, sql), undefined (e.g. in javascript), a char array of size 0 (e.g. in C)). I'm not quite sure, however, about non-printing characters (if using either an ascii or unicode definition), as those are characters and constitute data, and could represent mistyped values (e.g. something that's supposed to be a small integer that is being represented as a low ascii character). I'd be inclined to treat a non-printing character as non-empty and then let tests downstream fail. Based on the data I've seen in the wild (and @tucotuco https://github.com/tucotuco has seen plenty more), I'd propose including several strings that are typical string serializations of a database null ("\N", "null", "NULL"). I also put on the table, but am not particularly happy with other common serializations of empty value markers ("**", "9999", etc.). The trouble with these is that some of them might be valid data values for some fields, so they don't seem to fit a generalized definition of empty.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tdwg/bdq/issues/147#issuecomment-477438069, or mute the thread https://github.com/notifications/unsubscribe-auth/AAcP65YJtOEQLgYsSQlH83w6ncge3yeyks5vbD1tgaJpZM4WMu13 .

tucotuco commented 5 years ago

Following the discussions arising from the event date case study for the BISS paper, I believe that this test should be updated to use the expected response and notes of TG2-VALIDATION_DAY_OUTOFRANGE (#125), which should then be deprecated. Everything else about this test is already up to date with the proposed change.

ArthurChapman commented 5 years ago

Agreed @tucotuco

tucotuco commented 5 years ago

Updated to incorporate TG2_VALIDATION_DAY_OUTOFRANGE (https://github.com/tdwg/bdq/issues/125).

chicoreus commented 5 years ago

@tucotuco

Defining EMPTY to include when a field isn't present seems an elegant simplification that I don't anticipate would cause any issues.

Th Java QC libraries used in Kurator are not able to assess whether or not a field is present or not in the input. At the point at which the tests are run, empty strings are passed in to the test code for either a field not present in the input or a field with an empty value. Likewise the older FilteredPush implementations. Assessing for the presence or absence of a data field in an input record imposes significantly different engineering constraints on a system than tests assessing whether or not a value is presented to the test.

Tasilee commented 5 years ago

@chicoreus: I'm presuming you are suggesting the removal of all phrases like "dwc:xxx is not present"? It does in retrospect seem odd that we have defined EMPTY but not PRESENT.

chicoreus commented 5 years ago

@Tasilee Yes, i would very strongly advocate the removal of "is not present" everywhere it occurs. I would further advocate that we ensure that the definition of isEmpty()/EMPTY allows test implementations to treat the absence of a term in a single record as an empty value without making a distinction between the term being present with no value and the term being absent. This allows tests to be implemented as methods with signatures that take a list of terms (e.g. validationDayNotStandard(day, month, year), rather than as methods that take Darwin Core records and internally extract the terms from them (which is what we would be forced into by assessing the presence/absence of terms at the level of each test).

Presence/absence of terms related to the core in a MultiRecord feels like a distinct test related to only a particular subset of data preparation pipelines, and not being a question about the utility of the data for consumers (e.g. all research consumers would care about is the presence of values for dwc:eventDate, while year, month, and day are primarily concerns of upstream providers and aggregators attempting to map disparate schemas onto Darwin Core and then to aggregate disparate mappings onto a common provision of dwc:eventDate).

I will make two claims: (1) Including a test of terms being present as part of the assessment of emptyness of terms at the level of individual tests puts severe engineering constraints on the implementations of those tests (e.g. @tucotuco couldn't have assessed the difference between term present/absent and term value = null in his sql implementation of the subset of TIME tests). And (2) assessing whether terms are present or absent in a particular SingleRecord or MultiRecord is out of scope for the CORE tests as it relates to a subset of use cases around mapping the data that are out of scope of the set of uses for the data that we examined for CORE.

(1) Says we can define a separate test for presence of terms in a data serialization. (2), if accepted, says we can ignore this issue and just drop is present, (2) if rejected says we should define a separate test for the presence/absence of the information elements/darwin core terms we identified as the set relevant to the core tests.

tucotuco commented 5 years ago

I think it simplifies everything to include in our EMPTY definition the case where a term is not present and remove what would then be superfluous wording.

Tasilee commented 5 years ago

Thanks @chicoreus. I always need to read your comments 4 times to get my head around the concepts. Yes, you all know I am slow by now. Anyway, I think I agree with the conclusions and with @tucotuco. I will need to await clarification of the EXTERNAL_PREREQUISITES question before editing. I don't want to do two passes! Again. I'll leave the definitions to @ArthurChapman and @tucotuco

ArthurChapman commented 5 years ago

OK - I will attempt to change the definition of Empty in #152 (and in the BISS Paper)

ArthurChapman commented 5 years ago

I have changed the definition of EMPTY to: A field that is either not present or does not contain any characters or values. Note: A field containing non-printing or other invalid characters or values (including NULL values) may be separately detected.

Tasilee commented 5 years ago

Do we need a note about @chicoreus comment on difficulty of implementing a concept of fields that are 'not present'

chicoreus commented 5 years ago

That is probably very worth putting into an informative document as part of the standard.

chicoreus commented 5 years ago

This issue is still conflating standardization (unambiguously interpreting text strings as integers) with a test of range (does the day exist for the given month and year).

We should have a test validation_day_outofrange(day, month, year) testing the latter (day exists for month and year), and this test #147 validation_day_notstandard(day) paired with #127 amendment_day_standardized(day) handling cases where values like "1st" can be unambigously interpreted as 1 (and for such, we should use the language unambigously interpreted instead of cast. Cast has a specific meaning in typed programing languages. @ArthurChapman somewhere mentioned cast of the string "1.240" representing a decimal number to the integer 1, which is likely to fail when presented with a european format "1,1240" string, and in many implementations of cast of the string to an integer will fail as the string doesn't contain an integer "1" might work, "1.0" is unlikely to. Actual implementations are likely to use parsing of numbers from strings rather than cast, and the expected case for any Darwin Core input values is as data typed as strings rather than as numeric types (given the absence of numeric types in the formal definitions, and the widespread presence of untyped values in csv files in Darwin Core Archives...)..

ArthurChapman commented 5 years ago

@chicoreus Check out #125 which was deprecated by @tucotuco . Are you suggesting it be resurected? Check out discussions there

chicoreus commented 5 years ago

We need both this and #125. The validation #125 is needed to test whether a value for day exists for the given month and year. It must treat non-integer values as internal prerequisites not met. This test validation #147 (which should take only dwc:day as a parameter and test whether the value of day is an integer) forms a couplet with the amendment #127 (which can propose the replacement of a non standard non-integer value with an integer (e.g. 2nd to 2)). This pairing of a validation and an amendment where the validation tests for exactly what the amendment can correct is critical to measuring improvement of quality. The test (#125) for the (not able to be corrected from these three fields alone, and thus not paired with an amendment) existence of a day value given the month and year is a distinct test, and we need both.

chicoreus commented 5 years ago

This test needs to be VALIDATION_DAY_NOTSTANDARD(dwc:day). Test #125 needs to be VALIDATION_DAY_OUTOFRANGE(dwc:day, dwc:month: dwc:year). Another name for #125 is validation_day_existsformonthyear (not suggesting we adopt that, but it is less confusing).

ArthurChapman commented 5 years ago

In the Expected Response. we have replaced "cast" with either "ambiguously interpreted" or "interpreted" depending on contect

chicoreus commented 5 years ago

I propose (1) Removing year and month from the information elements. (2) changing the specification from:

INTERNAL_PREREQUISITES_NOT_MET if a) dwc:day is EMPTY, or b) dwc:day can not be unambiguously interpreted as an integer, or c) dwc:day can be interpreted as an integer between 29 and 31 inclusive and dwc:month can not be interpreted as an integer between 1 and 12, or d) dwc:month can be interpreted as the integer 2 and dwc:month can be interpreted as the integer 29 and dwc:year can not be unambiguously interpreted as a valid ISO 8601 year; COMPLIANT e) if the value of the field dwc:day can be interpreted as an integer between 1 and 28 inclusive, or f) dwc:day can be interpreted as an integer between 29 and 30 and dwc:month can be interpreted as one of (4,6,9,11), or g) dwc:day can be interpreted as an integer between 29 and 31 and dwc:month can be interpreted as one of (1,3,5,7,8,10,12), or h) dwc:day can be interpreted as the integer 29 and dwc:month can be interpreted as the integer 2 and dwc:year is a valid leap year (evenly divisible by 400 or (evenly divisible by 4 but not evenly divisible by 100)); otherwise NOT_COMPLIANT

to:

INTERNAL_PREREQUISITES_NOT_MET if dwc:day is EMPTY; COMPLIANT if the value of the field dwc:day is an integer in the between 1 and 31 inclusive; otherwise NOT_COMPLIANT.

chicoreus commented 5 years ago

Contrast proposed specification with #125 VALIDATION_DAY_OUTOFRANGE (which checks if the day exists given the month and the year), and compare with the parallel specification in the amendment #127 AMENDMENT_DAY_STANDARDIZED.

Tasilee commented 5 years ago

@chicoreus: Your logic makes sense. #125 is doing the complex range checking, while this one is detecting a simpler scenario. I'll edit the Expected Response accordingly as suggested.

chicoreus commented 5 years ago

Removing year and month from information elements. Minor edit of specification (s/II/I/ s/in the between/between) from:

IINTERNAL_PREREQUISITES_NOT_MET if dwc:day is EMPTY; COMPLIANT if the value of the field dwc:day is an integer in the between 1 and 31 inclusive; otherwise NOT_COMPLIANT.

to:

INTERNAL_PREREQUISITES_NOT_MET if dwc:day is EMPTY; COMPLIANT if the value of the field dwc:day is an integer between 1 and 31 inclusive; otherwise NOT_COMPLIANT.

tucotuco commented 4 years ago

@timrobertson100 Noted that this test still showed a dependency on year and month, which we explicitly removed after long deliberations. I have updated the Notes from

"This test must take into account the given month and year, if present, to account for leap years. This is part of a group of similar tests (#125, #130, #131)."

to

"This is part of a group of similar tests (#125, #130, #131)."

to reflect this.

ArthurChapman commented 1 year ago

Changed Note from:

"This is part of a group of similar tests (#125 #130, #131)." to

"This is part of a group of similar tests (VALIDATION_DAY_INRANGE (8d787cb5-73e2-4c39-9cd1-67c7361dc02e), VALIDATION_STARTDAYOFYEAR_INRANGE )85803c7e-2a5a-42e1-b8d3-299a44cafc46, VALIDATION_ENDDAYOFYEAR_INRANGE (9a39d88c-7eee-46df-b32a-c109f9f81fb8))."

To removed internal GitHub links

Tasilee commented 1 year ago

Splitting bdqffdq:Information Elements into "Information Elements ActedUpon" and "Information Elements Consulted".

Also changed "Field" to "TestField", "Output Type" to "TestType" and updated "Specification Last Updated"