Open iDigBioBot opened 6 years ago
TestField | Value |
---|---|
GUID | 3cff4dc4-72e9-4abe-9bf3-8a30f1618432 |
Label | VALIDATION_EVENTDATE_INRANGE |
Description | Is the value of dwc:eventDate entirely within the Parameter Range? |
TestType | Validation |
Darwin Core Class | dwc:Event |
Information Elements ActedUpon | dwc:eventDate |
Information Elements Consulted | |
Expected Response | INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is bdq:Empty or if the value of dwc:eventDate is not a valid ISO 8601 date; COMPLIANT if the range of dwc:eventDate is entirely within the range bdq:earliestValidDate to bdq:latestValidDate, inclusive, otherwise NOT_COMPLIANT |
Data Quality Dimension | Conformance |
Term-Actions | EVENTDATE_INRANGE |
Parameter(s) | bdq:earliestValidDate, bdq:latestValidDate |
Source Authority | bdq:earliestValidDate default = "1582-11-15"; bdq:latestValidDate default = "{current year}" |
Specification Last Updated | 2024-09-16 |
Examples | [dwc:eventDate="1962-11-01T10:00-0600": Response.status=RUN_HAS_RESULT, Response.result=COMPLIANT, Response.comment="dwc:eventDate is IN_RANGE"] [dwc:eventDate="2300-11-01T10:00": Response.status=RUN_HAS_RESULT, Response.result=NOT_COMPLIANT, Response.comment="dwc:eventDate is NOT_IN_RANGE"] |
Source | VertNet |
References | |
Example Implementations (Mechanisms) | Kurator:event_date_qc |
Link to Specification Source Code | FilteredPush event_date_qc DwCEventDQ.validationEventdateInrange() |
Notes | This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not comment on this. Different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to even identify such records. Given the complexity and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by selecting a transition date and using it as a threshold. |
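
For readers who want to see the Expected Response as code, here is a minimal Python sketch. It is not the Kurator or FilteredPush implementation. The names `to_interval` and `validation_eventdate_inrange` are hypothetical, only `YYYY`, `YYYY-MM`, `YYYY-MM-DD` and `/`-separated ranges are recognized, any time-of-day part is ignored, and the status and result are collapsed into one returned string.

```python
import calendar
import re
from datetime import date

# Accepts YYYY, YYYY-MM or YYYY-MM-DD only (a deliberate simplification).
_DATE_RE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def to_interval(value):
    """Return (first_day, last_day) covered by one date string, or None."""
    match = _DATE_RE.match(value.split("T", 1)[0])  # time of day is ignored
    if match is None:
        return None
    year, month, day = (int(g) if g else None for g in match.groups())
    try:
        if day is not None:
            return date(year, month, day), date(year, month, day)
        if month is not None:
            last = calendar.monthrange(year, month)[1]
            return date(year, month, 1), date(year, month, last)
        return date(year, 1, 1), date(year, 12, 31)
    except ValueError:  # e.g. month 13, February 30, year 0000
        return None


def validation_eventdate_inrange(event_date, earliest="1582-11-15", latest=None):
    """Hypothetical sketch: returns the response result (or status) as a string."""
    if latest is None:
        latest = str(date.today().year)  # default "{current year}"
    if event_date is None or not event_date.strip():
        return "INTERNAL_PREREQUISITES_NOT_MET"
    parts = [to_interval(p) for p in event_date.strip().split("/")]
    if len(parts) > 2 or any(p is None for p in parts):
        return "INTERNAL_PREREQUISITES_NOT_MET"
    start, end = parts[0][0], parts[-1][1]
    if start > end:  # reversed range: treated here as not a valid date
        return "INTERNAL_PREREQUISITES_NOT_MET"
    lower, upper = to_interval(earliest)[0], to_interval(latest)[1]
    # Both bounds are inclusive, as in the Expected Response.
    return "COMPLIANT" if lower <= start and end <= upper else "NOT_COMPLIANT"
```

Note that a year-only `latest` value is expanded to the last day of that year, which is one possible reading of the "{current year}" default.
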
Comment by Lee Belbin (@Tasilee) migrated from spreadsheet: Was thinking of adding a lower bound to make it a more comprehensive test, but could we have fossil eventDate?
Needs clarification for eventDate values which are ranges and which span the oldest/youngest boundaries. For example, 1700-01-01/2100-01-10 is an entirely valid eventDate value with a range which includes all likely specimen collecting dates extant, or for some time into the future. Under the current definition, this value (which is in essence a placeholder for "we don't know what the date was") fails the test. Similarly, 1650-01/1850-02 would be expected to fail, simply because it places a lower bound on the uncertainty earlier than the default 1700. Framing the test to mark as problems any range which extends outside the 1700-present range will potentially encourage people to frame uncertainty about dates too narrowly, instead of setting reasonable uncertainty values for their situation. I'd prefer to just flag eventDate values which fall entirely outside the specified range. Another potential failure case produced by considering ranges that span the boundaries as problems is an eventDate whose value is the current date, without a time. This is a time interval that extends into the future, and a reasonable implementation of the test as stated would mark any record with an eventDate consisting of the current date without a time as an error - something not desirable when the quality control processes are placed upstream close to initial data capture.
@chicoreus I don't see a problem here - we are not saying it is wrong - just a warning that it is out of range. What is done with that is up to the user, but it flags a possible problem. With annotations - a followup annotation may be that this is OK, because ...
The problem is again one of different interpretations of how to represent uncertainty in eventDate values. A European institution with old collections might very reasonably decide to set 1400-01-01 and 2100-01-10 as end boundaries for any events where the collecting date is not known (the 2100 date making these records very easy to find and distinguish from ones which have had the date narrowed based on some additional interpretation), and would then have all of these flagged as problems, binned in with real problem records such as the typical typo 190-10-01. It is very rational from a database perspective to set an end date at some distant future point for all records with uncertainty (this makes them easy to find and collect). I'm not at all in favor of a position that declares that ranges that fall outside the likely bounds are problems. I'd much rather see a narrower test for intervals that entirely fall outside the range of plausible collecting event dates - that should get a much smaller set of false positives and more effectively identify problematic data that needs to be fixed.
The issue that today's date will fail (because today's date, to a resolution of one day in an ISO date, is a temporal interval that extends into the future, unless special case handling is added for today's date) also makes this test highly problematic for upstream uses near the point of observation.
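
To make the "today's date" concern concrete, here is a small Python sketch. It assumes a day-resolution date is read as the half-open interval from midnight to the following midnight, and that the upper bound is the instant the test runs; `day_interval` and `ends_by` are hypothetical names, and the run time is fixed so the example is reproducible.

```python
from datetime import date, datetime, time, timedelta


def day_interval(d):
    """Interval covered by a date given to one-day resolution: [00:00, next 00:00)."""
    start = datetime.combine(d, time.min)
    return start, start + timedelta(days=1)


def ends_by(d, now):
    """True only if the whole day has already elapsed at the moment `now`."""
    return day_interval(d)[1] <= now


# Illustrative run time only: the test is run mid-afternoon on 2019-06-15.
run_time = datetime(2019, 6, 15, 14, 30)
today_fails = not ends_by(date(2019, 6, 15), run_time)   # today's interval is not over yet
yesterday_ok = ends_by(date(2019, 6, 14), run_time)      # yesterday's interval has ended
```

Under these assumptions a record captured today, without a time, is flagged unless the implementation special-cases the current day or uses a day-resolution upper bound.
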
I can understand that at the dataset level, but would expect it to be very rare at the record level. The earliest date can be a designated date for the run as well if you need to set an earlier date for some reason - or particular dataset. I don't see it as a big issue.
I'm a simple soul. I side with @ArthurChapman. We have to be careful that we don't erect obstacles that everyone is then forced to climb over. KISS. Others?
Another way of putting the problem I am seeing: treating any range that extends beyond 1700-today as an error conflates two classes of problems: (1) errors in accuracy (e.g. 198-10-15), and (2) broad statements about uncertainty (1500/2100). Broad statements about uncertainty are already captured separately with a measure of event duration. I will argue that it is important to be able to identify the first class of error in isolation, by implementing this test (in the easier way) by flagging records whose range falls entirely outside the range 1700-present. The current statement of the test is more complex, as it raises the specter of special case handling of records with today's date. I also like KISS, and argue that the current description isn't the simple one.
About 10% of the MCZ data has an unknown event date, recorded in the database (which enforces a start and end date as oracle date fields) as 1700-01-01/2100-01-01. From a database perspective, this is a very useful pair - it is very easy to extract those 183136 records on the basis of those values; narrowing by any inference makes these harder to locate as a single sort of data quality issue.
OK, I'll buy it (range outside 1700-present) @chicoreus , but I would like to hear from the rest of the team.
How many institutions do this other than MCZ? It does seem to be a problem. Under your reasoning @chicoreus - we can't only do "not in future". It would appear to me that the field is being used in ways it was never meant to be used, but I can't see any simple way around it other than to remove this test altogether.
Re-examining this validation, I cannot see a problem with flagging a suspicious date (or date range) that is before 1700 or after the day the test is run. A "NOT COMPLIANT" would seem useful information to follow up on. A false positive flag seems preferable to me to a false negative where one end of a range is totally outside 1700-today.
Considering #66, I'd be inclined to include invalid dates (e.g., Feb 30) under this test as they are not in the possible range of dates, and they may well be formatted to ISO standard. This would make this validation dependent on #61.
I'll suggest that we split this test into two separate tests, one of which tests whether or not the event date extends outside the boundaries 1700-present, and the other to test whether or not the event date falls entirely outside the boundaries 1700-present. The first test (crosses out of bounds) may represent problematic data or it may represent a large uncertainty. The second test (falls entirely out of bounds) likely flags data that contains errors (e.g. typos that leave a digit out of the year 190-05-18), but can potentially also flag rare but valid older material, and certain representations of zooarcheological material. This fits a principle of keeping tests simple and focused on particular potential problems.
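
The two proposed tests differ only in how an interval is compared with the bounds. A minimal Python sketch follows, using hypothetical function names and illustrative bounds of 1700-01-01 to 2019-12-31. The typo example is read as the year 190, which is an assumption about how an implementation might parse it.

```python
from datetime import date


def extends_outside(start, end, lower, upper):
    """'Crosses out of bounds': any part of the interval lies outside [lower, upper]."""
    return start < lower or end > upper


def falls_entirely_outside(start, end, lower, upper):
    """'Entirely out of bounds': no part of the interval lies inside [lower, upper]."""
    return end < lower or start > upper


LOWER, UPPER = date(1700, 1, 1), date(2019, 12, 31)  # illustrative bounds only

placeholder = (date(1700, 1, 1), date(2100, 1, 1))   # "date unknown" range
typo = (date(190, 5, 18), date(190, 5, 18))          # 190-05-18 read as year 190

placeholder_flags = (
    extends_outside(*placeholder, LOWER, UPPER),
    falls_entirely_outside(*placeholder, LOWER, UPPER),
)
typo_flags = (
    extends_outside(*typo, LOWER, UPPER),
    falls_entirely_outside(*typo, LOWER, UPPER),
)
```

With these inputs the placeholder range is flagged only by the first test, while the typo is flagged by both.
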
I have a proposal for a change in the expected response. Instead of
"INTERNAL_PREREQUISITES_NOT_MET if there is no default designated date or the field dwc:eventDate is either not present or is EMPTY; COMPLIANT if the range of dwc:eventDate does not extend into the future and optionally does not extend before a date designated when the test is run, otherwise NOT_COMPLIANT"
I propose
"INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is either not present or is EMPTY; COMPLIANT if no part of the range of dwc:eventDate extends outside optionally-provided begin and end dates; otherwise NOT_COMPLIANT. If no end date is provided, the test should use the current time as an upper bound."
I would also change the Notes. Instead of
"The results of this test are time-dependent. An invalid date for tomorrow will be valid tomorrow. This test provides the option to designate a lower limit to the date, which for specimen records should be 1700-01-01 by convention. (Thus this test has two parameters, a boolean to use or not use a lower bound, and a lower bound, which defaults to 1700-01-01). NB if the parameter is not set, it defaults to 1700-01-01."
I propose
"The results of this test are time-dependent. Today the date for tomorrow is not valid. Tomorrow it will be. This test provides the option to designate lower and upper limits to the date. The upper limit, if not provided should default to the time when the test is run. There should be no default lower limit. NB By convention, use 1700-01-01 as a lower limit for collecting dates of biological specimens."
The proposal from @tucotuco makes sense. We do, however, need to specify how the test should behave when the dwc:eventDate is not a valid ISO date. I propose this should be INTERNAL_PREREQUISITES_NOT_MET.
@chicoreus #66 checks for the ISO standard in eventDate. If #66 is run prior to #36, then it would have already been covered. This goes back to a workflow of the order of tests.
@ArthurChapman: I agree. This is what we concluded yesterday - that we should not need to re-test for a condition if it had already been tested. And yes, this means workflow dependencies (which we already had).
I agree with @chicoreus. Tests must be defined independent of each other and of any abstract workflow that might use them. Every test must deal appropriately with whatever input it is given.
True that if our recommended workflow order is followed, this test might not be run at all when its internal prerequisites are not met.
@ArthurChapman we must not assume that implementors will run validations in any particular order; indeed, parallelized implementations where the order in which validations are run is non-deterministic are likely at large scale. Also, each test must be able to stand in isolation to be mixed and matched with other core or non-core tests to meet the needs of additional use cases. By imposing assumptions about validation order on the test definitions, we are in effect limiting their utility to only core use cases, not letting them be reused for other needs.
Also, implementors should develop tests in parallel with unit tests of those implementations, and the unit tests should exercise the behavior of the tests under edge case conditions. Text strings containing non-ISO dates are expected edge cases for all of the tests that take dwc:eventDate as an information element. If we don't define the behavior, it will be undefined: some implementors might embed interpretation and return compliant for the same value that other implementors return as non-compliant, and that still others return as prerequisites not met. Better to tell implementors what to do in this case, without assumptions about the order of tests or the turning on and off of different tests.
Consider the value dwc:eventDate="1820-4-3" and four implementors who handle the format error differently in the test internals.
Implementor 1 tests for iso format, and returns non-compliant before testing range in #36.
Implementor 2 tests for iso format, and returns internal prerequisites not met before testing range in #36.
Implementor 3 parses the string into year/month/day-year/month/day integers, doesn't recognize that the format isn't correct, and ends up testing the range in #36.
Implementor 4 has a workflow system for the tests that doesn't run downstream tests that have their assumptions not met and never gets to #36.
End consumer of the data quality reports is confused.
If we are specific about the handling of the problematic case for this issue then:
All implementors test for iso format, and return internal prerequisites not met before testing range in #36.
Or, implementors with the workflow system that recognizes test dependencies leave out #36. End users are not confused by some implementations saying their data are compliant and others saying it is not compliant.
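
The divergence described above can be reproduced in a few lines of Python. This is a sketch only: `lenient_in_range` stands in for an implementation that splits the string into integers without checking the format, `strict_in_range` for one that checks the format first, and the bounds are illustrative. Neither function is taken from an existing library.

```python
import re
from datetime import date

# Strict day-resolution pattern; real ISO 8601 handling would accept more forms.
STRICT = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def lenient_in_range(value, lower, upper):
    """Splits on '-' and never notices that the format is not ISO."""
    year, month, day = (int(part) for part in value.split("-"))
    return "COMPLIANT" if lower <= date(year, month, day) <= upper else "NOT_COMPLIANT"


def strict_in_range(value, lower, upper):
    """Checks the format first, then the range."""
    if not STRICT.match(value):
        return "INTERNAL_PREREQUISITES_NOT_MET"
    try:
        return lenient_in_range(value, lower, upper)
    except ValueError:  # well-formed but impossible date, e.g. 1820-02-30
        return "INTERNAL_PREREQUISITES_NOT_MET"


LOWER, UPPER = date(1582, 11, 15), date(2024, 12, 31)  # illustrative bounds only
lenient_result = lenient_in_range("1820-4-3", LOWER, UPPER)
strict_result = strict_in_range("1820-4-3", LOWER, UPPER)
```

The same input yields COMPLIANT from the lenient version and INTERNAL_PREREQUISITES_NOT_MET from the strict one, which is the inconsistency the specification needs to rule out.
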
@chicoreus Accepted. @Tasilee - we must change some of the Expected Response as a result of this decision.
This is one of two TIME tests that has an issue in implementation by the specification. I propose changing the specification from:
INTERNAL_PREREQUISITES_NOT_MET if there is no default designated date or the field dwc:eventDate is EMPTY; COMPLIANT if the range of dwc:eventDate does not extend into the future and optionally does not extend before a date designated when the test is run, otherwise NOT_COMPLIANT
to:
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1:2019 date; COMPLIANT if the range of dwc:eventDate does not extend into the future and optionally does not extend before a date designated when the test is run, otherwise NOT_COMPLIANT
@chicoreus. Your change makes sense - as we do have a default Parameter date, the first part of the old wording now makes no sense. I am not sure, however, of the need for the word "optionally", as if there is no date designated, then it defaults to the default date. In some of the other tests we have used words something like "... dwc:eventDate does not extend beyond the Parameter range" or "... Parameter limits". That then caters for the earliest date, whether the parameter is set or the default is used, and also caters for the future with bdq:latestValidDate set as the current date.
@chicoreus Another thought to bring it more in line with others and to make it positive rather than negative
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1:2019 date; COMPLIANT if the range of dwc:eventDate is within the parameter range, otherwise NOT_COMPLIANT
@ArthurChapman that latest makes good sense. It makes it clear that an eventDate whose range extends beyond the specified earliest and latest dates is not compliant, that is, the eventDate must fall entirely within the range specified by the earliest and latest parameters.
@chicoreus. And it is consistent with what we have done for elevation, depth, etc. where we have used similar wording. I like simplicity and consistency - both of which should aid in coding.
Ditto on simple. I've updated the Expected Response.
Good. However the current definition doesn't match the notes which indicate that the test of the lower bound is optional. This can't be taken from just the specification and the parameters, and this statement exists in only the notes.
Notes simplified: Is @chicoreus happy?
@Tasilee I am happy, don't know if the archeozoological community will be happy. The intent, as I understand it, of making the lower bound optional was to accommodate them. @tucotuco, thoughts?
Can't they simply use -100000-01-01? "To represent years before 0000 or after 9999, the standard also permits the expansion of the year representation but only by prior agreement between the sender and the receiver. An expanded year representation [±YYYYY] must have an agreed-upon number of extra year digits beyond the four-digit minimum, and it must be prefixed with a + or − sign."
They could, but the "agreed-upon number of extra year digits" is a potential problem, as we would have to specify the number of allowed extra digits, and 6 digits and up to -999,999 might not be enough. Since the parameter is to an api that we are specifying, the prior agreement bit isn't a concern. However, this gets us to whether dwc:eventDate allows for years before 0000, and, given that prior agreement phrasing, I rather suspect it doesn't, dates prior to 0000 would need to go off to the geological age terms under the current phrasing. @tucotuco?
@Tasilee by the way, an event_date_qc implementation without the lower bound being optional is about 1/3 the size and much cleaner to read than an implementation where the lower bound is optional....
How about changing the wording from:
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1:2019 date; COMPLIANT if the range of dwc:eventDate is within the parameter range, otherwise NOT_COMPLIANT
to:
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1:2019 date; COMPLIANT if the range of dwc:eventDate is entirely within the range specified by the earliest and latest valid date parameters; otherwise NOT_COMPLIANT
Also note the inconsistency between the default parameter values (which are years) and the value in the notes (1600-01-01). Note that current year (taking 2019 as the current year), is interpreted as a date range ending at 2019-12-31, not on the current day (tomorrow, thus the current day, is explicitly mentioned in the notes), thus a date a few months in the future could be valid as the parameter defaults are currently stated.
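
A small Python sketch of that last point, assuming a year-only parameter value is expanded to the whole calendar year. `year_upper_bound` is a hypothetical helper and the dates are illustrative.

```python
from datetime import date


def year_upper_bound(year):
    """Last day covered by a year-only latestValidDate such as "2019"."""
    return date(year, 12, 31)


run_day = date(2019, 3, 1)      # day on which the test is run (illustrative)
event_day = date(2019, 9, 1)    # an eventDate six months after the run

within_year_default = event_day <= year_upper_bound(2019)  # upper bound = end of current year
within_run_day = event_day <= run_day                      # upper bound = day of the run
```

The future date passes when the upper bound is the end of the current year but fails when the bound is the day the test is run, so the default needs to say which reading is intended.
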
My preference is to defer to the Parameters rather than anything more specific such as "earliest" / "latest". "Entirely" would seem wise however.
I agree @Tasilee - if for no other reason, than that is how we have written it elsewhere - thus for consistency
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1:2019 date; COMPLIANT if the range of dwc:eventDate is entirely within the parameter range; otherwise NOT_COMPLIANT
@ArthurChapman: That was the Expected Response version I edited this morning.
@ArthurChapman @Tasilee in implementing, I just find the phrasing "within the parameter range" difficult to interpret, I'd be much happier with a more explicit reference to the parameters. "Within the parameter range" makes me want to look for a single parameter, and perhaps for a single parameter called range, while this test specifies the potential for two parameters.
Probably entangled in this is that we haven't defined whether the guid for a test that takes parameters also applies to runs of that test that use values other than the default parameters. I think I would be happier specifying that the guid applies only to the test with only default values of the parameters, and that parameterized tests that take different parameters at runtime must be assigned a different guid, otherwise we risk confusion by consumers who again wonder why the identical test run by different agents on the same data ends up with different results.
I'm also thinking we need to put the actual parameter values other than defaults into the response data object for each test run, that's likely the only way to communicate to human consumers why test results differ....
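
One way to picture this suggestion is a response object that carries the parameter values used for the run. The Python sketch below is purely illustrative: the `Response` class and its `parameters` field are hypothetical and are not part of any existing BDQ implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Response:
    """Hypothetical response object that records the parameters used for a run."""
    status: str
    result: str
    comment: str
    parameters: dict = field(default_factory=dict)


# Two agents run the same test on dwc:eventDate="1550-06-01" with different parameters.
run_a = Response(
    status="RUN_HAS_RESULT",
    result="NOT_COMPLIANT",
    comment="dwc:eventDate is NOT_IN_RANGE",
    parameters={"bdq:earliestValidDate": "1582-11-15", "bdq:latestValidDate": "2024"},
)
run_b = Response(
    status="RUN_HAS_RESULT",
    result="COMPLIANT",
    comment="dwc:eventDate is IN_RANGE",
    parameters={"bdq:earliestValidDate": "1500", "bdq:latestValidDate": "2024"},
)

# A consumer can see that the differing results trace back to differing parameters.
differing = {
    key for key in run_a.parameters if run_a.parameters[key] != run_b.parameters[key]
}
```

Here `differing` contains only `bdq:earliestValidDate`, which explains why the two results disagree.
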
It makes sense to me to have a Parameterised label and a matching Parameter term in our definitions with the parameter fields and values defined. Isolating the fields (and values) to one location then permits changes in one location.
An alternative would be to remove the Parameter term in the table and include something like "parameter:bdq:earliestValidDate(default:1600)" in the Expected Responses. That could be a tad verbose.
Regards guid @chicoreus - I agree that it needs to relate to defaults.
@chicoreus said:
"However, this gets us to whether dwc:eventDate allows for years before 0000, and, given that prior agreement phrasing, I rather suspect it doesn't, dates prior to 0000 would need to go off to the geological age terms under the current phrasing."
dwc:eventDate is silent on the valid range. The agreement would have to be a community effort to resolve. Now that there is an official ChronometricAge extension, the importance of this is much reduced, as there is a way to be explicit about the age of the material without conflating that with the date of collection - an issue that arises less often for modern specimens and observations.
Minor comment: The Notes say, "which for specimen records". What should happen for observations of various kinds? If the dwc:basisOfRecord comes to bear on the test, shouldn't it be among the information elements?
Rather than get into that morass, in place of the current Notes
"The results of this test are time-dependent: An invalid date for tomorrow will be valid tomorrow. This test provides for a default earliest date, which for specimen records should be 1600-01-01 by convention."
And though I hate to put in a default that can not be well justified, I would put
"The results of this test are time-dependent: An invalid date for tomorrow will be valid tomorrow. This test provides for a default earliest date, which is 1600-01-01 by convention."
Thanks @tucotuco. I agree with your amendment to Notes and without further comment, applied it.
Altered Expected Response in line with other tests from
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1:2019 date; COMPLIANT if the range of dwc:eventDate is entirely within the parameter range, otherwise NOT_COMPLIANT
to
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1:2019 date; COMPLIANT if the range of dwc:eventDate is entirely within the range bdq:earliestValidDate to bdq:latestValidDate, otherwise NOT_COMPLIANT
Hi,
Just wondering if the default parameter value for this validation, bdq:earliestValidDate="1600", should be 1500, for consistency with https://github.com/tdwg/bdq/issues/84#issuecomment-1251580931?
Yes, good catch @ymgan. In this issue the bdq:earliestValidDate should also be "1500". So good to have more eyes on the test definitions!
@Tasilee I have updated the test. You may need to update the data. I cannot see any reason that these two dates should be different. Good catch @ymgan
Added edge case to test data.
I've edited the Expected Response according to @tucotuco suggestion:
From
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1:2019 date; COMPLIANT if the range of dwc:eventDate is entirely within the range bdq:earliestValidDate to bdq:latestValidDate, otherwise NOT_COMPLIANT
to
INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is EMPTY or if the value of dwc:eventDate is not a valid ISO 8601-1 date; COMPLIANT if the range of dwc:eventDate is entirely within the range bdq:earliestValidDate to bdq:latestValidDate, inclusive, otherwise NOT_COMPLIANT
and updated the References
I have updated the ISO Reference link
Pushing the default earliest date prior to 1582 raises a problem ( default bdq:earliestValidDate="1500" ) as without prior agreement, under ISO 8601-1, dates prior to the start of the Gregorian Calendar on 1582-11-15 are not valid. Thus dates in the range 1500-01-01/1582-11-14 could be reasonably expected by implementors to result in INTERNAL_PREREQUISITES_NOT_MET, as code evaluating them against ISO 8601-1 can plausibly assert that they are not validly formed ISO 8601-1 dates. The same concern applies to #84
From the TG2 call, per @tucotuco: if you care about dates affected by unknown calendar use, use the start date 1918. Add a note (here and in #84): if your use requires knowledge of the date to a precision finer than one year and ten days, use 1918-02-14 as the earliestValidDate (as the calendar isn't certain).
Slightly edited notes from an email, with added notes in italics from TG2 call:
I'll suggest we switch to 1582-11-15. General agreement on this in TG2 call. That date is supportable on the basis of ISO 8601-1 asserting that the use of proleptic Gregorian calendar dates prior to that date is not allowed in ISO 8601-1 without prior agreement of the parties exchanging data.
Since Darwin Core is mute on whether proleptic gregorian dates are allowed, no prior agreement exists, and we can argue that dates prior to this are automatically suspect.
In practice, dates prior to 1752 in the British empire, 1700 in various European protestant countries, 1918 in Russian territories (1918-02-14), are suspect, as those are the years of adoption of the gregorian calendar in those areas, and a reported date may not have the metadata needed to determine if it was a julian date as originally asserted, or has been converted to a gregorian date. So any analysis that depends on date precision of less than 10 days, can't simply use any date prior to 1918 without thinking harder about the sources of the date data... Proposal above from @tucotuco to specify in this test (and in #84) the value for bdq:earliestValidDate=1918-02-14 provides education that this may be a concern, and provides a means for users where this may be a concern to identify records where it may be a concern.
The difference between the Gregorian and Julian calendars has typically been around 10 days, but see the comparison on https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar, where there is no difference for most of the years 200 to 300... Also year 0 may or may not exist...
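
For a sense of scale, the offset between the two calendars can be computed with the standard century arithmetic. The Python sketch below is not taken from the thread: `julian_to_gregorian_offset` is a hypothetical helper, valid for positive years and for dates on or after 1 March of the given year (January and February of a century year still carry the previous century's offset).

```python
def julian_to_gregorian_offset(year):
    """Days to add to a Julian calendar date to obtain the Gregorian date.

    Assumes a positive year and a date on or after 1 March of that year.
    """
    centuries = year // 100
    return centuries - centuries // 4 - 2


# Adoption years mentioned in the discussion, used here as illustrative inputs.
offsets = {year: julian_to_gregorian_offset(year) for year in (1582, 1752, 1918)}
```

The offset grows from 10 days at the 1582 reform to 13 days by 1918, so a date with unknown calendar provenance carries an uncertainty of that order.
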
But, it gets worse... there is the issue of what the start day of the year was, e.g. with the British civil year starting on March 25 instead of January 1. So dates from the British empire, or from British collectors from prior to 1752 may be off by 10 days and off by one year, depending.
Looks like a good explication on https://www.cree.name/genuki/dates.htm
Wikipedia cites this for the text "The best practice for citation of historically contemporary documents is to cite the date as expressed in the original text and to notate any contextual implications and conclusions regarding the calendar used and equivalents in other calendars. This practice permits others to re-evaluate the original evidence"
We expect dwc:eventDate to contain a gregorian date. dwc:verbatimEventDate allows for capture of a date as found in the original text, and eventRemarks does allow for the capture of metadata about the translation of a local julian date into a gregorian date. So the capability exists within Darwin Core to document transformations between calendars and the related evidence for so doing.
We'll likely also need to consider this in #86, at least by including metadata that the assumed calendar for verbatim date is gregorian.
Changed Parameter(s) to "bdq:earliestValidDate, bdq:latestValidDate".
I'll leave the outcomes of the 1500 and calendar discussions to @chicoreus to decide and implement. My conclusion to @tucotuco (loose and strict implementation) is to document (Notes) our Parameter(s) accordingly?