tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq

TG2 - Parameterized #178

Closed ArthurChapman closed 4 years ago

ArthurChapman commented 5 years ago

Having a look at the tests, we now seem to have added Parameterized to (virtually) every test where we have a vocabulary - even where (e.g. #62) the Vocabulary is an ISO Standard.

I am not sure that we have thought this through for each case.

  1. What are the parameters that need to be set (sometimes it appears to be the "specified source authority", in others an upper or lower limit - date, elevation, etc.)?
  2. I think we need a default in most cases if the Parameter is not set. We have put that in some. i.e. a default vocabulary (e.g. TGN) or value (1 Jan 1753), etc.
  3. There is a lot of extra work when running the tests if one has to set a parameter for lots and lots (42) of tests. I think we need to make it clear (in Notes?) what the Parameter is that needs to be set. It is not clear in some of the tests where we have Parameterized. In most it is specifying the Source Authority, in others an upper or lower limit. Are we overusing parameterization? Do we need another field explaining what the parameter is?
tucotuco commented 5 years ago

I have to admit that the thought of a field for parameters occurred to me in passing as well. I think it would help make things clear. The field could contain a good descriptive name for the parameter(s) and the default value(s). Default values for vocabularies may be tough in some cases, as they do not exist, or are not vetted community wide, or they are not apt for inclusion as they are (e.g., TGN). What do we do in those cases? I definitely do not think we are over-using parametrization. I think it is super important to make the tests flexible. Note that test suites will have to include parameter values as well.

Tasilee commented 5 years ago

OK. This is what I was trying to get at with my comment on #63 about the correlation of vocabs and parameterized.

Tasilee commented 5 years ago

OK, in checking the first few of Parameterized with the long-suffering @ArthurChapman, there are syntax and content issues we need to standardize before I feel comfortable about making more changes (42 all up at the moment). So far Parameter(s) edited in table as examples:

163 Specified source authority, default = http://rs.gbif.org/vocabulary/gbif/rank.xml

162 Specified source authority, default = http://rs.gbif.org/vocabulary/gbif/rank.xml

141 earliest_year, default = 1700; latest_year, default = current year

139 Specified source authority, default = The Getty Thesaurus of Geographic Names (TGN: http://www.getty.edu/research/tools/vocabularies/tgn/index.html)

This raises

  1. Are we happy with "Specified source authority" and syntax for terms such as "XXX_YY"?
  2. Do we really need to use "optional"? I would think not as if needed, covered in Notes?
  3. Do we have a label as in "The Getty Thesaurus of Geographic Names" or "TGN" and/or a URL that provides more information if not the API address??
  4. We note re-use of References in Parameter(s), which may be fine?
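
The example entries above each pair a parameter name with a default (e.g. earliest_year, default = 1700). A minimal Java sketch of how an implementation might hold such an entry and fall back to the default when nothing is supplied follows. The class `TestParameter` and its `resolve` method are hypothetical names for illustration only, not part of any BDQ specification or library.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical holder for one test parameter: a descriptive name plus its default. */
final class TestParameter {
    private final String name;
    private final String defaultValue;

    TestParameter(String name, String defaultValue) {
        this.name = Objects.requireNonNull(name);
        this.defaultValue = Objects.requireNonNull(defaultValue);
    }

    String name() {
        return name;
    }

    /** Returns the supplied value if present and non-blank, otherwise the default. */
    String resolve(Map<String, String> supplied) {
        String value = supplied.get(name);
        return (value == null || value.trim().isEmpty()) ? defaultValue : value.trim();
    }

    public static void main(String[] args) {
        // Name and default taken from the example entry for 141 above.
        TestParameter earliestYear = new TestParameter("earliest_year", "1700");
        System.out.println(earliestYear.resolve(Map.of()));                        // falls back to the default
        System.out.println(earliestYear.resolve(Map.of("earliest_year", "1600"))); // uses the supplied value
    }
}
```

With this shape, a test run only has to supply the parameters it wants to change; everything else resolves to the documented default.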

--

ArthurChapman commented 5 years ago

There has been some discussion around default values for parameterized tests

  1. I would like to see a default in all cases where possible (even if these may change down the line as Vocabularies of Value are built).
  2. There has been discussion on the default value for year (of eventDate, year, etc.) as opposed to taxonomic values, which should be Linnean 1953. In some cases we are using (suggesting) 1700 as a lower limit of collecting dates for biological specimens (although @tucotuco has suggested that there be no default value). I think 1700 is too late for a default value as there are hundreds of thousands, if not millions, of collections that predate 1700 - especially in Europe, Asia, etc. But what date should we set? 1650, 1600? Suggestions please - we need to finalize these.
ArthurChapman commented 5 years ago

To answer @Tasilee: should we use a link to a web address, a name ("The Getty Thesaurus of Geographic Names" or "TGN") or a link to an API? In the parameter field, I think it should be an API if possible for the default. Then in the References, a full name and web link to the vocabulary.

  1. I am happy with "Specified source authority" . Not sure what you mean by "XXX_YY" - but if you mean NOT_FOUND, NO_REPORT, etc. - I am happy with these. Perhaps if there is confusion, we should add them to Vocabulary.
  2. Need to see an example - we do say something like does not extend beyond optionally provided begin and end dates. In this case - I don't think it is necessary - but make sure there is a default - if they want to set it then it is an option, if not it is the default
  3. Should have an API. The full name and web address should be in the references.
  4. Not sure what you mean here.
Tasilee commented 5 years ago

Thanks @ArthurChapman. I agree that we should supply a default even if it is 'best guess' as that will be helpful for implementers as a starting position.

Regarding default minimum year, I think you mean '1753' and not '1953'? With my limited taxonomic experience, '1600' would seem a reasonable 'flag-raising' point but my reservation is that I tend to err toward false positives rather than false negatives. Meaning, I would rather raise a flag for those below 1753 than to not flag those between 1600 and 1753.

The 'XXX-YYY' was to cover terms in the 'Expected response' such as 'NOT_EMPTY', 'NOT_COMPLIANT', 'NO_REPORT' etc. I will check that these are in the vocab, as they have grown with the implementation of the 'Expected responses'.

I agree that the Parameter defaults should ideally point to an API, but a) some don't exist, b) some exist but may not be tightly coupled to a 'standard' and c) some are hard to find.

My note about References in Parameters means that in some cases, we use the references as a link to defaults. In other cases, I have taken info from the 'Expected response', for example if there is a mention of 'authority'.

ArthurChapman commented 5 years ago

Yes 1753. There is no logical reason for selecting 1753 for collections - there is for taxonomy. I am not sure where we got 1700 and what the logic was for that. 1600 predates the years of major scientific exploration (Spanish, Portuguese, British and French).
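
To make the trade-off between candidate lower bounds concrete, here is a small Java sketch of a year range check whose bounds can be overridden. It is only an illustration under stated assumptions: the class `YearRangeCheck`, the `Outcome` labels and the choice of 1600 as the default are hypothetical (1600 is just one of the values floated in this thread), and none of this is the agreed test specification.

```java
import java.time.Year;

/** Hypothetical sketch of a year range check with overridable bounds. */
final class YearRangeCheck {
    /** Illustrative outcome labels, not the BDQ response vocabulary. */
    enum Outcome { IN_RANGE, OUT_OF_RANGE, NOT_A_YEAR }

    // Assumption: 1600 is only one of the lower bounds suggested in the discussion.
    static final int DEFAULT_EARLIEST_YEAR = 1600;

    /** Uses the default bounds: DEFAULT_EARLIEST_YEAR up to the current year. */
    static Outcome check(String year) {
        return check(year, null, null);
    }

    /** A null bound means "parameter not set", so the default for that bound applies. */
    static Outcome check(String year, Integer earliestYear, Integer latestYear) {
        int lower = (earliestYear != null) ? earliestYear : DEFAULT_EARLIEST_YEAR;
        int upper = (latestYear != null) ? latestYear : Year.now().getValue();
        if (year == null) {
            return Outcome.NOT_A_YEAR;
        }
        int value;
        try {
            value = Integer.parseInt(year.trim());
        } catch (NumberFormatException e) {
            return Outcome.NOT_A_YEAR;
        }
        return (value >= lower && value <= upper) ? Outcome.IN_RANGE : Outcome.OUT_OF_RANGE;
    }

    public static void main(String[] args) {
        System.out.println(check("1650"));             // default lower bound of 1600
        System.out.println(check("1650", 1753, null)); // stricter lower bound supplied by the caller
    }
}
```

A collection year of 1650 passes with the 1600 default but is flagged when a caller supplies 1753, which is the false-positive versus false-negative choice discussed above.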

chicoreus commented 4 years ago

Tests should only be parameterized when we have identified user stories in the areas that TG3 examined that clearly have different parts of the community wishing to use different parameters. The only two valid cases that come to my mind right off are application of a particular national taxonomic authority for tests involving scientific names and specifications of the earliest valid date for identifications or eventDates, where particular data sets are known by their users to have earliest valid dates.

Parameters must not point to hypothetical resources that are not available to implementors.

chicoreus commented 4 years ago

@ArthurChapman, yes, if we specify that a test is parameterized, we must specify a default value.

I suspect that the identifiers (guids) for tests should only apply to implementations of those tests that use the default parameter values, and that implementations which take other values should use different guids to allow for machine comparison of results, but as the intent of parameters is to change the test behavior at runtime that might significantly complicate implementation. One alternative (thinking in terms of annotated java methods a la the filtered push implementations), would be to have one identifier refer to a test with the default parameter, and another identifier refer to the same test, but with any other value for the parameter (java implementation on the order of

// Test with the default parameter value; keeps the guid currently specified for the test.
@Provides("baf2a90b-af45-4f1a-839f-47126743a48a")
public DQResponse<AmendmentValue> amendmentYearStandardized(
                       @ActedUpon("dwc:year") String year)
{
    Integer minimumYear = 1753;
    return amendmentYearStandardized(year, minimumYear);
}

// Same test with a caller-supplied parameter value; would need its own guid.
@Provides("ab37fd2a-fe95-4ab6-8a0c-e40ea3f97bb4")
public DQResponse<AmendmentValue> amendmentYearStandardized(
                     @ActedUpon("dwc:year") String year, Integer minimumYear)
{
     // actual test implementation
}

), where the first method uses the guid currently specified for the test, and the second method uses a guid that we would need to specify for parameterized implementations.
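
As a rough illustration of the two-identifier idea, the guid to report could be chosen from the parameter value actually used, so that results from default and non-default runs remain distinguishable by machine. This is only a sketch: the class `TestGuidSelector` and method `guidFor` are hypothetical, and the second guid is the placeholder from the snippet above, not an assigned identifier.

```java
/** Hypothetical helper: picks the identifier to report for a parameterized test run. */
final class TestGuidSelector {
    // Guid currently specified for the test (default parameter value).
    static final String DEFAULT_GUID = "baf2a90b-af45-4f1a-839f-47126743a48a";
    // Placeholder guid from the snippet above, for runs with any other value.
    static final String NON_DEFAULT_GUID = "ab37fd2a-fe95-4ab6-8a0c-e40ea3f97bb4";
    // Default minimum year used in the snippet above.
    static final int DEFAULT_MINIMUM_YEAR = 1753;

    /** A null minimumYear means the parameter was not set, so the default applies. */
    static String guidFor(Integer minimumYear) {
        boolean usesDefault =
                (minimumYear == null) || (minimumYear.intValue() == DEFAULT_MINIMUM_YEAR);
        return usesDefault ? DEFAULT_GUID : NON_DEFAULT_GUID;
    }

    public static void main(String[] args) {
        System.out.println(guidFor(null)); // parameter not set
        System.out.println(guidFor(1600)); // caller-supplied value
    }
}
```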

Tasilee commented 4 years ago

@ArthurChapman and I have been discussing 'Needs work' tagged tests and resolved a few, but there are three remaining. Also, a question to the rest of you about the Expected Response regarding specified source authority. Should we

  1. leave the phrase specified source authority or should we use
  2. bdq:sourceAuthority ?
chicoreus commented 4 years ago

@Tasilee the updates to make the parameter values structured and consistent are great.

chicoreus commented 4 years ago

Significant remaining problem: A very large number of the tests which take parameters should not be parameterized. I've noted this on #20: only tests for which we have use cases where different user communities will expect the tests to behave in different ways should be parameterized (such as a country wishing to validate scientific names against a national list rather than a global one). We must not specify parameters that point implementors to a resource from which the controlled vocabulary for a particular test can be found; that is something for the notes. When the specification says, e.g., compliant if matching ISO vocabulary x, then the implementor must use that vocabulary, and where they get it and how they get it is an implementation detail, not a parameter.

All of the tests that have parameters need careful review to see if there is a clear use case for different users to expect different behaviors of the test for different uses, not whether or not there are multiple possible sources that could be used for some vocabulary.

chicoreus commented 4 years ago

We have 41 tests that specify parameters. It looks to me like only 18 of those are actually candidates for parameterization, and each of these needs careful consideration and identification of the use cases that require the test to be parameterized.

| No. | Name | Parameter |
|---|---|---|
| 84 | VALIDATION_YEAR_OUTOFRANGE | bdq:earliestDate = 1600, bdq:latestDate = current year |
| 107 | VALIDATION_MINDEPTH-MAXDEPTH_OUTOFRANGE | bdq:minimumValidDepthInMeters = 0, bdq:maximumValidDepthInMeters = 11000 |
| 112 | VALIDATION_MAXELEVATION_OUTOFRANGE | bdq:minimumValidElevationInMeters = -423, bdq:maximumValidElevationInMeters = 8850 |
| 122 | VALIDATION_GENUS_NOTFOUND | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 123 | VALIDATION_CLASSIFICATION_AMBIGUOUS | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 22 | VALIDATION_PHYLUM_NOTFOUND | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 28 | VALIDATION_FAMILY_NOTFOUND | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 45 | AMENDMENT_POLYNOMIAL_STANDARDIZED | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 46 | VALIDATION_POLYNOMIAL_NOTSTANDARD | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 57 | AMENDMENT_TAXONID_FROM_TAXON | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 70 | VALIDATION_TAXON_AMBIGUOUS | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 71 | AMENDMENT_SCIENTIFICNAME_FROM_TAXONID | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 77 | VALIDATION_CLASS_NOTFOUND | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 81 | VALIDATION_KINGDOM_NOTFOUND | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 83 | VALIDATION_ORDER_NOTFOUND | bdq:sourceAuthority (default = https://www.gbif.org/en/developer/species) |
| 76 | VALIDATION_DATEIDENTIFIED_OUTOFRANGE | Default values: bdq:earliestDate = 1753-01-01, bdq:latestDate = current day |
| 36 | VALIDATION_EVENTDATE_OUTOFRANGE | Default values: bdq:earliestValidDate = 1600, bdq:latestValidDate = current year |
| 39 | VALIDATION_MINELEVATION_OUTOFRANGE | Default values: bdq:minimumValidElevationInMeters = -428, bdq:maximumValidElevationInMeters = 8850 |
| 102 | AMENDMENT_GEODETICDATUM_ASSUMEDDEFAULT | (but not: bdq:sourceAuthority (default = http://epsg.io/)) |

chicoreus commented 4 years ago

The following tests have parameters and look to me like they very unambiguously must not be parameterized. The resources mentioned should be moved either into the specification or the notes, and not specified as a parameter.

| No. | Name | Parameter |
|---|---|---|
| 106 | AMENDMENT_IDENTIFICATIONQUALIFIER_FROM_TAXON | bdq:sourceAuthority (default = https://dwc.tdwg.org/terms/#identificationQualifier) |
| 59 | VALIDATION_GEODETICDATUM_NOTSTANDARD | bdq:sourceAuthority (default = http://epsg.io/) |
| 60 | AMENDMENT_GEODETICDATUM_STANDARDIZED | bdq:sourceAuthority (default = http://epsg.io/) |
| 51 | VALIDATION_COORDINATES_TERRESTRIALMARINE | bdq:sourceAuthority (default = http://irmng.org) |
| 162 | VALIDATION_TAXONRANK_NOTSTANDARD | bdq:sourceAuthority (default = http://rs.gbif.org/vocabulary/gbif/rank.xml) |
| 163 | AMENDMENT_TAXONRANK_STANDARDIZED | bdq:sourceAuthority (default = http://rs.gbif.org/vocabulary/gbif/rank.xml) |
| 104 | VALIDATION_BASISOFRECORD_NOTSTANDARD | bdq:sourceAuthority (default = http://rs.tdwg.org/dwc/terms/index.htm#basisOfRecord) |
| 63 | AMENDMENT_BASISOFRECORD_STANDARDIZED | bdq:sourceAuthority (default = http://rs.tdwg.org/dwc/terms/index.htm#basisOfRecord) |
| 133 | AMENDMENT_LICENSE_STANDARDIZED | bdq:sourceAuthority (default = https://creativecommons.org/) |
| 38 | VALIDATION_LICENSE_NOTSTANDARD | bdq:sourceAuthority (default = https://creativecommons.org/) |
| 97 | VALIDATION_IDENTIFICATIONQUALIFIER_DETECTED | bdq:sourceAuthority (default = https://dwc.tdwg.org/terms/#identificationQualifier) |
| 115 | AMENDMENT_OCCURRENCESTATUS_STANDARDIZED | bdq:sourceAuthority (default = https://dwc.tdwg.org/terms/#occurrenceStatus) |
| 116 | VALIDATION_OCCURRENCESTATUS_NOTSTANDARD | bdq:sourceAuthority (default = https://dwc.tdwg.org/terms/#occurrenceStatus) |
| 20 | VALIDATION_COUNTRYCODE_NOTSTANDARD | bdq:sourceAuthority (default = https://restcountries.eu/#api-endpoints-list-of-codes) |
| 48 | AMENDMENT_COUNTRYCODE_STANDARDIZED | bdq:sourceAuthority (default = https://restcountries.eu/#api-endpoints-list-of-codes) |
| 62 | VALIDATION_COUNTRY_COUNTRYCODE_INCONSISTENT | bdq:sourceAuthority (default = https://restcountries.eu/#api-endpoints-list-of-codes) |
| 73 | AMENDMENT_COUNTRYCODE_FROM_COORDINATES | bdq:sourceAuthority (default = https://restcountries.eu/#api-endpoints-list-of-codes) |
| 50 | VALIDATION_COORDINATES_COUNTRYCODE_INCONSISTENT | bdq:sourceAuthority (default = https://www.iso.org/obp/ui) |
| 118 | AMENDMENT_GEOGRAPHY_STANDARDIZED | bdq:sourceAuthority (default = The Getty Thesaurus of Geographic Names (TGN: http://www.getty.edu/research/tools/vocabularies/tgn/index.html)) |
| 139 | VALIDATION_GEOGRAPHY_NOTSTANDARD | bdq:sourceAuthority (default = The Getty Thesaurus of Geographic Names (TGN: http://www.getty.edu/research/tools/vocabularies/tgn/index.html)) |
| 21 | VALIDATION_COUNTRY_NOTSTANDARD | bdq:sourceAuthority (default = The Getty Thesaurus of Geographic Names (TGN: http://www.getty.edu/research/tools/vocabularies/tgn/index.html)) |
| 95 | VALIDATION_GEOGRAPHY_AMBIGUOUS | bdq:sourceAuthority (default = The Getty Thesaurus of Geographic Names (TGN: http://www.getty.edu/research/tools/vocabularies/tgn/index.html)) |

ArthurChapman commented 4 years ago

@chicoreus I will look at this in detail when I get back home (away at the moment), but the Geodetic Datum (#102, #59, #60) ones should be Parameterized as different jurisdictions use different defaults (some by legislation - e.g. Brazil) and WGS84 may not always be the best default. In Brazil, for example, if no datum is specified, you can be nearly certain that the default is either SAD69(96) or SIRGAS2000 (depending on the date). Also many jurisdictions are using Coordinate Reference Systems (CRS) rather than datums as these are more often than not what is being given on GPS units. I will check their wording later. Like you, I think we have unnecessarily made too many tests Parameterized. @tucotuco may have good reasons for some of these, but I think we need to justify each test. Perhaps there are comments with justifications under the individual tests - I will check later.

chicoreus commented 4 years ago

@ArthurChapman looks like #102 should be parameterized, while #59 and #60 should not. Added notes in those issues.

chicoreus commented 4 years ago

I've updated the tables in the comments above accordingly, moving #102 into should be parameterized.

ArthurChapman commented 4 years ago

Having looked at your list @chicoreus for tests that "shouldn't" be Parameterized - I have the following comments.

| No. | Name | Comment |
|---|---|---|
| 106 | AMENDMENT_IDENTIFICATIONQUALIFIER_FROM_TAXON | I think this was so people could specify the characters to look for ("?", "cf.", "aff.") or add others. I'd be happy either way with this one. |
| 102 | AMENDMENT_GEODETICDATUM_ASSUMEDDEFAULT | As I noted in the previous comment - should be Parameterized |
| 59 | VALIDATION_GEODETICDATUM_NOTSTANDARD | Should not be Parameterized |
| 60 | AMENDMENT_GEODETICDATUM_STANDARDIZED | Should not be Parameterized |
| 51 | VALIDATION_COORDINATES_TERRESTRIALMARINE | This one was parameterized because there are two ways of checking for isMarine: 1) using GIS/Google Maps to determine whether the point is on land; 2) checking against a list of marine species. We could decide to use only one method and then remove the Parameterization |
| 162 | VALIDATION_TAXONRANK_NOTSTANDARD | I would be happy for us to decide to go with the GBIF Rank Vocabulary (there is no real alternative) and remove Parameterization |
| 163 | AMENDMENT_TAXONRANK_STANDARDIZED | I would be happy for us to decide to go with the GBIF Rank Vocabulary (there is no real alternative) and remove Parameterization |
| 104 | VALIDATION_BASISOFRECORD_NOTSTANDARD | I would be happy for us to decide to go with the DwC recommended (it can always be formalised later) and remove Parameterization |
| 63 | AMENDMENT_BASISOFRECORD_STANDARDIZED | I would be happy for us to decide to go with the DwC recommended (it can always be formalised later) and remove Parameterization |
| 133 | AMENDMENT_LICENSE_STANDARDIZED | Problem I see here is that we are following dcterms:license - which could be broader than just Creative Commons. Do we wish to restrict to Creative Commons, or allow other license conditions to be valid, and thus allow someone to choose a different vocabulary? |
| 38 | VALIDATION_LICENSE_NOTSTANDARD | Problem I see here is that we are following dcterms:license - which could be broader than just Creative Commons. Do we wish to restrict to Creative Commons, or allow other license conditions to be valid, and thus allow someone to choose a different vocabulary? |
| 97 | VALIDATION_IDENTIFICATIONQUALIFIER_DETECTED | I think this was so people could specify the characters to look for ("?", "cf.", "aff.") or add others. I'd be happy either way with this one. |
| 115 | AMENDMENT_OCCURRENCESTATUS_STANDARDIZED | Currently, DwC only recommends "present" and "absent". I understand some would like this broadened. But as it stands with only two options, I don't see why it should be Parameterized unless a community (invasives?) wants to use a different vocabulary. @tucotuco parameterized this - what was the thinking? A paper currently in press is recommending modification to include a third term "doubtful" - but whether or not this is accepted, I only see the one vocabulary that we would be using, and hopefully it will eventually be formalised beyond a mere DwC recommendation. I thus don't see a strong justification for Parameterization |
| 116 | VALIDATION_OCCURRENCESTATUS_NOTSTANDARD | See comment above. |
| 20 | VALIDATION_COUNTRYCODE_NOTSTANDARD | As noted in a comment under #20, I see no reason for Parameterization |
| 48 | AMENDMENT_COUNTRYCODE_STANDARDIZED | As noted in a comment under #20, I see no reason for Parameterization |
| 62 | VALIDATION_COUNTRY_COUNTRYCODE_INCONSISTENT | As noted in a comment under #20, we refer in the description to an ISO code, so I see no reason for Parameterization |
| 73 | AMENDMENT_COUNTRYCODE_FROM_COORDINATES | This might be a more difficult one as the ISO Standard doesn't have geographic boundaries. So there may need to be some variation in what one chooses as the method for determining boundaries. We still have to decide on this.... |
| 50 | VALIDATION_COORDINATES_COUNTRYCODE_INCONSISTENT | Similar to #73 |
| 118 | AMENDMENT_GEOGRAPHY_STANDARDIZED | The geography ones, I am not sure about - we need further discussion on these and what we should use. TGN may be OK for some - Google Maps for others???? There is a discussion somewhere under an issue that I can't find at the moment. |
| 139 | VALIDATION_GEOGRAPHY_NOTSTANDARD | See comment above under #118 |
| 21 | VALIDATION_COUNTRY_NOTSTANDARD | See comment above under #118 |
| 95 | VALIDATION_GEOGRAPHY_AMBIGUOUS | See comment above under #118 |

ArthurChapman commented 4 years ago

Agreed @chicoreus re #102, #59 and #60. #102 Parameterized, #59 and #60 not - with bdq:sourceAuthority=http://epsg.io/

ArthurChapman commented 4 years ago

Copied from #102, as the comment is applicable to more than just that test: With all tests (especially NOTSTANDARD and STANDARDIZED tests) that use an external Standard - ISO, DCMI, EPSG, or any Vocabulary - the vocabulary, standard, etc. is the bdq:sourceAuthority, and you are checking to see if the value in the record is a valid record in the bdq:sourceAuthority (in the case of Validations) or can be amended to conform with a value in the bdq:sourceAuthority (in the case of Amendments). In nearly all cases, there is only one sourceAuthority (except, as @chicoreus mentions, with Taxon names), so there is no choice of sourceAuthority needed, only the choice of a value from that sourceAuthority. In those few cases where there is a choice of sourceAuthority (taxon names) you require both 1) a choice of bdq:sourceAuthority, and 2) a choice of value within that source authority. Thus, I agree with @chicoreus that we don't need as many Parameterized tests as we have previously so tagged, unless @tucotuco has justifications for them that we have not thought of.
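
The distinction drawn here between Validations and Amendments against a single bdq:sourceAuthority can be sketched in a few lines of Java. This is an illustration under stated assumptions, not an implementation of any particular test: the class `VocabularyCheck` is hypothetical, the vocabulary is passed in as a plain set of lower-case standard values, and the only amendment attempted is trimming and lower-casing.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch: one vocabulary plays the role of bdq:sourceAuthority. */
final class VocabularyCheck {
    // Assumption: the standard values are supplied in lower case.
    private final Set<String> standardValues;

    VocabularyCheck(Set<String> standardValues) {
        this.standardValues = Set.copyOf(standardValues);
    }

    /** Validation: is the value exactly one of the standard values? */
    boolean isStandard(String value) {
        return value != null && standardValues.contains(value);
    }

    /**
     * Amendment: proposes a standard value when trimming and lower-casing yields one.
     * Empty when the value is already standard or cannot be matched.
     */
    Optional<String> amend(String value) {
        if (value == null || isStandard(value)) {
            return Optional.empty();
        }
        String candidate = value.trim().toLowerCase(Locale.ROOT);
        return standardValues.contains(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    public static void main(String[] args) {
        // "present" and "absent" are the two occurrenceStatus values mentioned in this thread.
        VocabularyCheck occurrenceStatus = new VocabularyCheck(Set.of("present", "absent"));
        System.out.println(occurrenceStatus.isStandard("Present ")); // validation of a non-standard value
        System.out.println(occurrenceStatus.amend("Present "));      // proposed amendment, if any
    }
}
```

Note that this sketch can only repair values that differ from a standard value by case or whitespace; anything beyond that needs a mapping from variants to preferred values.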

ArthurChapman commented 4 years ago

#133 and #38 I think should be Parameterized - see my comments in the table above, i.e. "Problem I see here is that we are following dcterms:license - which could be broader than just Creative Commons. Do we wish to restrict to Creative Commons, or allow other license conditions to be valid, and thus allow someone to choose a different vocabulary?" I am also concerned that some jurisdictions may legislate the licences that can be used within that jurisdiction, and these may not be Creative Commons.

Tasilee commented 4 years ago

Thanks @chicoreus and @ArthurChapman. Reading through the table and your comments Arthur, here is my take on it. Maybe after a Pinot Noir or two, I would think differently.

106 - Parameterised

102 - Parameterised

59 - Not parameterised

60 - Not parameterised

51 - Parameterised (for now)

162 - Not parameterised

163 - Not parameterised

104 - Not parameterised

63 - Not parameterised

133 - Parameterised

38 - Parameterised

97 - Parameterised

115 - Not parameterised

116 - Not parameterised

20 - Not parameterised

48 - Not parameterised

62 - Not parameterised

73 - Parameterised

50 - Parameterised

118 - Parameterised

139 - Parameterised

21 - Parameterised

95 - Parameterised

@tucotuco : We would value your discerning eye (or two) on this lot. I'll hold off edits for a response. I hope all is ok over there.

ArthurChapman commented 4 years ago

I think you missed a few, @Tasilee.

Parameterized

#22, #28, #36, #38, #45, #46, #57, #70, #71, #76, #77, #81, #83, #84, #79, #102, #107, #112, #122, #123, #133

Not Parameterized

#20, #21, #48, #50, #51, #59, #60, #62, #63, #73, #95, #97, #104, #106, #115, #116, #118, #139, #162, #163

@tuco might particularly like to comment on (see my table and comments above) #51, #115, #116, #73, #50, #118, #139, #21, #95

Tasilee commented 4 years ago

@ArthurChapman: I was using the table only..so will add missing into here. And BTW, you also missed #39 (Parameterised), #79 isn't parameterised:

20 - Not parameterised

21 - Parameterised

22 - Parameterised

28 - Parameterised

36 - Parameterised

38 - Parameterised

39 - Parameterised

45 - Parameterised

46 - Parameterised

48 - Not parameterised

50 - Parameterised

51 - Parameterised (for now)

57 - Parameterised

59 - Not parameterised

60 - Not parameterised

62 - Not parameterised

63 - Not parameterised

70 - Parameterised

71 - Parameterised

73 - Parameterised

76 - Parameterised

77 - Parameterised

79 - Not parameterised

81 - Parameterised

83 - Parameterised

84 - Parameterised

95 - Parameterised

97 - Parameterised

102 - Parameterised

104 - Not parameterised

106 - Parameterised

107 - Parameterised

112 - Parameterised

115 - Not parameterised

116 - Not parameterised

118 - Parameterised

122 - Parameterised

123 - Parameterised

133 - Parameterised

139 - Parameterised

162 - Not parameterised

163 - Not parameterised

Tasilee commented 4 years ago

I am presuming for the Not parameterised above, we move any reference to a default source authority to the References section? That is, the Parameter field is EMPTY.

ArthurChapman commented 4 years ago

@Tasilee I guess that would make sense, however it doesn't distinguish the default or target source Authority from any other reference. Perhaps we should put them in the Reference but as "bdq:sourceAuthority=xxxxxxx" and then the other references

Tasilee commented 4 years ago

@ArthurChapman - that seems like a good strategy. I'll tackle the updates on Monday to give @tucotuco and @pzermoglio a chance to comment.

tucotuco commented 4 years ago

Sorry folks, though I think there are a couple of good catches in this discussion, I am afraid that some of it will take us into circular reasoning. I think most of the tests that were tagged to be parametrized were correctly so. A big part of my stance on this is hidden in a comment to issue #63 (https://github.com/tdwg/bdq/issues/63#issuecomment-491877591). Basically, Darwin Core is not a source authority for values. But that is only part of the issue. The other is that we can't make standardizations without a thesaurus (or at least a simple lookup table) - controlled vocabularies are not enough. This is the reason we brought TG4 into existence, recognizing this fundamental need to develop the tests in tandem with the vocabularies that allow them to actually function.

Some specific comments...

I would like to challenge this statement by @chicoreus: "Tests should only be parameterized when we have identified user stories in the areas that TG3 examined that clearly have different parts of the community wishing to use different parameters."

Why? Can't it be evident aside from the work in TG3? Are the results of TG3 exhaustive for all time?

I would also like to propose an amendment to the statement by @chicoreus:

"Parameters must not point to hypothetical resources that are not available to implementors."

Instead of "Parameters", this should be "Default sources".

@Tasilee asked "Should we

  1. leave the phrase specified source authority or should we use
  2. bdq:sourceAuthority ?"

I vote for bdq:sourceAuthority. For example, change "using a specified source authority service" to "using the bdq:sourceAuthority".

I would like to challenge this statement by @chicoreus:

"We must not specify parameters that point implementors to a resource from which the controlled vocabulary for a particular test can be found, that is something for the notes. When the specification says, e.g. compliant if matching ISO vocabulary x, then the implementor must use that vocabulary, and where they get it and how they get it is an implementation detail, not a parameter."

I agree for VALIDATION tests where the vocabulary is written in stone. This is not true of most Darwin Core terms, which make recommendations, not requirements. The philosophy has always been to decouple requirements from definitions wherever possible. All of the AMENDMENT_ tests need a parameter to point to a source for the lookups. If we only used controlled vocabularies, we couldn't do any standardization, because only the standard values would be found, not the values from which the standard values would be determined. I do agree that there is a subset of tests that we currently have as parametrized that need not be. To me, these are only #20 (TG2-VALIDATION_COUNTRYCODE_NOTSTANDARD), #21 (TG2-VALIDATION_COUNTRY_NOTSTANDARD), #59 (TG2-VALIDATION_GEODETICDATUM_NOTSTANDARD), #79 (TG2-VALIDATION_DECIMALLATITUDE_OUTOFRANGE), and #162 (TG2-VALIDATION_TAXONRANK_NOTSTANDARD). #21 and #59 will need to be explicit about the expectations. For example, for #21, it must be explicit whether the preferred name is the standard name, or whether any of the names or codes are acceptable standard names. For #59, it will need to be made explicit whether the EPSG code is the only standard (because it's the only thing that is unambiguous), or if any of the names in Geodetic CRS, Datum, or Ellipsoid are also acceptable.

Again, sorry, especially that it took this long to respond, but it was unavoidable.

ArthurChapman commented 4 years ago

One issue that @tucotuco's comments bring up is the urgent need for Vocabularies of Values to be created for all the Darwin Core terms that are currently referred to in the tests. Perhaps TG4 (at Leiden?) needs to establish a working group under the TG with the remit to create as many Vocabularies of Values for those terms as is possible in the short term (especially beginning with the easy ones). Some, I think, only have a limited number of terms, but we will need to formalise them under the format that TG4 is proposing to develop. I guess a first step is to make a list, with an assessment of what is required, and a work program. @pzermoglio something for the agenda in Leiden - perhaps discuss informally on the Sunday.

Tasilee commented 4 years ago

Thanks @tucotuco. Good to have your insights again, but I am struggling. I will repeat a comment I made somewhere among the tests. We have two scenarios for Parameterised

  1. Genuine options for bdq:sourceAuthority (e.g., #28) and
  2. Options for a default value (e.g., #133 )

Your comment "we can't make standardizations without a thesaurus (or at least a simple lookup table) - controlled vocabularies are not enough" focuses on the second scenario. But surely we can't anticipate every possible misspelling or incorrectly interpreted 'value' to lookup? I guess I am assuming in at least some of the AMENDMENTS, that we are using pattern matching in the test code to have a stab at interpreting a potential target. Take the example in #133

dc:license="CCZero" becomes dc:license="https://creativecommons.org/publicdomain/zero/1.0/", following the Creative Commons vocabulary.

@tucotuco: You are implying that we have a thesaurus that contains "CCZero"?

As usual, I am probably missing something.

Also, I have to bow to your Darwin Core philosophy: "Darwin Core is not a source authority for values". Our tests are Darwin Core based (and hence scenario 1 above is not applicable), but scenario 2 is. We are indeed stuffed in terms of vocabs (let alone thesauri), hence TG4, but we need to grab onto any straw we currently have, and DwC 'values' are a 'port in a storm'?

ArthurChapman commented 4 years ago

@Tasilee I think we do need vocabularies/thesauri. License is a difficult one - but CCZero could = CC0 (1.0) or CC0 (1.0) Universal, etc. and then link to https://creativecommons.org/publicdomain/zero/1.0/. Also with many of the earlier Creative Commons there were many Ports (versions in different languages - see for example, https://creativecommons.org/tag/porting/). Version 4.0 is supposed to be a Universal set without the need for Porting, and that is encouraged for all new uses. A thesaurus would hopefully list these and (maybe) synonymise many.

@tucotuco has extracted the licensing records from GBIF. Many (the majority) are in the form of "ex coll. " These aren't very helpful as they just refer back to the original institution, etc. I am looking through the list to see if we can extract a basic set of options - especially with CC, but in addition there are various country licenses (e.g. http://open.canada.ca/en/open-government-licence-canada) and there are ODC licenses (Open Data Commons) - e.g. Open Data Commons Attribution License: http://www.opendatacommons.org/licenses/by/1.0/. I will see what I can come up with when I get time.

tucotuco commented 4 years ago

I am saying explicitly, not implying, that we have thesauri for vocabularies of terms that need to be cleaned. So yes, a license lookup that says 'CCzero' is a synonym of the unequivocally preferred term 'https://creativecommons.org/publicdomain/zero/1.0/'.
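A minimal sketch of such a license lookup, in Python. The table name, the function name, and the case-insensitive matching are assumptions made for illustration; only the 'CCzero' synonym and the preferred URI come from this thread, and a real thesaurus would be community-maintained rather than hard-coded.

```python
from typing import Optional

# Preferred term named in this thread.
CC0_URI = "https://creativecommons.org/publicdomain/zero/1.0/"

# Hypothetical thesaurus: normalized variant -> preferred term.
# Keys are stored lower-case; matching case-insensitively is an assumption.
LICENSE_THESAURUS = {
    "cczero": CC0_URI,
    CC0_URI.lower(): CC0_URI,  # the preferred term maps to itself
}


def preferred_license(value: str) -> Optional[str]:
    """Return the preferred term for a raw dc:license value, or None if unknown."""
    key = value.strip().lower()
    return LICENSE_THESAURUS.get(key)
```

Values that are not in the table return `None` rather than a guess, so the mapping stays with whoever curates the thesaurus and not with pattern-matching code.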

My point is that values alone don't help us do any lookups - whether taken from the examples given in Darwin Core (examples are no longer even canonical) or elsewhere. Pattern matching is an implementation solution, not a community data-driven one, which means we would rely on tech people to make the mappings, not on the people who know (and are even responsible for) the state of the domain.

I do not see two scenarios. Both examples need a source authority and we decided that all tests that take a parameter should have a default value for that parameter. To me it is best to be able to specify the source authority when there isn't a single definitive option. This is in order to decouple the test and the data used for the test, so that tests are less likely to be implementation dependent. Imagine certifying an implementation
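One way to picture a parameter with a default, as a hedged Python sketch: the function name `amend_license`, the shape of the source authority (a plain mapping), and the default table are all hypothetical. The thread only establishes that a parameterized test should accept a source authority and fall back to a default when none is given.

```python
from typing import Mapping, Optional

# Hypothetical default source authority, used when no parameter is supplied.
DEFAULT_LICENSE_AUTHORITY: Mapping[str, str] = {
    "cczero": "https://creativecommons.org/publicdomain/zero/1.0/",
}


def amend_license(
    value: str,
    source_authority: Optional[Mapping[str, str]] = None,
) -> str:
    """Standardize a dc:license value against a source authority.

    The authority is a parameter, so the test logic is decoupled from the
    data it consults. Unknown values are returned unchanged.
    """
    if source_authority is None:
        authority = DEFAULT_LICENSE_AUTHORITY
    else:
        authority = source_authority
    return authority.get(value.strip().lower(), value)
```

Calling `amend_license("CCZero")` uses the default, while passing a different mapping swaps the authority without touching the test code.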

We can't effectively anticipate every possible nonsense that might come along. I agree. We don't need to. But we can certainly create a lookup of every bit of nonsense that has been seen so far, and we can strive for an infrastructure that accumulates new nonsense as it arises and lets us provide the lookups for those as we move forward.

I hope that helps explain where I am coming from.

Tasilee commented 4 years ago

@tucotuco - "Pattern matching is an implementation solution". I agree. I was unaware of the extent to which thesauri apply to our issues - a more 'standard' solution that is openly accessible and hopefully understandable.

This reminds me of the eureka moment aeons ago in TDWG (TIP days) when I realized that we needed an effective environment for the creation and management of ontologies. We needed an environment created by 'programmers' that made it easy to add terms, definitions and relationships. As far as I am aware, such a user (application domain specialist)-centric environment still doesn't exist (but I could be wrong as I have not recently researched it).

I think such an environment for biodiversity informatics-related thesauri (term -> preferred standard term, definition, comments and links etc) would be nice. A wiki style of management? A list by itself is a start, but when isolated and without provenance, is less than optimal. Governance is a key issue. If there is an 'authority', grand, but the system still needs to be open to public comment for efficient improvements.
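As a rough illustration of what one record in such a thesaurus might hold, here is a Python sketch. The class and field names are hypothetical; they simply mirror the list above (term, preferred standard term, definition, comments, links) plus a provenance field, and the example entry's comment and source text are placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ThesaurusEntry:
    """One mapping from an observed term to its preferred standard term."""

    term: str                       # value as seen in the data
    preferred_term: str             # the standard term it should become
    definition: str = ""
    comments: str = ""
    links: Tuple[str, ...] = ()
    source: str = ""                # provenance: who or what supplied the mapping


# Example entry built from the license case discussed in this thread.
entry = ThesaurusEntry(
    term="CCZero",
    preferred_term="https://creativecommons.org/publicdomain/zero/1.0/",
    comments="Synonym discussed in tdwg/bdq issue #178.",
    links=("https://github.com/tdwg/bdq/issues/133",),
    source="placeholder: curator or authority goes here",
)
```

Keeping provenance on every entry is what separates this from an isolated list: each mapping can be traced, questioned, and corrected.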

tucotuco commented 4 years ago

I totally agree. I think ontology management has progressed well and has viable environments and tools. Some of our vocabs would be best accommodated by ontologies, especially basisOfRecord. For the rest, I think it is high time we dive in and play with what Tim has to offer.

Tasilee commented 4 years ago

We have a quorum to CLOSE.