iDigBioBot opened 6 years ago
Comment by Paula Zermoglio (@pzermoglio) migrated from spreadsheet: It would seem that a scientificName consistency test is needed: scientificName is consistent with what's provided in genus, specificEpithet, etc. Added a test at the bottom. Also, I believe the converse tests should be included: genus, specificEpithet, infraspecificEp, sciNameAut completed from sciName. "GENUS_FROM_SCI_NAME" and the like
Comment by Paul Morris (@chicoreus) migrated from spreadsheet: This can't be implemented until dwc:genericEpithet is approved. dwc:genus is NOT the atomic parse of the genus from the scientific name; it is the genus into which the occurrence is classified, and for types these two can differ.
Comment by Arthur Chapman (@ArthurChapman) migrated from spreadsheet: I don't understand @PJM - a Genus CAN be parsed from a binomial by definition - at least in the Botanical Code. The Zoological Code doesn't include the concept of a 'Specific Epithet', whereas the Botanical Code does (I am not up to date on the Zoological Code, but there was some discussion on adopting the concept from the Botanical Code). As I understand both codes, "GENUS" can be standalone and does not need a separate GENUS Epithet concept.
Comment by Paul Morris (@chicoreus) migrated from spreadsheet: but we can't implement this until dwc:genericEpithet is approved.
The phrasing "scientificName was added" needs a clearer specification: "added" creates ambiguity about intention, leaving it unclear whether implementors should only fill in an empty scientificName or whether existing values should be changed.
I've commented on the issues noted in @chicoreus's email of September 1. Should that email be raised as a new (GitHub) issue? It would be good to document this more consistently.
From @chicoreus : #71 ... AMENDED if dwc:scientificName was EMPTY and a value was added from a lookup of the dwc:taxonID in the bdq:sourceAuthority; otherwise NOT_CHANGED
Suggestion: We usually add the prerequisites to the INTERNAL_PREREQUISITES_NOT_MET part rather than to the AMENDED part, so I suggest moving the "dwc:scientificName was NOT_EMPTY" condition. Thus:
EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority service was not available; INTERNAL_PREREQUISITES_NOT_MET if the field dwc:taxonID is EMPTY, the value of dwc:taxonID is ambiguous or the dwc:scientificName was NOT_EMPTY; AMENDED if value was added from a lookup of the dwc:taxonID in the bdq:sourceAuthority; otherwise NOT_CHANGED
Thanks @ArthurChapman - I agree that where possible, we include such tests in the INTERNALs. That reads well to me. Editing.
I have changed Expected response to "EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority service was not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:taxonID is EMPTY, the value of dwc:taxonID is ambiguous or dwc:scientificName was not EMPTY; AMENDED dwc:scientificName from a successful lookup of dwc:taxonID in the bdq:sourceAuthority; otherwise NOT_CHANGED"
As noted elsewhere, we need to decide where to use "the value of dwc:..." as against "dwc:...".
the value of dwc:taxonID is ambiguous vs dwc:scientificName was not EMPTY
Also noted another reversion to NOT_EMPTY!
Note @Tasilee that in the TG2 Vocabulary (#152) we have the term NOTEMPTY (A field that is present and has content.) Do we need to change the term in #152?
@Tasilee "the value of dwc:taxonID is ambiguous or dwc:scientificName was not EMPTY;" is probably a good example: "the value of dwc:x is ambiguous" talks explicitly about the value, while "dwc:x is EMPTY" indicates that the term is empty, one option within that scope being that the value is an empty string.
@ArthurChapman, if we need both EMPTY and NOT_EMPTY, then we should probably define NOT_EMPTY as simply the logical inverse of EMPTY; if we don't need it, then we could reference "not EMPTY" in the specifications.
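A minimal sketch of that reading, assuming EMPTY means the term is absent or has no content (the helper names are illustrative only, not part of the vocabulary):

```python
def is_empty(value):
    # EMPTY: the term is not present, or is present but contains no content.
    # Treating whitespace-only values as having no content is an assumption here.
    return value is None or str(value).strip() == ""

def is_not_empty(value):
    # NOT_EMPTY defined simply as the logical inverse of EMPTY.
    return not is_empty(value)
```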
BTW, we have three tests with labels:
TG2-NOTIFICATION_ANNOTATION_NOTEMPTY
TG2-NOTIFICATION_DATAGENERALIZATIONS_NOTEMPTY
TG2-NOTIFICATION_ESTABLISHMENTMEANS_NOTEMPTY
Currently all references in Expected responses are now "not EMPTY" so I would concur with @chicoreus
I think it is in #152 because of the three test names. We can leave it there as that definition applies to those three. But in the tests use not EMPTY.
I just checked the example and it needed to be amended to "https://api.gbif.org/v1/species/8102122" (note "/v1"). I suspect a few more of these may be in GitHub. I will see what I can find. This issue is the only one as far as I can tell.
The Expected Response here contains "INTERNAL_PREREQUISITES_NOT_MET if dwc:taxonID is EMPTY", but the paired VALIDATION https://github.com/tdwg/bdq/issues/120 will have already checked this, so is this component of the Expected Response redundant? There are similar situations with many of the AMENDMENTs. In other words, would we even run the AMENDMENT or just report automatically?
@Tasilee All of the tests have to be defined as if run in isolation. A linear workflow with a validation before a specific amendment is only one possible alternative. Thus the prerequisites for an amendment would be expected to overlap with validations.
We did discuss a workflow but then it was sort of agreed that each test needed to be run in isolation as stated by @chicoreus. I don't think we discussed it fully (need another face to face) but from memory, it was thought different institutions may run the tests differently, or only run some tests and not others and thus they needed to be standalone.
Thanks @ArthurChapman. I also vaguely remember such a discussion about each being somewhat independent (but AMENDMENTs are, the way we designed them, dependent on their equivalent VALIDATIONs). When it comes to generating the test data, the chooks come home to roost.
@Tasilee the amendments relate to, but are not dependent on, validations. One expected workflow is to run all validations in parallel, then run all amendments in parallel, then run all validations again with all amendments accepted, to measure how much a data set might have its fitness increased by accepting the annotations. Another plausible workflow is to run all amendments followed by all validations, accepting amended data that has passed all the tests. A core requirement is that each test be able to stand on its own.
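A sketch of the first of those workflows (validate, amend, accept, re-validate to estimate the fitness gain); the function signature and record handling are assumptions for illustration, not part of the test specifications:

```python
def estimate_fitness_gain(records, validations, amendments, apply_proposal):
    """Hypothetical sketch: records are dicts of Darwin Core terms,
    validations are callables record -> Response.result (e.g. "COMPLIANT"),
    amendments are callables record -> proposed change (or None),
    apply_proposal merges an accepted proposal into a copy of a record."""

    def compliant_count(recs):
        return sum(v(r) == "COMPLIANT" for r in recs for v in validations)

    before = compliant_count(records)

    amended = []
    for record in records:
        updated = dict(record)
        for amend in amendments:
            proposal = amend(record)  # each amendment runs in isolation on the original record
            if proposal is not None:
                updated = apply_proposal(updated, proposal)
        amended.append(updated)

    # Re-run all validations with all amendments accepted and compare.
    return compliant_count(amended) - before
```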
Changed "AMENDED" to "FILLED_IN" in accordance with discussions April 16.
Amended Example to align with @chicoreus email comments 17th June 2022.
Email discussion on the Expected Response as per similar issue with #56. In this case, it is the repeat of the "ambiguity" of dwc:taxonID that worries me.
EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:taxonID is EMPTY, the value of dwc:taxonID is ambiguous or dwc:scientificName was not EMPTY; FILLED_IN the value of dwc:scientificName if the value of dwc:taxonID could be unambiguously interpreted as a value in bdq:sourceAuthority; otherwise NOT_AMENDED
My point is that ambiguity in dwc:taxonID will result in INTERNAL_PREREQUISITES_NOT_MET, so the second check "dwc:taxonID could be unambiguously interpreted as a value in bdq:sourceAuthority" will never be activated.
So in this case, I'd suggest we remove the first occurrence to have
EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:taxonID is EMPTY or dwc:scientificName was not EMPTY; FILLED_IN the value of dwc:scientificName if the value of dwc:taxonID could be unambiguously interpreted as a value in bdq:sourceAuthority; otherwise NOT_AMENDED
True?
I agree with this - but in line with #56 should we add invalid into the INTERNAL_PREREQUISITES_NOT_MET or does this just complicate the issue?
e.g. EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:taxonID is EMPTY or invalid or dwc:scientificName was not EMPTY; FILLED_IN the value of dwc:scientificName if the value of dwc:taxonID could be unambiguously interpreted as a value in bdq:sourceAuthority; otherwise NOT_AMENDED
I think I would be happy either way on this - but for consistency?
I'm happy enough with adding the "invalid" as I presume from @chicoreus, we can detect the 'invalidity' and that is totally different from the 'ambiguity' aspect? @tucotuco and @chicoreus ? How say you?
It is looking like we need to document a clear and concise rule for this type of issue.
On Sun, 26 Feb 2023 15:16:22 -0800, Arthur Chapman wrote:
I agree with this - but in line with #56 should we add invalid into the INTERNAL_PREREQUISITES_NOT_MET or does this just complicate the issue?
e.g. EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:taxonID is EMPTY or invalid or dwc:scientificName was not EMPTY; FILLED_IN the value of dwc:scientificName if the value of dwc:taxonID could be unambiguously interpreted as a value in bdq:sourceAuthority; otherwise NOT_AMENDED
I think I would be happy either way on this - but for consistency?
For this one, I think not, as there isn't any easy test for invalidity of a taxonID value (unlike dates, geodetic datum values, etc.).
On Sun, 26 Feb 2023 16:20:02 -0800, Lee Belbin wrote:
I'm happy enough with adding the "invalid" as I presume from @chicoreus, we can detect the 'invalidity' and that is totally different from the 'ambiguity' aspect? @tucotuco and @chicoreus ? How say you?
Avoiding "invalid" by instead asserting "not found in the source authority", how about:
EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:taxonID is EMPTY, or is not found in the bdq:sourceAuthority, or dwc:scientificName was not EMPTY; FILLED_IN the value of dwc:scientificName if the value of dwc:taxonID could be unambiguously interpreted as a value in bdq:sourceAuthority; otherwise NOT_AMENDED
It is looking like we need to document a clear and concise rule for this type of issue.
Something along the lines of: in general, when presented with data from which no assertion can be made due to empty or invalid values, the specifications for amendments assert that prerequisites are not met, rather than asserting not amended.
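To make that Expected Response concrete, a minimal sketch of the control flow; the lookup client, its exception, and its return values are hypothetical stand-ins for a real bdq:sourceAuthority service, not an agreed implementation:

```python
class ServiceUnavailable(Exception):
    """Hypothetical: raised by the source-authority client when the service cannot be reached."""

def amend_scientificname_from_taxonid(taxon_id, scientific_name, lookup):
    # lookup(taxon_id) is assumed to return a single scientificName string,
    # None if the identifier is not found in the bdq:sourceAuthority,
    # or a list of candidates if the identifier is ambiguous.
    if taxon_id is None or taxon_id.strip() == "":
        return ("INTERNAL_PREREQUISITES_NOT_MET", None)       # dwc:taxonID is EMPTY
    if scientific_name is not None and scientific_name.strip() != "":
        return ("INTERNAL_PREREQUISITES_NOT_MET", None)       # dwc:scientificName was not EMPTY
    try:
        match = lookup(taxon_id)
    except ServiceUnavailable:
        return ("EXTERNAL_PREREQUISITES_NOT_MET", None)       # bdq:sourceAuthority not available
    if match is None:
        return ("INTERNAL_PREREQUISITES_NOT_MET", None)       # not found in the bdq:sourceAuthority
    if isinstance(match, str):
        return ("FILLED_IN", {"dwc:scientificName": match})   # unambiguous interpretation
    return ("NOT_AMENDED", None)                               # e.g. multiple candidate matches
```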
Thanks @chicoreus. Your ER is more explicit, which is good.
We have one VALIDATION relating to the content of dwc:taxonID: #121, which seeks to detect a valid 'value'. Anything that doesn't pass muster results in "NOT_COMPLIANT".
So, the scenario for a NOT_COMPLIANT value from #121 would result here in an INTERNAL_PREREQUISITES_NOT_MET by your ER, not a NOT_AMENDED. Is this appropriate? It is rather subtle for me.
To summarise the issue, we have Expected Responses that take the form of
.....; INTERNAL_PREREQUISITES_NOT_MET if input is EMPTY or INVALID; COMPLIANT/AMENDED if input is 'OK'; otherwise ...
What had been bugging me on some tests were variants of
.....; INTERNAL_PREREQUISITES_NOT_MET if input is EMPTY or INVALID; COMPLIANT/AMENDED if input is VALID and 'OK'; otherwise ...
Talking with @ArthurChapman, we can see the utility of short-circuiting the 'test' for input invalidity. This gets back to the definition that Paul suggested above, which will need to go in a Preamble. Could I tweak that to cover VALIDATIONs as well, to something like
INTERNAL_PREREQUISITES_NOT_MET: When a test is presented with data from which no assertion can be made due to empty or invalid values, the specifications for validations or amendments assert that the prerequisites are not met, rather than asserting not compliant or not amended
?
Can we seek agreement and direction on the previous comment please? We may need to edit the vocabulary entry for INTERNAL_PREREQUISITES_NOT_MET.
I like the short-circuiting paradigm. To me it is simpler to think of it as "COMPLIANT/AMENDED if input is 'OK'", where OK includes being PRESENT and being VALID.
I agree with @tucotuco - however note the definition in #152. I suggest changing definition to:
A Response.status (q.v.) where values of Information Elements (q.v.) were insufficient to run the test due to EMPTY (q.v.) or Invalid (q.v.) values. If the test is run at a later time on unmodified data, it should produce the same Response.
Add to the comment: Note that the specification for validations or amendments assert that the prerequisites are not met, rather than asserting not compliant or not amended.
Also does this require us to change the definition of Invalid - currently
Where the Data Quality Dimension: Conformance (q.v.) is not satisfied due to Information Elements (q.v.) containing non-standard values, values outside of an acceptable range, or values unable to be found in a bdq:sourceAuthority (q.v.).
In the case of INTERNAL_PREREQUISITES_NOT_MET - we aren't proposing that we check against a bdq:sourceAuthority, are we? That sounds more like a validation for COMPLIANT/NOT_COMPLIANT. I suspect that we may be using INVALID in two different contexts: 1) in INTERNAL_PREREQUISITES_NOT_MET and 2) in checking for Compliance vis-à-vis a Source Authority.
I agree @ArthurChapman about "invalid" being used in some tests under the INTERNAL_PREREQUISITES_NOT_MET and under the COMPLIANT/AMENDED section. This was my point under #163 and elsewhere.
The conclusion seemed to be that we would be better to return INTERNAL_PREREQUISITES_NOT_MET for an invalid value. So I would agree with the change in definition and associated comment you proposed above for INTERNAL_PREREQUISITES_NOT_MET. In some cases, e.g., #76, we qualify with "invalid according to".
For the definition of "invalid", we will therefore need a change. Looking at the context it is used in the tests, maybe we can simplify it?
"Where the Data Quality Dimension: Conformance (q.v.) is not satisfied due to Information Elements (q.v.) containing non-standard values"
I like simple - done.
Do we have an inconsistency between what GBIF is quoting as taxonID and what the API responds to as a taxonID? For example, as above (April 18, 2021):
https://api.gbif.org/v1/species/8102122 works, but the response page quotes the taxonID as "gbif:8102122"?
If you search for https://api.gbif.org/v1/species/gbif:8102122, it doesn't work.
We need to be consistent in how we use what is under Parameter(s) and Source Authority. In this test we have a repeat, but we are inconsistent with what we have on other tests.
Field | Value |
---|---|
Parameter(s) | bdq:sourceAuthority default="GBIF Backbone Taxonomy" |
Source Authority | bdq:sourceAuthority default = "GBIF Backbone Taxonomy" [https://doi.org/10.15468/39omei], "API endpoint" [https://api.gbif.org/v1/species?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&name=] |
Should we have this instead
Field | Value |
---|---|
Parameter(s) | bdq:sourceAuthority |
Source Authority | bdq:sourceAuthority default = "GBIF Backbone Taxonomy" [https://doi.org/10.15468/39omei], "API endpoint" [https://api.gbif.org/v1/species?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&name=] |
@Tasilee A while back, we asked @timrobertson100 "We are wondering what GBIF considers to be the desirable form for the value of a dwc:taxonID that is referencing an entry in the GBIF backbone taxonomy." quoting part of his reply: "My gut feeling is a “scope:value” format (e.g. gbif:1234) is better than a URL, for the reason that URLs are generally less stable over time."
I'd thus be inclined to interpret taxonID="gbif:8102122" as meaning the scope is the GBIF backbone taxonomy, https://registry.gbif.org/dataset/d7dddbf4-2cf0-4f39-9b2a-bb099caae36c, currently accessible through the API endpoint https://api.gbif.org/v1/species/, such that https://api.gbif.org/v1/species/8102122 resolves. We could loosely treat gbif: as the namespace https://api.gbif.org/v1/species/, with the expectation that this will not be stable and will change over time, such that in future the API endpoint might be something like the hypothetical https://api.gbif.org/v2/taxon/. The corresponding API endpoint to query for names within the "gbif" scope is https://api.gbif.org/v1/species?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&name={taxon name search value}, but the API https://api.gbif.org/v1/species/{int} (per the documentation) doesn't take a datasetKey parameter.
@tucotuco draws a parallel with the pseudo-namespace epsg used with geodetic datum.
Updated examples and notes to reflect recommendation for the use of the pseudo-namespace gbif: for taxonID, no change needed to the specification.
Thanks @chicoreus. All: Please advise me of any implied changes to the test data on related issues.
Restructured Parameter(s) and Source authority
Changed test data positive and negative examples. BTW, all the test examples are generated from the test data.
Updated Notes - changed "VALIDATION_TAXONID_AMBIGUOUS" to "VALIDATION_TAXONID_UNAMBIGUOUS (4c09f127-737b-4686-82a0-7c8e30841590)"
The Notes reference VALIDATION_TAXONID_UNAMBIGUOUS (4c09f127-737b-4686-82a0-7c8e30841590), but 4c09f127-737b-4686-82a0-7c8e30841590 (#70) is VALIDATION_TAXON_UNAMBIGUOUS. There isn't a VALIDATION_TAXONID_UNAMBIGUOUS; the closest is #121 VALIDATION_TAXONID_COMPLETE (a82c7e3a-3a50-4438-906c-6d0fefa9e984), which has notes indicating we considered its predecessor, VALIDATION_TAXONID_AMBIGUOUS, too complex to implement.
Looking at the example
[dwc:taxonID="gbif:8102122", dwc:scientificName="": Response.status=FILLED_IN, Response.result=dwc:scientificName="Harpullia pendula F.Muell.", Response.comment="dwc:taxonID contains an interpretable value"]
Does that mean a term change proposal should be submitted for taxonID?
@ymgan likely, yes. We've got in the notes for this test: "The pseudo-namespace gbif: is recommended by GBIF for use in taxonID to reference GBIF taxon records." See also the comment https://github.com/tdwg/bdq/issues/71#issuecomment-1582449471
I have updated the notes in line with comment by @chicoreus, above.
Amended Source Authority values to align with @chicoreus syntax
From
bdq:sourceAuthority default = "GBIF Backbone Taxonomy" [https://doi.org/10.15468/39omei] | | | API endpoint [https://api.gbif.org/v1/species?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&name=]
to
bdq:sourceAuthority default = "GBIF Backbone Taxonomy" {[https://doi.org/10.15468/39omei]} {API endpoint [https://api.gbif.org/v1/species?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&name=]}
Hello, I am having difficulty working out how OBIS can align its data quality check for taxon, specifically with these two tests below:
Checks | Fields |
---|---|
Taxon should unambiguously match with WoRMS. | scientificName, scientificNameID |
From what I understood, the reason that scientificNameID is used instead of taxonID is that WoRMS lacks stable identifiers for taxon concepts.
scientificNameID is a mandatory field for OBIS. Bob also made a comment here about the lack of usage of taxonID in datasets that he worked with.
I guess my question is: would adding a test for scientificNameID make sense? Or does it make sense to have taxonID/scientificNameID as a parameter of these tests?
Any guidance would be very much appreciated, thanks a lot!!
@ymgan at the MCZ, and I think within the TG2 working group (coming out of a long history of discussions at NOMINA meetings), we've come to the opposite conclusion. I expect @tucotuco and others will want to comment as well, perhaps to correct my understanding. dwc:taxonID has the definition: "An identifier for the set of dwc:Taxon information. May be a global unique identifier or an identifier specific to the data set." In the layers of name strings, name string bins, nomenclatural acts, taxon concepts, and classifications worked out in various NOMINA meetings (heavily influenced by the thinking of the late Dave Remsen), dwc:taxonID is (I believe by design) vague. It is an identifier for the package of information associated with a Taxon class, without linking a particular meaning (name string, nomenclatural act, taxon concept, taxon concept including classification) to the instance of the Taxon class. The dwc:taxonID serves as the identifier for the set of information in the terms in a dwc:Taxon instance, without applying additional semantics to the dwc:Taxon instance.
On the other hand, the definition for dwc:scientificNameID is "An identifier for the nomenclatural (not taxonomic) details of a scientific name." That is explicitly pointing at an authoritative source of information on nomenclatural acts: nomenclators. There are very few of these. IPNI is one, IndexFungorum another, ZooBank another. They explicitly assign identifiers to nomenclatural acts. WoRMS is not one of these. It would not be appropriate to report an LSID from WoRMS as a scientificNameID. The LSID from WoRMS would appropriately go in the taxonID. See the examples given in the various Taxon ...ID terms in Darwin Core: scientificNameID lists only an IPNI LSID; others list multiple possibilities, including references to GBIF's backbone taxonomy.
More broadly, in designing tests around CORE uses, we considered one term (or a package of terms) within each of the TIME/SPACE/NAME concept areas to have primacy: for names, dwc:taxonID; for time, dwc:eventDate; for space, dwc:decimalLatitude + dwc:decimalLongitude + dwc:geodeticDatum + dwc:coordinateUncertaintyInMeters + dwc:coordinatePrecision; with other terms in each area providing alternative representations (often able to represent only less complete information, as in dwc:year, dwc:month, dwc:day) or providing supplemental metadata (as in dwc:georeferenceProtocol). For the Taxon class terms, we deliberately chose dwc:taxonID as the term with primacy, and these two tests reflect this: one that can fill in an empty taxonID from other taxon terms (#57), and this one, which uses the taxonID to fill in other terms.
That having been said, and having just integrated the sci_nameqc implementation of the NAME tests into MCZbase, I suspect we've got some more work to do around "OBIS uses ... WoRMS for the ... taxon classification". In MCZbase, we are currently only using WoRMS or IRMNG LSIDs for taxonID, and only ZooBank identifiers for scientificNameID (though we may start using GBIF gbif:{integer} identifiers to link to GBIF backbone taxonomy records). The assumptions around the NAME tests are that a data source will use a single authority. Within MCZbase data, or when WoRMS data are aggregated with other data, subsets of the data will rely on different authorities for slices of the data, and the test specifications assume that to have quality, data must be conformed to a single authority. For use within OBIS, this test #71 is straightforward: you simply specify WoRMS as the bdq:sourceAuthority instead of GBIF Backbone Taxonomy. Similarly for #81, #22, the VALIDATION{higherrank}_FOUND tests, just specify WoRMS as the authority to check against. But #70 VALIDATION_TAXON_UNAMBIGUOUS poses more of a challenge. You may be aggregating data that uses WoRMS identifiers in some cases, IPNI identifiers in others, IRMNG identifiers in others, GBIF identifiers in others; similarly for other aggregators. #70 may not adequately address multiple reliable sources of authority within a data set.
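A sketch of how that parameterization might look in practice, reusing the amend_scientificname_from_taxonid sketch from earlier in this thread; the lookup functions are placeholders, not real clients:

```python
def lookup_in_gbif_backbone(taxon_id):
    """Placeholder for a GBIF Backbone Taxonomy lookup (e.g. via https://api.gbif.org/v1/species/)."""
    raise NotImplementedError

def lookup_in_worms(taxon_id):
    """Placeholder for a WoRMS lookup keyed on a WoRMS LSID / AphiaID."""
    raise NotImplementedError

SOURCE_AUTHORITIES = {
    "GBIF Backbone Taxonomy": lookup_in_gbif_backbone,  # the default for this test
    "WoRMS": lookup_in_worms,                            # what an OBIS deployment would specify
}

def run_amendment(record, source_authority="GBIF Backbone Taxonomy"):
    # Swap the source-authority client according to the bdq:sourceAuthority parameter.
    lookup = SOURCE_AUTHORITIES[source_authority]
    return amend_scientificname_from_taxonid(
        record.get("dwc:taxonID"), record.get("dwc:scientificName"), lookup
    )
```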
s/gbif:/https:\/\/api.gbif.org\/v1\/species\//
will transform the value taxonID=gbif:8102122 to the resolvable endpoint https://api.gbif.org/v1/species/8102122. The pseudo-namespace "gbif:" is recommended by GBIF to reference GBIF taxon records. Where resolvable persistent identifiers exist for dwc:scientificNameID values, they should be used in full, but implementors will need to support at least the "gbif:" pseudo-namespace.
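A slightly fuller sketch of the same expansion and the lookup it enables, assuming the current v1 species endpoint and the JSON field it returns; the helper name is illustrative only:

```python
import json
import re
import urllib.request

GBIF_SPECIES_ENDPOINT = "https://api.gbif.org/v1/species/"  # expected to change over time, as noted above

def resolve_gbif_taxon_id(taxon_id):
    """Expand the gbif: pseudo-namespace and return the scientificName from the GBIF record.
    Minimal sketch: no error handling, caching, or ambiguity checks."""
    if not re.fullmatch(r"gbif:\d+", taxon_id or ""):
        return None  # not in the gbif:{integer} form discussed above
    url = taxon_id.replace("gbif:", GBIF_SPECIES_ENDPOINT, 1)
    with urllib.request.urlopen(url) as response:
        record = json.load(response)
    return record.get("scientificName")

# resolve_gbif_taxon_id("gbif:8102122") would be expected to return
# "Harpullia pendula F.Muell." per the example earlier in this issue.
```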