tdwg / bdq

Biodiversity Data Quality (BDQ) Interest Group
https://github.com/tdwg/bdq

TG2-AMENDMENT_POLYNOMIAL_STANDARDIZED #45

Closed iDigBioBot closed 4 years ago

iDigBioBot commented 6 years ago
| TestField | Value |
|---|---|
| GUID | 8ab38bee-323c-4926-a7e9-c0417cd3b14d |
| Label | AMENDMENT_POLYNOMIAL_STANDARDIZED |
| Description | Amend the scientific name to correct typographical errors and misspellings according to a specified source authority. |
| TestType | Amendment |
| Darwin Core Class | Taxon |
| Information Elements ActedUpon | dwc:scientificName, dwc:genericName, dwc:specificEpithet, dwc:infraspecificEpithet, dwc:scientificNameAuthorship, dwc:yearOfPublication |
| Information Elements Consulted | |
| Expected Response | EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available; INTERNAL_PREREQUISITES_NOT_MET if dwc:scientificName is bdq:Empty; AMENDED (dwc:scientificName, genericName, specificEpithet, infraspecificEpithet, scientificNameAuthorship, yearOfPublication) if typographical errors and misspellings represented in dwc:scientificName have been unambiguously interpreted in the bdq:sourceAuthority; otherwise NOT_CHANGED |
| Data Quality Dimension | Conformance |
| Term-Actions | POLYNOMIAL_STANDARDIZED |
| Parameter(s) | bdq:sourceAuthority |
| Source Authority | bdq:sourceAuthority default = "GBIF Backbone Taxonomy" {[https://doi.org/10.15468/39omei]} {API endpoint [https://api.gbif.org/v1/species?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&name=]} |
| Specification Last Updated | 2024-04-16 |
| Examples | [dwc:scientificName="Acacia longifloia": Response.status=AMENDED, Response.result=dwc:scientificName="Acacia longifolia", Response.comment="dwc:scientificName contains an interpretable value in the bdq:sourceAuthority"] [dwc:scientificName="Acacia camptophylla": Response.status=NOT_AMENDED, Response.result="", Response.comment="dwc:scientificName does not contain an interpretable value as there are a number of options in the bdq:sourceAuthority"] |
| Source | Tania Laity |
| References | |
| Example Implementations (Mechanisms) | |
| Link to Specification Source Code | |
| Notes | [bdq:sourceAuthority default = GBIF Backbone Taxonomy]. (Currently found at: https://www.gbif.org/en/developer/species). The purpose of this Amendment is to correct errors in spelling and typography only. It is not intended to make changes of a taxonomic nature or to deal with errors or inconsistencies in the format of the Authorship. |
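
The order of the Expected Response clauses above can be read as a small decision procedure. The following is only a minimal sketch of that reading, not an implementation from the BDQ project: `amendment_response`, `Lookup`, and `stub_lookup` are hypothetical names, the source authority is replaced by an in-memory stub, and the two candidate strings returned for "Acacia camptophylla" are placeholders rather than real GBIF results. The sketch tests for an empty value before calling the authority (so no request is made for nothing) and uses the NOT_CHANGED label from the Expected Response text.

```python
from typing import Callable, List, Optional

# Hypothetical type: a lookup returns the corrected name strings proposed by
# the source authority, or None when the authority cannot be reached.
Lookup = Callable[[str], Optional[List[str]]]


def amendment_response(scientific_name: Optional[str], lookup: Lookup) -> dict:
    """Apply the Expected Response clauses to one dwc:scientificName value."""
    # Assumption: the empty check runs first so no request is sent for nothing.
    if scientific_name is None or not scientific_name.strip():
        return {"status": "INTERNAL_PREREQUISITES_NOT_MET", "result": {}}
    candidates = lookup(scientific_name)
    if candidates is None:
        return {"status": "EXTERNAL_PREREQUISITES_NOT_MET", "result": {}}
    # An exact hit means there is nothing to correct.
    if scientific_name in candidates:
        return {"status": "NOT_CHANGED", "result": {}}
    distinct = sorted(set(candidates))
    # Amend only when exactly one correction is on offer.
    if len(distinct) == 1:
        return {"status": "AMENDED", "result": {"dwc:scientificName": distinct[0]}}
    return {"status": "NOT_CHANGED", "result": {}}


def stub_lookup(name: str) -> Optional[List[str]]:
    """Stand-in for the source authority; the replies are illustrative only."""
    replies = {
        "Acacia longifloia": ["Acacia longifolia"],
        # Placeholder strings standing in for "a number of options".
        "Acacia camptophylla": ["placeholder option 1", "placeholder option 2"],
    }
    return replies.get(name, [])


if __name__ == "__main__":
    print(amendment_response("Acacia longifloia", stub_lookup))
    print(amendment_response("Acacia camptophylla", stub_lookup))
```

Passing the lookup in as a function keeps the status logic testable without network access; a real implementation would still have to decide how the authority's reply is turned into candidate strings.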
iDigBioBot commented 6 years ago

Comment by Paul Morris (@chicoreus) migrated from spreadsheet: The ability to assert a correction to a scientific name string is almost always restricted to proposed corrections to the authorship portion of the string. It is much more effective to supply a link to a taxonID found in a nomenclator or taxonomic authority when an unambiguous match can be found than to attempt to alter the string value found in scientificName. An amendment affecting dwc:scientificNameAuthorship, on the other hand, is highly valuable, as authorship strings tend to be highly variable in construction.

chicoreus commented 4 years ago

See also #46, which seems to be paired with this test and to have the same issues (should this be AMENDMENT_SCIENTIFICNAME_STANDARDIZED?). See also #101, which does seem to be a legitimate "polynomial" test.

ArthurChapman commented 4 years ago

I have changed the wording of the Notes

FROM: This test is not intended to make alterations of a taxonomic nature. The intent of this test is not to fix errors or inconsistencies in the format of the dwc:scientificNameAuthorship. For the purpose of this amendment, if the genus in the dwc:genus field does not match the genus of the polynomial, the genus of the polynomial takes precedence for standardization.

TO: The purpose of this Amendment is to correct errors in spelling and typography only. It is not intended to make changes of a taxonomic nature or to deal with errors or inconsistencies in the format of the Authorship. For the purpose of this amendment, if the genus in the dwc:genus field does not match the genus of the polynomial, the genus of the polynomial takes precedence for standardization.

chicoreus commented 4 years ago

@ArthurChapman improvement in expressing an intent, though a problematic one. Also, "Polynomial" is still problematic: there is no dwc:polynomial. dwc:scientificName can contain either a uninomial or a polynomial, depending on the rank of the identification. A polynomial can be built from dwc:genus plus dwc:specificEpithet plus dwc:infraspecificEpithet if dwc:specificEpithet is populated (with danger, as Darwin Core defines genus as the current classification of the scientific name, not the generic part of the dwc:scientificName). But the specification is silent about what is meant by polynomial in the notes, and it does not appear to include a need for terms other than dwc:scientificName, with, according to the notes, some unspecified magic removing the authorship from consideration in that value.

The specification is currently silent on authorship, so an implementor's presumption would be that the entire value found in dwc:scientificName is to be compared with the best match in the specified source authority. If there is a desire not to include authorship, then there must be an unambiguous specification of how this is to be done: either with a (defined) parser, or by removing the value found in dwc:scientificNameAuthorship from the end of the value found in dwc:scientificName, or by using a defined beginning-of-string-only matching method on the source authority side.

As currently phrased, the notes still represent magical thinking about the ability to detect which part of dwc:scientificName is the authorship and which parts are not, for the wide range of names of all ranks, hybrids, and complex authorship strings under each of the codes, including the presence of initial capital letters in specific, subspecific, and infraspecific epithets in historical names, authorship strings embedded within name strings for hybrids, trinomials, and quadrinomials, and all sorts of interesting common cases.
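
To make the second of those options concrete, here is a rough sketch of removing the dwc:scientificNameAuthorship value from the end of dwc:scientificName. The function name `strip_trailing_authorship` is hypothetical, and the names "Aus bus" and "cus" with authors "Smith" and "Jones" are invented placeholders. It assumes a plain whitespace-normalised suffix comparison and nothing more.

```python
from typing import Optional


def strip_trailing_authorship(scientific_name: str, authorship: str) -> Optional[str]:
    """Return the name with the authorship removed from its end.

    Returns None when the authorship is empty, is not a suffix of the name,
    or only matches in the middle of a word.
    """
    name = " ".join(scientific_name.split())   # collapse runs of whitespace
    author = " ".join(authorship.split())
    if not author or not name.endswith(author):
        return None
    head = name[: len(name) - len(author)]
    # Require a word boundary before the authorship; this also rejects the
    # case where the whole value is nothing but the authorship.
    if not head.endswith(" "):
        return None
    return head.rstrip()


if __name__ == "__main__":
    print(strip_trailing_authorship("Aus bus Smith, 1900", "Smith, 1900"))
    # Only the final authorship is removed; the embedded one is left in place.
    print(strip_trailing_authorship("Aus bus Smith var. cus Jones", "Jones"))
```

The second call shows the limitation raised in the comment: with an authorship string embedded inside a trinomial, suffix removal leaves "Smith" in the name, so this approach alone does not isolate the name part for every case.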

Tasilee commented 4 years ago

After a fun discussion with @ArthurChapman, I think this boils down to how I responded to @chicoreus via email: POLYNOMIAL entails parsing on our end, but we assume parsing within the bdq:sourceAuthority as in the case of #57, don't we? My feeling is we remove #46 and #45 because @chicoreus informs us it is complex?

My point is we throw whatever is in dwc:scientificName at bdq:sourceAuthority with #57.

ArthurChapman commented 4 years ago

The original idea for Tests #45 and #46 was to fix minor spelling errors in the names (e.g. smithi versus smithii, litoralis versus littoralis). This is something that CRIA does very well with its tests. There were other tests that involve the Taxon, TaxonID, and Scientific Name (+others). If we included Authorship and rank (var., ssp.) in these tests, then we would basically be making these tests a duplication of other tests we already have (i.e. those dealing with combinations of TAXONID, TAXON and SCIENTIFICNAME). Given that, and the difficulty that @chicoreus mentions with parsing out the polynomial components from dwc:scientificName, I see little value in continuing with these two tests (#45 and #46). I thus suggest that we simplify the process and change these two tests to SUPPLEMENTARY.

chicoreus commented 4 years ago

An alternative to moving this test to supplementary would be to specify an explicit means of handling the authorship in this test, for example:

change name from amendment polynomial standardized to amendment namestring standardized.

information elements: dwc:scientificName, dwc:scientificNameAuthorship

specification: EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority service was not available; INTERNAL_PREREQUISITES_NOT_MET if either dwc:scientificName or dwc:scientificNameAuthorship is EMPTY; AMENDED if the text string represented in dwc:scientificName, with the text string present in dwc:scientificNameAuthorship removed from its end, is not a match for a scientific name string in the bdq:sourceAuthority and can be unambiguously corrected to a known scientific name string consisting of the same number of words (here we could specify a maximum string distance for transformation) according to the bdq:sourceAuthority; otherwise NOT_CHANGED

A similar test with consideration of authorship could be included as supplemental.

In the notes, note that #70 identifies whether the specified source authority has an unambiguous single record for the taxon, including the higher classification and authorship string, that #101 identifies inconsistencies between the scientific name and the atomic fields, and that #57 is the key amendment to propose a taxon id given the textual terms, including authorship.
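
The matching step in the proposed specification (same number of words, a maximum string distance, and a single unambiguous candidate) might look roughly like the sketch below. This is an illustration under stated assumptions, not an agreed method: Levenshtein distance is swapped in as the string distance, the threshold of 2 is arbitrary, `propose_correction` and `edit_distance` are hypothetical names, the source authority is reduced to an in-memory list, and the "Aus ..." names are invented placeholders. The input is taken to be the name with authorship already removed.

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def propose_correction(name: str, known_names: Iterable[str],
                       max_distance: int = 2) -> Optional[str]:
    """Return the single known name close enough to `name`, else None."""
    known = set(known_names)
    if name in known:
        return None  # already a match, nothing to amend
    words = len(name.split())
    close = [
        k for k in known
        if len(k.split()) == words and edit_distance(name, k) <= max_distance
    ]
    # More than one candidate is ambiguous, so no amendment is proposed.
    return close[0] if len(close) == 1 else None


if __name__ == "__main__":
    authority = ["Acacia longifolia", "Aus smithii", "Aus bus", "Aus cus"]
    print(propose_correction("Acacia longifloia", authority))
    print(propose_correction("Aus smithi", authority))
    print(propose_correction("Aus dus", authority))
```

With this placeholder list, "Aus dus" is within distance 1 of both "Aus bus" and "Aus cus", so nothing is proposed. That is the behaviour the "unambiguously corrected" wording asks for, and it also shows how quickly short epithets become ambiguous.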

ArthurChapman commented 4 years ago

That might work @chicoreus - it still has the problem of rank (i.e. straight trinomial, trinomial with var., ssp., subsp., forma, f., etc.)

Tasilee commented 4 years ago

I tend to agree with @ArthurChapman. Once we open the Pandora's Box of parsing dwc:scientificName, don't we need specific rules based upon a vocabulary that can assure us of a high probability of success? Flagging a potential issue as in the VALIDATION #46 is an equal challenge, but a safer test than this AMENDMENT.

We also have the following tests that seem to me to have similar problems (as noted by @chicoreus):

101: "COMPLIANT if the polynomial, as represented in dwc:scientificName, is consistent with the atomic parts dwc:genus, dwc:specificEpithet, dwc:infraspecificEpithet;..."

46: "COMPLIANT if there are no nomenclatural errors (e.g. typographical errors and misspellings) of a polynomial, as represented in dwc:scientificName according to the bdq:sourceAuthority service; ..."

70: " COMPLIANT if the combination of values of dwc:Taxon terms (dwc:scientificName, dwc:scientificNameAuthorship, dwc:subgenus, dwc:genus, dwc:family, dwc:order, dwc:class, dwc:phylum, dwc:kingdom, dwc:taxonRank) can be unambiguously resolved by the specified source authority service; ..."

57: "AMENDED if a value for dwc:taxonID is unique and resolvable on the basis of the value of the lowest ranking not EMPTY taxon classification terms dwc:scientificName, dwc:scientificNameAuthorship, dwc:kingdom, dwc:phylum, dwc:class, etc.; ..." (and I will change "etc" as this doesn't look good).

My inclination is to mirror "GENUS_NOTFOUND", "FAMILY_NOTFOUND", "ORDER_NOTFOUND", "CLASS_NOTFOUND", "KINGDOM_NOTFOUND" with "(VALIDATION)_SCIENTIFICNAME_NOTFOUND" by sending whatever is in dwc:scientificName to the bdq:sourceAuthority, and not to have an equivalent amendment. I understand that a) it depends on the smarts of the bdq:sourceAuthority (which have to increase quickly) and b) we accept that we may get many false positives. But one of the criteria for accepting a high number of false positives is that the test highlights a significant issue. I'd still get rid of #46 and #45.
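
As a rough illustration of "send whatever is in dwc:scientificName to the bdq:sourceAuthority", the sketch below builds a query against the API endpoint listed under Source Authority in the specification and maps an already-decoded reply to a validation result. It makes no network call. The function names are hypothetical, and the assumption that matching records arrive in a list under a `results` key is mine and would need checking against the GBIF API documentation.

```python
from typing import Optional
from urllib.parse import quote

# Endpoint copied from the Source Authority entry of the specification.
GBIF_ENDPOINT = (
    "https://api.gbif.org/v1/species"
    "?datasetKey=d7dddbf4-2cf0-4f39-9b2a-bb099caae36c&name="
)


def build_query(scientific_name: str) -> str:
    """Append the unparsed, percent-encoded name to the endpoint."""
    return GBIF_ENDPOINT + quote(scientific_name.strip())


def scientificname_notfound(scientific_name: Optional[str],
                            payload: Optional[dict]) -> str:
    """Hypothetical validation outcome.

    `payload` is the decoded JSON reply, or None when the source authority
    could not be reached.
    """
    if scientific_name is None or not scientific_name.strip():
        return "INTERNAL_PREREQUISITES_NOT_MET"
    if payload is None:
        return "EXTERNAL_PREREQUISITES_NOT_MET"
    # Assumption: matching records are listed under the key "results".
    return "COMPLIANT" if payload.get("results") else "NOT_COMPLIANT"


if __name__ == "__main__":
    print(build_query("Acacia longifolia"))
    print(scientificname_notfound("Acacia longifolia", {"results": []}))
```

No parsing happens on the caller's side here; whether a misspelled or authorship-bearing string is found depends entirely on how the source authority matches names, which is the dependency noted in point a).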

tucotuco commented 4 years ago

I am in accord with the conclusions of Lee's final paragraph.

Tasilee commented 4 years ago

In agreement with the quorum from the email responses on July 15, 2020, this amendment was considered too difficult to implement with confidence, for the present.

chicoreus commented 7 months ago

From the discussion, this is still immature and needs substantive further consideration. Removing from supplementary and tagging as immature.

Updated the markdown to reflect current practice, added a source authority in current form.

Since this was written, dwc:genericName has come into use, so replacing dwc:genus (the classification term) with dwc:genericName (the atomic generic part of the scientific name).

Additional terms (dwc:subgenus, dwc:infragenericEpithet, dwc:cultivarEpithet) might be appropriate to include as information elements acted upon.

One point for further consideration is whether this test should operate on just dwc:scientificName, or on that term and all the atomic component terms (dwc:genericName, dwc:specificEpithet, etc.). This test might also consider dwc:scientificNameID as an information element consulted. Substantial thought and testing are needed to bring this test to maturity.

ArthurChapman commented 7 months ago

@chicoreus - you missed adding "a source authority in current form."

chicoreus commented 7 months ago

@ArthurChapman fixed.

ArthurChapman commented 7 months ago

Examples edited to conform with current practice of providing both a pass and fail example.

Tasilee commented 7 months ago

Aligned parameters to current template

Tasilee commented 6 months ago

Fixed typos/errors in specifications to align with current template

Tasilee commented 5 months ago

Standardized reference to "EXTERNAL_PREREQUISITES_NOT_MET if the bdq:sourceAuthority is not available" in Expected Response.