tdwg / tnc

Taxonomic Names and Concepts Interest Group
Should we have parsed authorship properties? #24

Closed nielsklazenga closed 5 years ago

nielsklazenga commented 5 years ago

In the meeting of 4 December (#20) we decided on not having parsed authorship properties. In a comment on the meeting notes @mdoering makes a case in favour of having them:

There are various levels of parsing. Parsing the entire authorship into individual authors is harder, but I believe we should at least keep basionym/bracket authorship separate from the combination authorship. And probably also have the year parsed out for each of them because these are things one often needs to compare or process. But even splitting author teams into lists of individual authors is something that people deal with (both the GNA and GBIF name parsers do it), and I would very much welcome a standard that allows such parsed data to be exchanged.

Warrants more discussion...

mdoering commented 5 years ago

Both the GNA and GBIF name parsers are great libraries that take a single name string apart into its pieces. But due to the many exceptions and varieties of name strings out there they will never be 100% perfect, and sometimes human corrections are needed. Being able to exchange these broken-down names is therefore important. It would even allow a better standardisation of name parser APIs.

Example response from GBIF: http://api.col.plus/parser/name?name=Stagonospora+polyspora+M.T.+Lucas+%26+Sousa+da+C%C3%A2mara+1934

And GNA: http://parser.globalnames.org/?q=Stagonospora+polyspora+M.T.+Lucas+%26+Sousa+da+C%C3%A2mara+1934

mdoering commented 5 years ago

Being able to compare names is an important use case. Comparing the authorship of names is the hardest part, and here it is essential to break down the authorship.
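To illustrate why broken-down authorship makes comparison easier, here is a minimal sketch. The `AuthorTeam` and `ParsedAuthorship` classes are hypothetical; their field names loosely follow the parser outputs quoted in this thread and are not proposed terms. The comparison rule (ignore diacritics, punctuation, spacing and case; ignore a year when one side lacks it) is only one possible choice.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import unicodedata


@dataclass
class AuthorTeam:
    authors: List[str] = field(default_factory=list)
    year: Optional[str] = None


@dataclass
class ParsedAuthorship:
    # Hypothetical container, not a TCS term definition.
    combination: AuthorTeam = field(default_factory=AuthorTeam)
    basionym: AuthorTeam = field(default_factory=AuthorTeam)


def _norm(author: str) -> str:
    # Drop diacritics, punctuation and whitespace, then lower-case.
    decomposed = unicodedata.normalize("NFKD", author)
    return "".join(ch for ch in decomposed if ch.isalnum()).lower()


def same_team(a: AuthorTeam, b: AuthorTeam) -> bool:
    if [_norm(x) for x in a.authors] != [_norm(x) for x in b.authors]:
        return False
    # Years only conflict when both sides supply one.
    return a.year is None or b.year is None or a.year == b.year


def same_authorship(a: ParsedAuthorship, b: ParsedAuthorship) -> bool:
    return (same_team(a.combination, b.combination)
            and same_team(a.basionym, b.basionym))
```

With the pieces separated, "Müll.Hal." and "Müll. Hal." compare as equal, while a name that moves an author team from the brackets to the combination position does not.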

deepreef commented 5 years ago

To me, it has never been a question that we do definitely need parsed authorships (we clearly do). The question has been where in the standards these parsed authorships belong. Specifically, are they properties of names, or are they properties of References? If the emphasis is on capturing and processing dirty string data, then it's probably helpful to have some ability to capture and parse authors at the name level. If the emphasis is on capturing and expressing clean data, then the authorships should be derived from usages linked to References (i.e., the authors are linked to References as Agent-objects, and are derived for purposes of formatting labels of Usages and of Names).

I assume we need to accommodate both, so I think it's more a question of workflow. What I mean by that is that we have two extreme end-points:

1) Raw text string representing a taxon name of some sort, that may include misspellings, qualifiers, abbreviations, authors, year, etc.

2) Fully normalized and cleaned data structure involving fleshed-out usages linked to Protonyms and References, with References linked to Agents (which themselves may have multiple names), from which properly-formatted labels for names and usages are very easily derived.

Number 1 represents the majority of what we have. Number 2 represents where all the data will be in an ideal world. And our challenge is to figure out what the intermediate steps are between 1 and 2 that allow us the best opportunity to improve the utility of the data at various steps along the way, while simultaneously allowing all the data to shift towards Number 2.

In that context, I think we can safely assume that usage instances should inherit parsed authorship information from linked References (where we already KNOW we need to deal with parsed authorships). Thus, I don't think we need to worry about accommodating parsed authorships directly for usages. The question is whether there is utility in accommodating parsed authorships directly associated with names instances, as one of those intermediate steps to get us from Number 1 to Number 2 in an effective way.

My sense is that the answer to this question is "probably yes".

nielsklazenga commented 5 years ago

Just linking this to #6.

Apart from splitting up the nomenclatural authorship into author teams, at least in botany we've got canonical author abbreviations. This is domain-specific, so should probably be a property in our standard, even if it is only ever used in a foaf:Person (for example), just so that it is available.

baskaufs commented 5 years ago

I think this might be the right thread to make a comment as a followup to the meeting this week. The meeting seemed to me to be focused on an attempt to define terms based on theoretical ideas about what terms should mean. Although as a non-taxonomist I was left feeling like I had little to contribute to the discussion, I ended up with the feeling that we are to some extent going about this backwards. The more experience I have with vocabulary development, the more convinced I am that the correct approach is not to try to decide "what a term means", but rather to decide "what we want a term to do". That is, the purpose of minting a term is not to nail down some philosophical ideas about how we think about the terms, but rather to create terms that will satisfy the use cases that have been determined to be important.

That sentiment is captured in Rich's comment earlier in the thread, where he talks about the extreme end-points of the workflow. On the one end, we have dirty data that comes straight from people's spreadsheets. On the other end, we have curated, clean data that results from hard work done by people or algorithms designed to transform those dirty data into a form that is most useful for searching, building applications, and reasoning out relationships. If we are only seeking to facilitate one end of the spectrum or the other, then we end up with very different kinds of vocabularies. However, if we are trying to develop a vocabulary that enables the process of moving from one end of the spectrum to the other, then we need to broaden the list of terms such that they can include both the dirty and clean data.

When we developed the Darwin Core RDF Guide, we minted the dwciri: terms for the specific purpose of enabling nerdy RDF features. But since then, I've come to understand that the approach we took actually makes it possible to handle the dirty/clean situation regardless of whether the technology was RDF or something more conventional.

Essentially, for each kind of entity that we care about, we should probably have two terms, one for the dirty string data (which I'm going to call the verbatim term), and one for a URI representing the cleaned data (or some alternative globally unique identifier like a UUID, although for the purposes of at least keeping the future possibility of Linked Data, a URI would probably be better). Associated with the URI term are one to many additional fields that either contain cleaned string data or links to controlled value terms or entities elsewhere (such as DOIs or ORCID IDs).

When a dirty dataset comes in, all of the verbatim fields get populated with whatever is there. When the dataset is processed, parsed, cleaned up, etc. both the URI field and a set of fields associated with the URI are populated with clean data. A subsequent user would then be able to know whether cleaning had happened on the data simply by looking to see if the URI field was populated or not. Because it's possible to make additional statements about URI-identified resources, the "clean" data fields associated with the URI can also include information such as who did the cleaning, when, and using what methodology. Once the cleaning was done, it wouldn't need to be done again unless someone at a later time determined that there was something wrong or incomplete about the processing. In that case, one could generate one or more additional records documenting the processing that was done and the fields that resulted from that processing.
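A minimal sketch of that workflow, under stated assumptions: `NameRecord`, `attach_cleaning`, the example URI, the agent name and the date are all hypothetical and only illustrate the verbatim/URI pairing described above.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class NameRecord:
    verbatim: str                     # dirty string exactly as supplied
    uri: Optional[str] = None         # populated only once cleaning has happened
    clean: Dict[str, str] = field(default_factory=dict)
    provenance: Dict[str, str] = field(default_factory=dict)

    @property
    def is_cleaned(self) -> bool:
        # A populated URI signals that processing has taken place.
        return self.uri is not None


def attach_cleaning(record: NameRecord, uri: str, clean: Dict[str, str],
                    agent: str, method: str, when: str) -> NameRecord:
    # Return a new record; the verbatim string is carried over unchanged.
    return replace(record, uri=uri, clean=dict(clean),
                   provenance={"agent": agent, "method": method, "when": when})


raw = NameRecord("Dicranoloma braunii (Bosch & Sande Lac.) Paris")
done = attach_cleaning(
    raw,
    uri="https://example.org/name/1",           # hypothetical identifier
    clean={"genus": "Dicranoloma", "specificEpithet": "braunii"},
    agent="example-curator",                    # hypothetical
    method="name parser plus manual check",     # hypothetical
    when="2019-12-11",                          # hypothetical
)
```

Because `attach_cleaning` returns a new record rather than modifying the old one, a later re-processing step can produce a further record without losing the earlier result.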

I could go on, but I'll leave it at that. In order to make decisions about the exact terms needed and what their values should be populated with, we need to create a model that shows how the terms will be related to each other (e.g. what relationships are one-to-many and which are one-to-one). That model could be expressed as either an ER diagram for a relational database, or a "bubble" diagram representing a graph - the two are mostly interconvertible. Comparison of that developing model with the use cases that need to be satisfied will take us down the road to final decisions about term definitions and restrictions on the values of those terms. Also during that process, we can think about how the model would actually be implemented in ways that we care about, such as Darwin Core archives, relational databases, and potentially Linked Data.

deepreef commented 5 years ago

Many thanks, @baskaufs - you captured what I was thinking much more effectively than I was able to, and your post helps me clarify my own thinking in my mind.

I guess what I was searching for was logical "nodes" or "waypoints" between the two end points that would be effective as intermediate steps, and perhaps identifying those steps will help us better define vocabularies that address both ends of the spectrum as well as the key stages in-between. Perhaps our focus should be on defining what those stages might look like.

That said, I FULLY agree that the two key terms needed for each object are a verbatim term and an identifier term. I have absolutely no qualms about using URIs for the latter; but I strongly recommend that such URIs be formed with an embedded UUID as the "identifier" part, prefixed with appropriate dereferencing metadata to convert it into a proper URI. I imagine it's outside the scope of the TCS 2.0 standard to make such recommendations; but I fully intend to remain on that soapbox in any case.
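As a rough sketch of the "UUID embedded in a URI" pattern, assuming a hypothetical resolver prefix (`https://example.org/names/` is a placeholder, not a real service):

```python
import uuid
from typing import Optional

BASE = "https://example.org/names/"  # hypothetical dereferencing prefix


def mint_uri(identifier: Optional[uuid.UUID] = None) -> str:
    # Use the supplied UUID, or generate a random one if none is given.
    if identifier is None:
        identifier = uuid.uuid4()
    return BASE + str(identifier)


def extract_uuid(uri: str) -> uuid.UUID:
    # The UUID is the final path segment, so it survives a change of prefix.
    return uuid.UUID(uri.rsplit("/", 1)[-1])
```

The point of the design is that the UUID part stays stable even if the prefix has to change later.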

One caveat, though: not all the "clean" fields are derivable from the verbatim string. We can of course parse the bits using the best algorithms available, and even trust them; but there are many properties we probably want to associate with the clean records that are not embedded within the verbatim string.

Also, I imagine the normal/common workflow for the processing/parsing/cleaning/fleshing-out step to be iterative. That is, bits of "clean" missing data may trickle in over time. Thus, I'm not sure there are only going to be two states (i.e., zero clean data denoted by no URI, vs. fully cleaned data denoted by a URI). But at least the presence of a URI would indicate that the first step of processing/parsing/cleaning/fleshing-out had begun. If I understand you correctly, you're talking about capturing versioning through multiple instances; which implies that the URIs are identifiers for the data record, rather than the conceptual object. That's fine, as long as it's clear. I still believe we need persistent identifiers for the conceptual objects as well, but I assume those could simply be captured among the various "clean" data properties?

In any case, I think this is the right approach to discussing how to move forward.

nielsklazenga commented 5 years ago

@baskaufs Are you suggesting something like this (analogous to Audubon Core)?

(@deepreef I don't know how the 'basionym' bit sits with zoologists. This is just a straw man that I dreamt up (largely stole from the NSL actually). We should be able to come up with something everybody can live with).

The non-literal properties could have a foaf:Agent (foaf:Person or foaf:Group) as object/range.

The discussion above is relevant to issue #6, so just linking that in.

nielsklazenga commented 5 years ago

I had a whole spiel with examples about why there might be a source of confusion between botanists and zoologists, but then I realised I was only confusing myself. I still think some examples would be helpful, so I ran some names through the GNParser (the GBIF Name Parser doesn't seem to parse authors):

Dicranoloma braunii (Bosch & Sande Lac.) Paris

{
  "quality": 1,
  "parsed": true,
  "verbatim": "Dicranoloma braunii (Bosch & Sande Lac.) Paris",
  "surrogate": false,
  "normalized": "Dicranoloma braunii (Bosch & Sande Lac.) Paris",
  "canonicalName": {
    "value": "Dicranoloma braunii",
    "valueRanked": "Dicranoloma braunii"
  },
  "virus": false,
  "positions": [["genus", 0, 11], ["specificEpithet", 12, 19], ["authorWord", 21, 26], ["authorWord", 29, 34], ["authorWord", 35, 39], ["authorWord", 41, 46]],
  "nameStringId": "8fbfc7a2-8565-5cd7-abe7-2b1d8caa3e7b",
  "parserVersion": "1.0.2",
  "hybrid": false,
  "details": [{
    "genus": {
      "value": "Dicranoloma"
    },
    "specificEpithet": {
      "value": "braunii",
      "authorship": {
        "value": "(Bosch & Sande Lac.) Paris",
        "basionymAuthorship": {
          "authors": ["Bosch", "Sande Lac."]
        },
        "combinationAuthorship": {
          "authors": ["Paris"]
        }
      }
    }
  }],
  "bacteria": false
}

Dicranum braunii Müll.Hal. ex Bosch & Sande Lac.

{
  "quality": 2,
  "parsed": true,
  "verbatim": "Dicranum braunii Müll.Hal. ex Bosch & Sande Lac.",
  "surrogate": false,
  "qualityWarnings": [[2, "Ex authors are not required"]],
  "normalized": "Dicranum braunii Müll. Hal. ex Bosch & Sande Lac.",
  "canonicalName": {
    "value": "Dicranum braunii",
    "valueRanked": "Dicranum braunii"
  },
  "virus": false,
  "positions": [["genus", 0, 8], ["specificEpithet", 9, 16], ["authorWord", 17, 22], ["authorWord", 22, 26], ["authorWord", 30, 35], ["authorWord", 38, 43], ["authorWord", 44, 48]],
  "nameStringId": "7a74c523-abf3-5176-84bf-9e85ac09497f",
  "parserVersion": "1.0.2",
  "hybrid": false,
  "details": [{
    "genus": {
      "value": "Dicranum"
    },
    "specificEpithet": {
      "value": "braunii",
      "authorship": {
        "value": "Müll. Hal. ex Bosch & Sande Lac.",
        "basionymAuthorship": {
          "authors": ["Müll. Hal."],
          "exAuthors": {
            "authors": ["Bosch", "Sande Lac."]
          }
        }
      }
    }
  }],
  "bacteria": false
}

I am happy with these results, with the proviso that the authors and exAuthors are the wrong way around for botanical names (which these examples happen to be). Us botanists might have to just suck that up, or prevail on GNA to include both ways in the parser result.
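If the parser output stays as it is, the swap could be done on the consumer side. A minimal sketch, assuming the author-team shape shown in the GNParser 1.0.2 output above (`authors` plus a nested `exAuthors.authors`); `botanical_view` is a hypothetical helper, not part of any parser:

```python
from typing import Any, Dict


def botanical_view(team: Dict[str, Any]) -> Dict[str, Any]:
    """Swap authors and ex-authors in a GNParser-1.0.2-style author team."""
    ex = team.get("exAuthors")
    if not ex:
        # Nothing to swap; return a shallow copy.
        return dict(team)
    # Keep any other keys (e.g. a year) untouched.
    result = {k: v for k, v in team.items() if k not in ("authors", "exAuthors")}
    result["authors"] = list(ex["authors"])
    result["exAuthors"] = list(team["authors"])
    return result


team = {"authors": ["Müll. Hal."], "exAuthors": {"authors": ["Bosch", "Sande Lac."]}}
swapped = botanical_view(team)
# swapped["authors"] is now the team that follows "ex" in the name string.
```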

So, new proposal:

mdoering commented 5 years ago

@nielsklazenga for author parsing please use the latest version of the GBIF parser which is exposed here: http://api.col.plus/parser/name?name=Stagonospora+polyspora+M.T.+Lucas+%26+Sousa+da+C%C3%A2mara+1934

mdoering commented 5 years ago

Dicranoloma braunii (Bosch & Sande Lac.) Paris

// http://api.col.plus/parser/name?name=Dicranoloma%20braunii%20%28Bosch%20%26%20Sande%20Lac.%29%20Paris

[
  {
    "name": {
      "scientificName": "Dicranoloma braunii",
      "rank": "species",
      "genus": "Dicranoloma",
      "specificEpithet": "braunii",
      "candidatus": false,
      "combinationAuthorship": {
        "authors": [
          "Paris"
        ]
      },
      "basionymAuthorship": {
        "authors": [
          "Bosch",
          "Sande Lac."
        ]
      },
      "code": "botanical",
      "type": "scientific",
      "parsed": true,
      "authorship": "(Bosch & Sande Lac.) Paris"
    }
  }
]

Dicranum braunii Müll.Hal. ex Bosch & Sande Lac.

// http://api.col.plus/parser/name?name=Dicranum%20braunii%20M%C3%BCll.Hal.%20ex%20Bosch%20%26%20Sande%20Lac.

[
  {
    "name": {
      "scientificName": "Dicranum braunii",
      "rank": "species",
      "genus": "Dicranum",
      "specificEpithet": "braunii",
      "candidatus": false,
      "combinationAuthorship": {
        "authors": [
          "Bosch",
          "Sande Lac."
        ],
        "exAuthors": [
          "Müll.Hal."
        ]
      },
      "type": "scientific",
      "parsed": true,
      "authorship": "Müll.Hal. ex Bosch & Sande Lac."
    }
  }
]

Ex-authors are a pain because the ordering is indeed reversed in zoology. But they hardly ever exist in zoological names, so I strongly recommend following the botanical habit as the default. To get really good parsing results, name parsers need to know the nomenclatural code which the name adheres to (I can't find the right parameter name for the GBIF parser right now, but it should then switch the ex-authors).
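A toy sketch of what "knowing the code" changes, assuming only the ordering rule stated in this comment. `split_ex` is hypothetical and is not how either parser is implemented; it ignores years and bracketed basionym authorship entirely.

```python
import re
from typing import Dict, List


def _team(text: str) -> List[str]:
    # Split an author team on "&" or ",".
    return [part for part in re.split(r"\s*(?:&|,)\s*", text.strip()) if part]


def split_ex(authorship: str, code: str = "botanical") -> Dict[str, List[str]]:
    before, sep, after = authorship.partition(" ex ")
    if not sep:
        return {"authors": _team(authorship), "exAuthors": []}
    if code == "zoological":
        # Reversed ordering, as described in the comment above.
        return {"authors": _team(before), "exAuthors": _team(after)}
    # Botanical habit as the default: the team after "ex" holds the authors.
    return {"authors": _team(after), "exAuthors": _team(before)}


result = split_ex("Müll.Hal. ex Bosch & Sande Lac.")
```

With the botanical default, `result` matches the `combinationAuthorship` in the GBIF response above: `authors` is `["Bosch", "Sande Lac."]` and `exAuthors` is `["Müll.Hal."]`.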

mdoering commented 5 years ago

As for additional literal/verbatim terms, I strongly suggest reading https://github.com/tdwg/dwc/issues/181, which fails to come to a final conclusion but seems to lean towards avoiding extra 1:1 sister terms just for the verbatim form. It should be handled differently, as it's a provenance problem and you can get many different versions/interpretations for one record.

mdoering commented 5 years ago

Based on @nielsklazenga I would then propose:

mdoering commented 5 years ago

Some more zoological & bacterial examples

Pseudomonas syringae pv. aceris (Ark, 1939) Young, Dye & Wilkie, 1978

// http://api.col.plus/parser/name?name=Pseudomonas%20syringae%20pv.%20aceris%20%28Ark%2C%201939%29%20Young%2C%20Dye%20%26%20Wilkie%2C%201978

[
  {
    "name": {
      "scientificName": "Pseudomonas syringae pv. aceris",
      "rank": "pathovar",
      "genus": "Pseudomonas",
      "specificEpithet": "syringae",
      "infraspecificEpithet": "aceris",
      "candidatus": false,
      "combinationAuthorship": {
        "authors": [
          "Young",
          "Dye",
          "Wilkie"
        ],
        "year": "1978"
      },
      "basionymAuthorship": {
        "authors": [
          "Ark"
        ],
        "year": "1939"
      },
      "code": "bacterial",
      "type": "scientific",
      "parsed": true,
      "authorship": "(Ark, 1939) Young, Dye & Wilkie, 1978"
    }
  }
]

Acipenser gueldenstaedti colchicus natio danubicus Movchan, 1967

// http://api.col.plus/parser/name?name=Acipenser%20gueldenstaedti%20colchicus%20natio%20danubicus%20Movchan%2C%201967

[
  {
    "name": {
      "scientificName": "Acipenser gueldenstaedti natio danubicus",
      "rank": "natio",
      "genus": "Acipenser",
      "specificEpithet": "gueldenstaedti",
      "infraspecificEpithet": "danubicus",
      "candidatus": false,
      "combinationAuthorship": {
        "authors": [
          "Movchan"
        ],
        "year": "1967"
      },
      "code": "zoological",
      "type": "scientific",
      "parsed": true,
      "authorship": "Movchan, 1967"
    }
  }
]

nielsklazenga commented 5 years ago

Thanks @mdoering, would you have an example of zoological name with an ex-author?

Regarding tdwg/dwc#181, literal vs. node/resource, verbatim vs. interpreted and provided/raw vs. processed are all different discussions. Here, we are talking about the first: literal vs. resource. Darwin Core already has that in the form of the 'dwc' and 'dwciri' namespaces (and to some extent also in having both scientificName and scientificNameID etc.), 'dwc' having the properties that have literals as their objects and 'dwciri' the properties that have nodes. I would also argue that Darwin Core already has properties that take verbatim values (e.g. all the verbatim... and ...Remarks terms) and properties that take interpreted values (all the properties for which using a controlled vocabulary is recommended best practice). Issue tdwg/dwc#181 is, or has become, mostly about having different properties for raw and processed data and I agree that we shouldn't have that. A "verbatimIdentification" (incl. identification qualifiers and name addenda, if any), however, would be a nice thing to have.

baskaufs commented 5 years ago

I have posted an issue tdwg/tag#22 that is related to this one. It suggests using the W3C SKOS-XL model as a generic method for handling labels across TDWG vocabularies. I used some of the parsed string examples from this thread in my illustrations of how it could work.

The document of examples linked in that issue does not get into the issues we've discussed here about the processing of verbatim name strings and the issue of having verbatim and non-verbatim versions of terms. But I can make a suggestion based on the (possibly) oversimplified system that was included in TCS 1.0.

TCS 1.0 included two analogous terms: tc:hasName and tc:nameString. Those terms were defined strictly in the TaxonConcept ontology document. tc:nameString was a datatype property, indicating that its value should be a string literal, and tc:hasName was an object property, indicating that its value should be a URI representing an abstract name object (possibly the "name thing" I refer to in my SKOS-XL backgrounder). You can see this illustrated graphically in this diagram. I believe that the idea in TCS 1.0 was that every provider would provide at least the name string and if possible, more information about the name entity would be fleshed out through properties of the tn:TaxonName instance. tc:nameString is essentially the literal value term (analogous to a term in the dwc: namespace) and tc:hasName is essentially a URI value term (analogous to a term in the dwciri: namespace).

So here's how I would imagine these TCS 1.0 terms working with SKOS-XL. A provider with limited data management capabilities might provide only a value of tc:nameString in a spreadsheet. An aggregator would then process the string, generating tn:TaxonName and skosxl:Label instances as necessary. The provenance information about the provider and its dataset would then be associated with the appropriate skosxl:Label instance. There wouldn't really need to be a verbatim term because every skosxl:Label instance has the exact form of the string as its skosxl:literalForm value and the provider information could be associated with the label instance.
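A rough sketch of what the aggregator step might emit, written as plain subject-predicate-object tuples with prefixed names rather than with an RDF library. The example URIs are hypothetical, and the choice of `skosxl:altLabel` to link the name to its label and `dcterms:source` to record the provider are assumptions made for illustration, not something settled in this thread.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def label_triples(name_uri: str, label_uri: str,
                  name_string: str, provider_uri: str) -> List[Triple]:
    return [
        (name_uri, "rdf:type", "tn:TaxonName"),
        (label_uri, "rdf:type", "skosxl:Label"),
        # The label instance carries the exact string as supplied.
        (label_uri, "skosxl:literalForm", name_string),
        # Assumed linking property between the name and its label.
        (name_uri, "skosxl:altLabel", label_uri),
        # Assumed provenance property pointing at the provider's dataset.
        (label_uri, "dcterms:source", provider_uri),
    ]


triples = label_triples(
    "https://example.org/name/1",       # hypothetical
    "https://example.org/label/1",      # hypothetical
    "Dicranoloma braunii (Bosch & Sande Lac.) Paris",
    "https://example.org/dataset/42",   # hypothetical
)
```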

Alternatively, a provider with more robust data management capabilities might just skip the tc:nameString part and directly generate the tn:TaxonName and skosxl:Label instances and metadata on the fly.

Processed label data could also be transferred among providers in the form of tn:TaxonName and skosxl:Label metadata. In that case, there would be no additional processing required nor any guessing about the properties and provenance of the strings because that information would be included in the metadata.

As I said, we may call the "name things" something different than tn:TaxonName, but this general approach for recording and linking data about strings to abstract entities should work regardless. It would also work equally well for any kind of situation where we have URI- and literal-value analogs of terms (dwc:recordedBy and dwciri:recordedBy, ac:reviewerLiteral and ac:reviewer, etc.).

nielsklazenga commented 5 years ago

Thanks @baskaufs, I think it is worth looking into how (and if) this can be applied in our little subdomain. We'll discuss it at the next meeting.