tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

New Term - MaterialCitation #329

Closed myrmoteras closed 3 years ago

myrmoteras commented 3 years ago

New term

Efficacy requirement: Within this community there is a consensus that this new class will accomplish the desired outcome. The equivalent term material-citation (https://terms.tdwg.org/wiki/tp:material-citation) in the TaxPub Journal Article Tag Suite is already used in the production of over 30 scholarly journals (e.g., http://plazi.org/resources/schemas-and-ontologies/taxpub/).

Stability requirement: Since this is a new class, it will not interfere with existing implementations but will rather help to resolve a well-known issue.

Proposed attributes of the new term:

BioCASe/ABCD provides for a slightly different set of values:

<xs:enumeration value="PreservedSpecimen"/>
<xs:enumeration value="LivingSpecimen"/>
<xs:enumeration value="FossileSpecimen"/>
<xs:enumeration value="OtherSpecimen"/>
<xs:enumeration value="HumanObservation"/>
<xs:enumeration value="MachineObservation"/>
<xs:enumeration value="DrawingOrPhotograph"/>
<xs:enumeration value="MultimediaObject"/>
<xs:enumeration value="AbsenceObservation"/> 
tucotuco commented 3 years ago

I have updated the term name to comply with the Darwin Core convention of using upper camel case for Class names. I have also noted where I have made minor modifications. One thing that needs to be changed is the "Examples" content. As a class, what is needed here are strings describing entities that would be considered members of the class rather than a reference to a specific instance of one. Please revise.

Note that if this proposal is ratified, it would be best to add MaterialCitation to the Examples for basisOfRecord. A separate, provisional issue could be made for that.
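
To make the basisOfRecord suggestion concrete, here is a minimal Python sketch of checking a record against a set of example values with MaterialCitation added. The set below is only illustrative: it is assembled from class names mentioned in this thread, not copied from the normative Darwin Core documentation, and the helper name `has_known_basis` is hypothetical.

```python
# Illustrative only: class names taken from this thread, not a normative list.
BASIS_OF_RECORD_EXAMPLES = {
    "PreservedSpecimen",
    "LivingSpecimen",
    "FossilSpecimen",
    "HumanObservation",
    "MachineObservation",
    "MaterialSample",
    "MaterialCitation",  # the addition suggested in the comment above
}


def has_known_basis(record: dict) -> bool:
    """Return True if the record's basisOfRecord is one of the example values."""
    return record.get("basisOfRecord") in BASIS_OF_RECORD_EXAMPLES


# A hypothetical record harvested from a publication.
record = {"basisOfRecord": "MaterialCitation"}
print(has_known_basis(record))
```

A record with a missing or misspelled basisOfRecord simply fails the check, which is usually what a data publisher wants to catch before upload.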

myrmoteras commented 3 years ago

I have updated the MaterialCitation examples with respective strings.

tucotuco commented 3 years ago

I have updated the MaterialCitation examples with respective strings.

@myrmoteras Those are examples of content for a property rather than examples of what kinds of things constitute a MaterialCitation. Have a look at the examples given for other Darwin Core classes to get an idea of what is sought here. We're looking for something more like "An Occurrence documented in a taxonomic treatment in a journal article", "An Occurrence mentioned in a field notebook", etc.

myrmoteras commented 3 years ago

@tucotuco is this better now? I made an attempt.

tucotuco commented 3 years ago

That should work, thank you @myrmoteras.

deepreef commented 3 years ago

I'm not 100% sure I understand the scope of this proposed new class, but if it is for what I think it is for, then I strongly support it. We have discussed the idea of an "Evidence" class to represent a superclass of things that represent evidence supporting either an instance of Occurrence or an instance of Identification. The "things" we track that can serve in this "Evidence" capacity include:

If I understand this proposal correctly, it would effectively represent the last item on the list above, and the one that is not already accommodated (at least partially) by existing DwC classes. If my interpretation is correct, then I would suggest adjusting the definition to something like:

"Instances of organisms recorded within published or unpublished references that serve as evidence of occurrence and/or taxonomic identification."

I would also add under the Examples, something along the lines of:

In our implementation, we create "virtual" instances of organisms for any report of a taxon at a locality. For example, if a publication provides a list of species for a particular region, without any specific evidence (specimen, image, specific observation), then we create an Occurrence instance, linked to a "virtual" organism, and anchor that Occurrence instance (via an Evidence instance) to the publication. I'm assuming something like this would be considered within scope of MaterialCitation?

tucotuco commented 3 years ago

@deepreef You have the sense of it correct. Your recommendations for amending the term look good to me. If @myrmoteras agrees, we will make those changes before sending this out in a pending massive public review.

myrmoteras commented 3 years ago

I don't understand "taxonomic identification" in this context, though. Maybe @deepreef can explain? Otherwise I agree with @deepreef.

mguidoti commented 3 years ago

I think, and I can be wrong, that any identified material citation is a statement of a taxonomic identification - someone is saying that the studied material belongs to a specific taxon concept that is currently known by a specific label (taxon name). Also, not all of these material citations will have geographical data - this is odd, somewhat rare, but true.

Thus, by defining it as '...as evidence of occurrence and/or taxonomic identification' I think @deepreef nailed it.

tucotuco commented 3 years ago

I think I understand the confusion introduced by the reference to Identification. It suggests that the term could be used for multiple contexts, and that wouldn't be good for semantics. As I understand it, @myrmoteras meant this as written material as evidence of an Occurrence. The Occurrence carries an Identification with it by definition - it posits the presence or absence of an Organism identified as a member of a Taxon at a place and time. The confusion would be if this allowed for written material about the (re)identification of an Organism alone - in other words, it is not positing the evidence of an Occurrence but rather an opinion about an Identification alone. I think it should be made clear that it is not the latter.

deepreef commented 3 years ago

Arghh!!! I had written a detailed comment, then managed to kill the browser before posting... so that's 15 minutes I'll never get back. OK, here's the short version:

I tossed in the "and/or taxonomic identification" bit specifically as a way of paving the path towards a future "Evidence" class/concept. This is something that @baskaufs and I and others have been discussing for several years (including some epically long emails in recent days), and is best represented in this diagram (https://raw.githubusercontent.com/darwin-sw/dsw/master/img/dsw-1-0-graph-model.png), as "Token" (="Evidence"). The idea behind Token/Evidence is that it is (sort of) a superclass for things like LivingSpecimen, PreservedSpecimen, FossilSpecimen, HumanObservation, MachineObservation, MaterialSample, various multimedia files (images, videos, sound recordings, telemetry tracking devices, etc.), and published reports of organisms. That last bit is why I'm enthusiastic about creating MaterialCitation, to go along with the other semi-classes in DwC that serve the function of Token/Evidence.

If you look at the Darwin-SW diagram linked above, you'll see that Token (=Evidence) can serve two roles:

  • Evidence of Occurrence
  • Evidence of taxonomic Identification

The description of the new term MaterialCitation focuses on the "evidence of occurrence" part, but it can also function as "evidence of Identification". For example, an in-situ video clip can simultaneously represent evidence that "this organism was here", and also "this organism should be identified as taxon X". Similarly, a published treatment can represent evidence that "organisms occurred at this place", and also "organisms belong to this taxon" (e.g., via morphological or genetic characters included within the treatment).

Maybe it was an over-reach at this stage to include the "and/or taxonomic identification" language in the definition/description of MaterialCitation; in which case it's fine to remove those words from the definition text. If so, we can address it later, whenever the "Evidence" class discussion takes place.

I hope that makes sense... not as clear as my original version, but no time to re-write it carefully again.

tucotuco commented 3 years ago

That makes sense to me, but I reiterate my concern over dual semantics. It would be better to have a class for each one. And if we want to anticipate an Evidence class, why not call this term WrittenEvidenceOfOccurrence, and the other one WrittenEvidenceOfIdentification?


deepreef commented 3 years ago

I see where you're coming from, but if you're concerned with dual semantics, then I would simply drop the "and/or taxonomic identification" bit from the definition. I think it's better to keep MaterialCitation consistent in general nature with the other evidence-like classes of DwC. I would avoid getting too deep into the weeds with concepts like WrittenEvidenceOfOccurrence and WrittenEvidenceOfIdentification at this stage. Better, I think, to stick with MaterialCitation in keeping with what @myrmoteras was originally proposing, then when we get serious about Evidence, perhaps consider the split into EvidenceOfOccurrence and EvidenceOfIdentification separately. I already discussed this in more detail in the recent email to you and @baskaufs; for now, I think we're on the right track in establishing MaterialCitation in parallel with the other "tokens" that serve in an evidentiary capacity.

wouteraddink commented 3 years ago

On behalf of CETAF ISTC (a group of informatics experts representing the CETAF and DiSSCo community in Europe): the group supports and recommends the implementation of this proposal.

AaronWilton commented 3 years ago

I support the addition of this Class - as necessary to support current and emerging research.

tdikow commented 3 years ago

I support the addition of this Class. MaterialCitation will allow freeing up specimen occurrence data from past publications. These data will often involve specimens sitting in collections but that have not been digitized yet.

deepreef commented 3 years ago

These data will often involve specimens sitting in collections but that have not been digitized yet.

This might be a tangent to this issue, and if it's not sufficiently relevant, please ignore. However, I've been meaning to raise an issue with respect to generating duplicate/redundant Occurrence instances based on MaterialSamples, and cited MaterialSamples in the form of MaterialCitations.

I was reviewing the GBIF Occurrence records for one of my favorite species, and it turns out that there are 21 of them. However, there are actually only 7 Occurrence records represented. Six of the 21 records in GBIF come from the institutions that hold the respective specimens, 7 of them have been separately published by OBIS (harvested from the publication), and 8 have been separately published by PLAZI (also harvested from the publication). Coincidentally (or not?) the one PLAZI published twice is the one that was not published by the original institution that actually holds the specimen (Western Australian Museum).

I bring this up here, in this context, to encourage mechanisms whereby the generation of new MaterialCitation instances do not artificially manufacture duplicate/redundant Occurrence instances. At the moment, two-thirds of the records for Chromis abyssus in GBIF are drawn from MaterialCitation evidence, and as a consequence we have 3 times as many Occurrence records represented in GBIF as there are actual Occurrence instances. I don't know how pervasive this sort of duplication is in GBIF, but I worry that it will continue to expand if all specimen records cited in publications serve as the basis for generating new Occurrence instances where those Occurrence records are already represented by PreservedSpecimen records.

Of course, I have no idea how people who harvest specimen data from publications (e.g., PLAZI) will be able to reliably discover existing matching Occurrence records in GBIF (I'm sure Guido can figure something out!), so it may be the lesser of evils in this phase of biodiversity informatics history to generate redundant records rather than ignore published records. But I think this is an issue that the minters of MaterialCitation instances should at least be cognizant of.

tdikow commented 3 years ago

@deepreef I entirely agree with you that duplicate/redundant Occurrence records can be an issue when MaterialCitation data are gathered from publications. Matching records from MaterialCitation to a record already in GBIF (during data capture, such as through Plazi) seems quite complex, but could hopefully be accomplished, especially if we use/cite specimen identifiers on the physical specimen that are unique well beyond our own institutions ('globally' unique, although not in the strict sense; I would argue that a specimen identifier from my institution such as USNMENT01234567 is 'globally' unique). In addition, GBIF is working on a clustering algorithm to group similar records, and the hope would be that your example of 21 Occurrence records and only 7 PreservedSpecimen records would be matched to each other to reflect the correct number of physical specimens known to scientists.

I live in the insect world, and there are millions and millions of specimens in museum collections that have not yet been, and will not in the near future be, digitized. Providing access to specimen occurrence data from peer-reviewed and published articles can bring to light an enormous number of insect records that can be used for all sorts of analyses by being made available in GBIF. This argument doesn't solve the duplicate/redundant Occurrence records problem, but it supports the notion that insect collection digitization is lagging behind and that in most cases a MaterialCitation record from Plazi from a recently published article might be the first and only Occurrence of that species in GBIF. Looping in @myrmoteras and @mguidoti.

deepreef commented 3 years ago

Thanks, @tdikow ! I'm actually not worried about PLAZI per se (they know what they're doing) -- I'm much more worried about other initiatives that might start harvesting MaterialCitation instances from literature, but without the experience PLAZI has. We can try to link up records through the trusty "DwC triplet" (institutionCode+collectionCode+catalogNumber), but I suspect that will only be reliable a fraction of the time.
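
For would-be harvesters, the triplet matching mentioned above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: records are plain dicts keyed by Darwin Core term names, matching is case-insensitive, and the function names (`triplet`, `group_by_triplet`) and all sample values are hypothetical. It is not how GBIF clustering or Plazi actually works.

```python
from collections import defaultdict

TRIPLET_TERMS = ("institutionCode", "collectionCode", "catalogNumber")


def triplet(record):
    """Build a normalized DwC triplet key, or None if any part is missing."""
    parts = tuple((record.get(term) or "").strip().upper() for term in TRIPLET_TERMS)
    return parts if all(parts) else None


def group_by_triplet(records):
    """Group records sharing a triplet; records without a full triplet are returned separately."""
    groups = defaultdict(list)
    unmatched = []
    for record in records:
        key = triplet(record)
        if key is None:
            unmatched.append(record)
        else:
            groups[key].append(record)
    return dict(groups), unmatched


# Hypothetical sample data: one museum record and two literature-derived records.
records = [
    {"occurrenceID": "museum-1", "basisOfRecord": "PreservedSpecimen",
     "institutionCode": "XYZ", "collectionCode": "Fish", "catalogNumber": "0001"},
    {"occurrenceID": "lit-1", "basisOfRecord": "MaterialCitation",
     "institutionCode": "xyz", "collectionCode": "fish", "catalogNumber": "0001"},
    {"occurrenceID": "lit-2", "basisOfRecord": "MaterialCitation",
     "institutionCode": "XYZ", "collectionCode": None, "catalogNumber": "0002"},
]

groups, unmatched = group_by_triplet(records)
for key, members in groups.items():
    print(key, [m["occurrenceID"] for m in members])
print("no full triplet:", [m["occurrenceID"] for m in unmatched])
```

Any group with more than one member is a candidate duplicate that deserves a look before a new occurrenceID is minted. The third sample record shows why this is only reliable a fraction of the time: a citation that omits any part of the triplet cannot be matched this way at all.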

I also have faith that GBIF clustering efforts will deal with a lot of this sort of thing. And I fully agree with you that the benefits of harvesting otherwise "dark" specimen records from MaterialCitation instances will (vastly?) outweigh the cost of a few duplicate records here and there (i.e., the "lesser of two evils"). I mostly wanted to shine a spotlight on this issue for would-be harvesters of literature-based specimen records.

As for me, I'm actually much more excited about minting MaterialCitation records for literature reports that are not backed up with vouchers. Indeed, in my neck of the evolutionary tree, those will likely far outnumber the otherwise "dark" PreservedSpecimen instances hiding in literature that are not already represented from source Museums. We in the coral-reef fish world have lots of unvouchered reports of Occurrences in publications that will help us properly document geographic distributions with far greater granularity than the vouchers alone could provide.

myrmoteras commented 3 years ago

There is another aspect of MaterialCitation: the link to the context where the specimen has been used. This can be an article, or in the Plazi case a taxonomic treatment. This connection MaterialCitation-treatment-article allows us to make statements such as:

A MaterialCitation functions like a transaction record for an authoritative identification of a specimen.

As stated above, MaterialCitation are about a specimen, occurrence, not an occurrence per se.

deepreef commented 3 years ago

As stated above, MaterialCitation are about a specimen, occurrence, not an occurrence per se.

EXACTLY! In other words, "Evidence of Occurrence" (and/or "Evidence of Identification"). In other words "Token" sensu Darwin-SW.

My only point is that the minters of new MaterialCitation instances should be cautious about automatically minting new Occurrence instances (i.e., minting new occurrenceID values, when such IDs may already exist).

ekrimmel commented 3 years ago

For fossilSpecimen occurrences in GBIF, at least 3,575,489 records out of 11,611,790 would most likely fall within the proposed materialCitation class. This update would have a significant impact on the clarity of the fossil specimen occurrence records vs. literature occurrence records (e.g., from the Paleobiology Database dataset) and is strongly encouraged from the collections data provider point of view for many of the reasons already noted in this thread.

Erica Krimmel, Holly Little (@hollyel), and Talia Karim (@tkarim) (on behalf of the Paleo Data Working Group)

Archilegt commented 3 years ago

I routinely reverse-database old published catalogs of collections, in order to discover missing types and other specimens. That reverse digitization is also valuable for institutions which don't have complete collection databases. I am happy to use MaterialCitation for reverse-digitization outputs, as there was no specific term for this. I am also imagining a use case in which a database manager could add MaterialCitation(s) to the "complete" institutional database, meaning that a specimen currently known only from the literature is added to a database that is otherwise considered complete. I have done this with the MCZBase staff, for example. Now those literature records could be displayed as MaterialCitation. Please keep the definition comprehensive. That is, material citations should come not just from the scholarly literature, but also from unpublished museum catalogs, field notes, etc. The definition "A reference to biological material (e.g., a specimen, an observation) including those used as the basis of a taxonomic description." (https://terms.tdwg.org/wiki/tp:material-citation) looks good to me.

tucotuco commented 3 years ago

@Archilegt This term is now in production. You can see the final definition, usage comments and examples (which includes field notebooks) at https://dwc.tdwg.org/terms/#materialcitation.

Archilegt commented 3 years ago

Hi @tucotuco, thank you for the link. Please note that the narrower definition ("...in scholarly publications") is contradicted by the last example ("...in a field note book"). That will cause confusion and we will then have people asking "Why field note book, if it is not a scholarly publication?". I think that we do want the field notes and collection catalogs included in the MaterialCitation class. That leaves us with no other option than to change the definition. Or is there another solution to the definition vs. example contradiction?

tucotuco commented 3 years ago

For consistency, I think the definition has to change. I don't think there was any opinion counter to permitting more than scholarly publications, even if that was the primary motivation for the creation of the term. However, given that the definition is normative and the usage comments and examples are not, this has to be a formal change with associated review.


deepreef commented 3 years ago

I agree -- it was an oversight to restrict the definition to "scholarly publications" (a term with an entirely arbitrary and ambiguous meaning). The term I use in such circumstances is "published or unpublished documents", or "published or unpublished references".

tucotuco commented 3 years ago

Since this particular term addition is already a done deal, we'll need a new "Term - change" request. From here:

How can I propose a change to the Darwin Core standard?

Changes can be proposed at any time. There are three requirements that must be satisfied in order to justify that a change be considered, the demand requirement, the efficacy requirement, and the stability requirement. Be prepared to satisfy these requirements if you want to make a change. The details of these requirements can be found in Section 3.1 Justifications for Change in the Vocabulary Maintenance Specification.

To initiate the change request, the preferred method is to create a new issue (https://github.com/tdwg/dwc/issues/new/choose) in the Darwin Core Issue Tracker. Use the appropriate template to create a new issue and fill it out as clearly, concisely, and completely as possible. Don't worry, if there are things you can not provide, the Darwin Core Maintenance Group will help you to develop the issue to maturity. If you can not or do not wish to use the Issue Tracker, please send the proposal in a message to tdwg-content@lists.tdwg.org.

myrmoteras commented 3 years ago

It was intentional to restrict MaterialCitation to scholarly publications. This is where an expert provides and communicates his opinion about the identity of a specimen as part of his research, generally in the form of, and as part of, a taxonomic treatment, from where it is often linked to other data.

Scholarly publications are well established through the way scientists work, building a corpus of hundreds of millions of printed pages, resulting in more than 15,000 treatments of new species published every year, plus a multiple of that number of treatments annotating previous treatments with new results in reference to the published work.

There have always been discussions on what to consider a scholarly publication. In the digital age, there are discussions about what scholarly publications are and in what format they have to be published.

One important change in the digital age is that digital copies of works exist that can be cited and accessed from everywhere, at any time, by anybody, and that data within such works can be made open, findable, accessible, interoperable and reusable (FAIR). Whilst scholarly publications were produced with the intention to disseminate results as widely as possible, field notebooks were meant as a complement to fieldwork, as a basis for further analyses. However, in the digital age, a digitized field notebook can be as easily made accessible as a scholarly publication, including notes about a specimen.

MaterialCitation in the now-defined sense can be considered a specific kind of MaterialCitation in a more general sense, that is, something citing a specimen, as can occur in field books or even emails.

In the above sense, they are the result of a research project and thus have a special status. They are often the only evidence of the presence of a specimen (see e.g. the contribution to the long tail data of species only once mentioned in GBIF), and their identification is vouchered by the source treatment or publication.

It thus does make sense to keep them in a format so that they can be retrieved, but admittedly they could or should be included in a category “MaterialCitation in the wider sense”.

deepreef commented 3 years ago

It was intentional to restrict MaterialCitation to scholarly publications.

Are you saying that you oppose the revision of the definition of MaterialCitation in DwC to broaden its scope to include unpublished sources of information? If so, I strongly disagree (i.e., I support revising the definition to generalize it).

I think restricting the source of MaterialCitation instances to "scholarly publications" is a very bad idea. In my experience, field notebooks and other non-published documents contain just as much valuable information from experts as published works do (sometimes more), so it would be a mistake to exclude such resources from global biodiversity data exchange (or to characterize them in a different way at the class level).

Obviously, scholarly publications will form the bulk of instances of MaterialCitation records, and that's great. But there is no rational basis that I can think of to actively exclude unpublished documents as sources of information for taxonomic and other biodiversity information. In some ways, this would be similar to restricting MaterialSample instances to only specimens housed in public museums, and excluding any instances housed in private collections or other repositories.

As with the public museum vs. private collection example, the threshold at which a document represents a "scholarly publication" is subjective and arbitrary. There would be endless debates about whether some records surpass the threshold for inclusion vs. exclusion.

However, in the digital age, a digitized field notebook can be as easily made accessible as a scholarly publication, including notes about a specimen.

Yes, exactly! So what advantage is there to explicitly exclude such valuable (and accessible) sources of biodiversity information?

My enthusiasm for supporting the MaterialCitation class is because it opens an important door to biodiversity data provenance by representing References (sensu lato) as a source of Evidence. Just as preserved specimens can serve as critical provenance/evidence of organism occurrences and taxonomic identifications, so too can documents (published or otherwise).

If you want to be able to restrict query results to scholarly publications, then perhaps some terms could be added to flag "scholarlyPublication" or "peerReviewed" or "singleCopyDocument" or whatever metric(s) you want to use to qualify different sources. But it seems absurd to me to exclude valuable sources of printed (paper or electronic) information based on whether or not it represents an arbitrary distinction of "scholarly publication".
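
As a rough illustration of the flagging idea, the Python sketch below filters MaterialCitation records by qualifier flags. The flag names `scholarlyPublication` and `peerReviewed` are the hypothetical ones floated above; they are not Darwin Core terms, and the function name `filter_citations` and the sample records are likewise made up for the example.

```python
def filter_citations(records, **required_flags):
    """Keep records whose flags equal every required value; a missing flag counts as no match."""
    return [
        record
        for record in records
        if all(record.get(flag) == value for flag, value in required_flags.items())
    ]


# Hypothetical MaterialCitation records with hypothetical qualifier flags.
citations = [
    {"source": "journal article", "scholarlyPublication": True, "peerReviewed": True},
    {"source": "field notebook", "scholarlyPublication": False, "peerReviewed": False},
    {"source": "museum catalog", "scholarlyPublication": False},
]

scholarly_only = filter_citations(citations, scholarlyPublication=True)
print([c["source"] for c in scholarly_only])
```

The point of the design is that the class stays broad while the restriction becomes a query-time choice: nothing is excluded from the data, and a user who only trusts peer-reviewed sources can still ask for exactly those.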

debpaul commented 3 years ago

@deepreef wrote:

My enthusiasm for supporting the MaterialCitation class is because it opens an important door to biodiversity data provenance by representing References (sensu lato) as a source of Evidence. Just as preserved specimens can serve as critical provenance/evidence of organism occurrences and taxonomic identifications, so too can documents (published or otherwise).

Yes ^^^ my enthusiasm too. I cannot like this enough. Finally a place for references to grey literature, for example. I think this was in the original thread. Maybe it just got lost in translation?

tucotuco commented 3 years ago

Done.

tucotuco commented 3 years ago

Note: I closed this issue to signify that the change request went through the public review process and resulted in the ratified term as defined here and in the Quick Reference Guide. Comments including and after https://github.com/tdwg/dwc/issues/329#issuecomment-894325931 should be carried to the new issue https://github.com/tdwg/dwc/issues/372.