tdwg / material-sample

A Task Group of the Observations and Specimen Records (OSR) Interest Group

Other Deliverable - BasisOfRecord review #11

Closed Jegelewicz closed 2 years ago

Jegelewicz commented 2 years ago

Task Group will make a recommendation [...] as to which class in the Darwin Core standard these properties belong which may also include recommendations for terms being revised, added, disambiguated, or deprecated. Depends upon definitions provided [in primary deliverable]. [...] Recommendations will be provided for a revised formal definition as it pertains to materialSample but will not consider other data types.

Current Darwin Core Placement/Definition

http://rs.tdwg.org/dwc/terms/basisOfRecord

this term is a property of Record-level

Definition

The specific nature of the data record.

Examples

PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation

Comments

Recommended best practice is to use the standard label of one of the Darwin Core classes.

See also

umbrella issue related to dwc:basisOfRecord and an Evidence class: https://github.com/tdwg/dwc/issues/302

timrobertson100 commented 2 years ago

When dealing with Material Samples specifically using the Darwin Core Archive data format (i.e. star schema data structure used by GBIF/OBIS etc) I feel we have to make a decision between one of these options:

  1. We continue to use an occurrence record (rowType=Occurrence) for each record representing an item of preserved material. Issues with this approach relate to confusion (does the row represent the occurrence of the species in nature, or the material evidence of that, or some join across those or...) and to record identification (what is occurrenceID identifying, noting that GBIF has enforced uniqueness constraints on occurrenceID?). basisOfRecord is required here to distinguish "what" kind of occurrence the row is representing.

  2. We introduce a new type of row (rowType=MaterialSample) and create a row for each material sample. When using this option materialID would uniquely identify the row and basisOfRecord would be unnecessary.

  3. We promote more use of an event-oriented archive (rowType=SamplingEvent) where the core of the star represents the gathering event or action of subsampling (e.g. extract of a leg) and an extension captures the material record. The extension could be an occurrence, which exists as a schema today or we create a new material one. This has issues, as you cannot today(!) have a third level of extension to capture multiple identifications of the specimen, or measurements, or other associated data with the material.

I think we need a decision on this to know what to do with basisOfRecord.

  1. If we use rows of type Occurrence then basisOfRecord is required to categorize the nature of the row (an observation or a specimen etc). We should then review if the current vocabulary is adequate or should be expanded.

  2. If we opt for rows that are of type Material, then basisOfRecord can be ignored in this task group.

Does this make any sense to others, please?

I'd like to understand why it was made required in the IPT with a controlled vocabulary and if that provided the kinds of results people were expecting when that decision was made.

It is necessary for users to perform basic filtering on GBIF e.g. to exclude living specimens and fossils from a map. While not perfect it covered most of the basic categorization needs for search and reporting (removing ex-situ records is still an issue in some cases).
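The kind of basic filtering described above can be sketched in a few lines. This is only an illustration, not how GBIF implements it: the record identifiers and the `records_for_map` helper are hypothetical, and only the basisOfRecord values are taken from the Darwin Core examples listed at the top of this issue.

```python
# Hypothetical occurrence rows; only the basisOfRecord values come from
# the Darwin Core examples. The occurrenceID values are invented.
records = [
    {"occurrenceID": "occ-1", "basisOfRecord": "PreservedSpecimen"},
    {"occurrenceID": "occ-2", "basisOfRecord": "LivingSpecimen"},
    {"occurrenceID": "occ-3", "basisOfRecord": "FossilSpecimen"},
    {"occurrenceID": "occ-4", "basisOfRecord": "HumanObservation"},
]

# Values a user might want to keep off a distribution map.
EXCLUDED = {"LivingSpecimen", "FossilSpecimen"}


def records_for_map(rows, excluded=EXCLUDED):
    """Return the rows whose basisOfRecord is not in the excluded set."""
    return [row for row in rows if row["basisOfRecord"] not in excluded]


mappable = records_for_map(records)
print([row["occurrenceID"] for row in mappable])
```

The filter only works if every row carries a value from a small controlled vocabulary, which is presumably why the term is required in the IPT.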

albenson-usgs commented 2 years ago

Seems like 2 is how people would be expecting to approach this and therefore might have the easiest time being adopted if the technological solution for it isn't too difficult.

Thank you for explaining the rationale for basisOfRecord being required by the IPT. Michael Hope gave similar reasons for the Atlas and from my perspective we will need to keep using basisOfRecord this way and introduce a new term to indicate types of evidence.

deepreef commented 2 years ago

I was not able to join the second meeting yesterday, as I had hoped. Is there a link where I can access the recording (if it was recorded)?

There are several separate but related issues swirling around basisOfRecord:

  1. How it's framed and defined in DwC vs. how it's used in IPT vs. how content providers actually assign values to this term. One of our priorities should be to harmonize these.
  2. Whether it maintains its role as indicating the class/subclass of record in a sensu lato way (as indicated by the example values in the documentation and its organization with Record-level terms), or whether it is re-framed as something along the lines of "type of evidence" specifically for Occurrence records (in which case it should be organized within the Occurrence class terms, and re-defined with appropriate examples accordingly).
  3. Whether other, more explicitly defined terms (e.g. specimenType, evidenceType, or occurrenceType, occurrenceEvidence, etc.) are needed to accommodate the various needs, and if so, how basisOfRecord is re-defined accordingly (or deprecated entirely).
  4. Probably other stuff.

@timrobertson100 : Although I support continuing to maintain Occurrence as the "primary core" (mostly because that's what a lot/most of the data consumers are looking for), I'm uneasy about option # 1 because it perpetuates the problem we currently have, which is forcing non-flat data about multiple things (e.g., Occurrence+MaterialSample+Identification+Taxon+Event+Location+etc.) into a flat structure, as though all these things exist simply for the purpose of documenting instances where organisms occur in nature. While that may (currently) be the primary need for existing data consumers, things like MaterialSample records obviously have other use cases besides just representing the Occurrence instance at which they were extracted from nature.

I generally support # 2:

We introduce a new type of row (rowType=MaterialSample) and create a row for each material sample. When using this option materialID would uniquely identify the row and basisOfRecord would be unnecessary.

but with a couple of caveats:

  • I trust you mean materialSampleID (not materialID)?
  • basisOfRecord might be unnecessary in the context of DwCA, but you'd still want the ability to distinguish between preserved vs. fossil vs. living specimens; and also basisOfRecord still could have value among the Record-level terms for indicating things like Occurrence vs. MaterialSample vs. Taxon vs. Location vs. Event vs. etc.
  • I would assume/hope that Occurrence would still be a core in IPT, in which case MaterialSample could also be represented as an extension (as well as a core)?

I definitely think there is a pathway where Event becomes the "primary" core; but I'm not sure this solves much unless people start minting Event records for many other kinds of events besides collecting/observing. At the moment, our community pretty much only mints events in the context of Occurrence instances (in the form of Organism-at-Event); but having Event as the primary core would make more sense if it had broader use in other Location-at-Time things (e.g., Organism-was-identified-as-Taxon events).

The most powerful solution would be to have ResourceRelationship as the primary core, serving as a "fat" triple-store of relationships among records (i.e., the ultimate join table), and then have each of the other DwC classes represented as extensions (effectively flattened cached values for the properties in each class where parity of property value to object is 1:1). This would get us a step closer to a future world where everything is serialized as RDF (I'm just term-dropping here -- I don't really understand what I'm saying). However, this is probably a bridge too far at this stage in the history of DwC and IPT; so it might be asking too much to make that leap right now (although if anyone is interested in exploring it, I'll gladly generate some sample datasets in this form).
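A rough sketch of what a ResourceRelationship "join table" could look like in practice. The three keys are existing Darwin Core ResourceRelationship terms; the identifiers, the relationship labels, and the two lookup helpers are assumptions made up for illustration.

```python
# Each row mimics a ResourceRelationship record. The keys are Darwin Core
# ResourceRelationship terms; identifiers and labels are hypothetical.
relationships = [
    {"resourceID": "ms-1", "relationshipOfResource": "evidence for", "relatedResourceID": "occ-1"},
    {"resourceID": "ms-2", "relationshipOfResource": "evidence for", "relatedResourceID": "occ-1"},
    {"resourceID": "ms-1", "relationshipOfResource": "evidence for", "relatedResourceID": "occ-2"},
    {"resourceID": "ms-2", "relationshipOfResource": "derived from", "relatedResourceID": "ms-1"},
]


def related(rows, resource_id, relationship):
    """Resources that resource_id points to through the given relationship."""
    return [
        r["relatedResourceID"]
        for r in rows
        if r["resourceID"] == resource_id and r["relationshipOfResource"] == relationship
    ]


def related_to(rows, related_id, relationship):
    """Resources that point to related_id through the given relationship."""
    return [
        r["resourceID"]
        for r in rows
        if r["relatedResourceID"] == related_id and r["relationshipOfResource"] == relationship
    ]


print(related(relationships, "ms-1", "evidence for"))
print(related_to(relationships, "occ-1", "evidence for"))
print(related(relationships, "ms-2", "derived from"))
```

Because every link is its own row, the same table holds a many-to-many relationship (samples to occurrences) and a recursive one (sample to sample) without any change in structure.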

If we use rows of type Occurrence then basisOfRecord is required to categorize the nature of the row (an observation or a specimen etc). We should then review if the current vocabulary is adequate or should be expanded.

As I noted above, in this scenario I would also explore a new term (something like occurrenceType or occurrenceEvidence or basisOfOccurrence or something) to function in this capacity, and refine basisOfRecord as a higher-level class designator (and left among the Record-level terms).

It is necessary for users to perform basic filtering on GBIF e.g. to exclude living specimens and fossils from a map. While not perfect it covered most of the basic categorization needs for search and reporting (removing ex-situ records is still an issue in some cases).

I think this is the genesis of the overloading of the term (if, indeed, there is agreement that it's overloaded). Because we didn't have a term to do what is needed for this kind of filtering (specimen vs. observation; living vs. preserved vs. fossil specimens; human vs. machine observations), basisOfRecord was the best-fit existing term to use for this purpose. This is why I still support the idea of a new term (or terms) to explicitly capture information used for this kind of filtering, and relegate basisOfRecord to a more generalized purpose. I agree with what @tucotuco said in the chat of our first meeting yesterday that conceptually it's fine to scope basisOfRecord to accommodate n-levels of sub-classing; but that doesn't mean I think it's the most useful way to structure DwC and scope this term.

Apologies for the long post.

deepreef commented 2 years ago

Crap... there's one more point I wanted to make, but forgot.

So, I think the reason why Occurrence was the natural "core" for a flattened system like DwCA is that it was, at the time, the most "granular" item among the DwC classes. Generally when one flattens a normalized data model, the rows of the flattened dataset represent the most granular nugget of information, and all the other properties are inherited from various 1:many associated relationships. That obviously breaks for many:many relationships (e.g., multiple Identification instances); but we can usually cheat those (e.g., by just defaulting to the "most current identification").
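The "cheat" for multiple identifications can be sketched as follows. This assumes ISO 8601 date strings (which sort correctly as plain text); the records, the placeholder names "Aus bus" and "Aus cus", and the `flatten` helper are all hypothetical.

```python
# Hypothetical normalized data: one occurrence with two identifications.
occurrences = [{"occurrenceID": "occ-1", "eventDate": "2019-06-01"}]
identifications = [
    {"occurrenceID": "occ-1", "scientificName": "Aus bus", "dateIdentified": "2019-06-02"},
    {"occurrenceID": "occ-1", "scientificName": "Aus cus", "dateIdentified": "2021-03-15"},
]


def flatten(occurrence_rows, identification_rows):
    """Produce one flat row per occurrence, keeping only the latest identification."""
    flat_rows = []
    for occ in occurrence_rows:
        candidates = [
            ident for ident in identification_rows
            if ident["occurrenceID"] == occ["occurrenceID"]
        ]
        row = dict(occ)
        if candidates:
            # ISO 8601 dates compare correctly as strings.
            latest = max(candidates, key=lambda ident: ident["dateIdentified"])
            row["scientificName"] = latest["scientificName"]
            row["dateIdentified"] = latest["dateIdentified"]
        flat_rows.append(row)
    return flat_rows


flat = flatten(occurrences, identifications)
print(flat)
```

The earlier identification is simply dropped from the flat output, which is the information loss that comes with this shortcut.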

If we explore the idea of establishing an Evidence class in DwC (which is outside the scope of this task group, I know...), then I can see a pathway where it represents the most granular representation of our data (at least from the perspective of consumers who are primarily interested in occurrence data). What this means is that the "core" record in GBIF would become an instance of Evidence, and there may be multiple rows that share the same occurrenceID (in cases where multiple lines of evidence support the same occurrence).

In a sense, that's already happening in GBIF whenever the same occurrence is represented multiple times (e.g., by a MaterialSample record and a MaterialCitation, and maybe one or more multimedia items, etc.)
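To make the "multiple rows share the same occurrenceID" idea concrete, here is a small sketch. No Evidence class exists in Darwin Core, so the `evidenceID` and `evidenceType` columns and all values are assumptions.

```python
from collections import defaultdict

# Hypothetical Evidence-core rows; several can point to one occurrence.
evidence_rows = [
    {"evidenceID": "ev-1", "evidenceType": "MaterialSample", "occurrenceID": "occ-1"},
    {"evidenceID": "ev-2", "evidenceType": "MaterialCitation", "occurrenceID": "occ-1"},
    {"evidenceID": "ev-3", "evidenceType": "StillImage", "occurrenceID": "occ-2"},
]


def group_by_occurrence(rows):
    """Map each occurrenceID to the types of evidence that support it."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["occurrenceID"]].append(row["evidenceType"])
    return dict(grouped)


by_occurrence = group_by_occurrence(evidence_rows)
print(len(evidence_rows), "evidence rows,", len(by_occurrence), "occurrences")
```

A consumer counting occurrences would then count distinct occurrenceID values rather than rows.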

I'm not a fan of flattened data representations, but if there is value in maintaining that approach, then this might be one way to address some of the issues.

deepreef commented 2 years ago

I was not able to join the second meeting yesterday, as I had hoped. Is there a link where I can access the recording (if it was recorded)?

Nevermind! I found it.

baskaufs commented 2 years ago

(If you are seeing this in an email notification, go to the GitHub issue if you want to see the diagram.)

Assuming that we all recognize that the star-schema system is a "band-aid" (that is, an insufficient system used in the absence of something better being available), it seems to me that we should design the star in a way that simultaneously meets as many needs as possible. I am going to propose an alternative to what @timrobertson100 suggested: an organism core.

[Diagram: IMG_3445]

In this model the core table is organism and extensions are occurrence, material sample, and identification. Just to clarify, "core" here does not mean "more important" and "extensions" does not mean less important. Core means sitting in the center of the star and extension means surrounding the core.

We get rid of basisOfRecord. Each row in each table has an rdf:type column. Within the occurrence, identification, and organism tables, every row has the same type (the type of the table, i.e. dwc:Occurrence, dwc:Identification, and dwc:Organism). Within the materialSample table, the rdf:type column would contain a value for whatever class you'd like to apply to the sample in that row (dwc:PreservedSpecimen, dwc:FossilSpecimen, ex:TissueSample, ex:DnaExtract, or whatever). If it makes people happy, we could continue to use a rowType column and everyone would just understand that means rdf:type.

In the core (organism) file, the id field is an identifier for the organism.

In the occurrence table, the coreid field links to the organism identifier -- many occurrences may apply to one organism (data logging, camera trapping, etc.) or there may be only one (in the case of collected specimens). Data fields in the occurrence table will include dwc:recordedBy (and any other fields that really apply to occurrences and not samples), collapsed event/location data such as dwc:eventDate, dwc:decimalLatitude, dwc:stateProvince, etc.

In the identification table, the coreid field links to the organism identifier. In many cases there will only be one determination, but this structure would allow multiple determinations if they are available by having more than one line with the same organism identifier. Data fields would include dwc:identifiedBy and Taxon class terms.

In the MaterialSample table, the coreid field links to the identifier for the organism that the sample ultimately came from. The sample may be directly derived from the organism as in a preserved specimen or living specimen (in which case the living specimen would BE the organism, but that's irrelevant -- it's still derived from that organism). The sample may also be indirectly derived from the organism through instances of subsampling. Many samples to one organism are possible (duplicate specimens collected of the same tree, legs taken off of a whole insect mount, tissue sampled from a mammal, DNA sampled from tissue from a mammal, etc.) The fields in the material sample table would be fields that are really about samples, like dwc:preparations and dwc:disposition, but not fields that are actually about occurrences like location information.

In order to make links between the occurrence, material sample, and determination tables, each row would need to have an identifier that's unique within the dataset. Optimally they would be IRIs, but that wouldn't be a requirement. One could use dcterms:identifier for this column.
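The table layout described so far can be sketched with plain lists of dictionaries. The column names follow the description above (id, coreid, rdf:type, dcterms:identifier); every identifier and value, and the `rows_for_organism` helper, is hypothetical.

```python
# Core table: one row per organism. All identifiers are invented.
organisms = [{"id": "org-1"}, {"id": "org-2"}]

# Extension tables: each row points back to the core through coreid and
# carries its own dataset-unique identifier.
occurrences = [
    {"coreid": "org-1", "dcterms:identifier": "occ-1",
     "rdf:type": "dwc:Occurrence", "dwc:eventDate": "2020-05-01"},
    {"coreid": "org-1", "dcterms:identifier": "occ-2",
     "rdf:type": "dwc:Occurrence", "dwc:eventDate": "2021-05-01"},
]
identifications = [
    {"coreid": "org-1", "dcterms:identifier": "id-1",
     "rdf:type": "dwc:Identification", "dwc:identifiedBy": "A. Collector"},
]
material_samples = [
    {"coreid": "org-2", "dcterms:identifier": "ms-1",
     "rdf:type": "dwc:PreservedSpecimen", "dwc:preparations": "skull"},
]


def rows_for_organism(organism_id, *extension_tables):
    """Collect the identifiers of all extension rows linked to one organism."""
    return [
        row["dcterms:identifier"]
        for table in extension_tables
        for row in table
        if row["coreid"] == organism_id
    ]


print(rows_for_organism("org-1", occurrences, identifications, material_samples))
print(rows_for_organism("org-2", occurrences, identifications, material_samples))
```

The organism table contributes nothing but the shared key, which matches the point that the core here acts as a join rather than as the most important table.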

Subsampling relationships would be handled simply by having a field in the material sample table called something like "derivedDirectlyFrom". The derivedDirectlyFrom value for a row would be the organism ID if it was derived directly from the organism (e.g. blood samples, preserved specimens, etc.) or the identifier value from a different row in the material sample table if it was the result of a subsampling event.
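A minimal sketch of how a "derivedDirectlyFrom" column could be followed back to the organism. The column name is the one suggested above; the identifiers, the ex: types, and the `derivation_chain` helper are hypothetical.

```python
# Hypothetical material sample rows: specimen -> tissue -> DNA extract.
material_samples = [
    {"dcterms:identifier": "ms-1", "coreid": "org-1",
     "rdf:type": "dwc:PreservedSpecimen", "derivedDirectlyFrom": "org-1"},
    {"dcterms:identifier": "ms-2", "coreid": "org-1",
     "rdf:type": "ex:TissueSample", "derivedDirectlyFrom": "ms-1"},
    {"dcterms:identifier": "ms-3", "coreid": "org-1",
     "rdf:type": "ex:DnaExtract", "derivedDirectlyFrom": "ms-2"},
]


def derivation_chain(sample_id, samples):
    """Follow derivedDirectlyFrom links from a sample up to its source.

    The walk stops at the first identifier that is not a material sample,
    which in this layout is expected to be the organism identifier.
    """
    by_id = {s["dcterms:identifier"]: s for s in samples}
    chain = [sample_id]
    seen = {sample_id}
    current = sample_id
    while current in by_id:
        parent = by_id[current]["derivedDirectlyFrom"]
        if parent in seen:
            raise ValueError(f"cycle in derivation chain at {parent}")
        chain.append(parent)
        seen.add(parent)
        current = parent
    return chain


print(derivation_chain("ms-3", material_samples))
```

One consistency check falls out of this for free: the last element of the chain should equal the row's coreid.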

If material samples are associated with occurrences (they wouldn't have to be), this could be documented with an "evidenceForOccurrence" column in the material sample table that contained the identifier for the occurrence it was associated with. This mechanism would allow many samples to be associated with a single occurrence, but not the other way around. However, the assumption that a material sample came from a single organism pretty much imposes this limit anyway.

If material samples are associated with determinations, this could be documented with an "evidenceForIdentification" column. This mechanism would allow for many samples to be associated with a single identification, but not the other way around. That would be limiting if the same sample (e.g. specimen) were the evidence for several determinations, but I suppose there would be some kludgy way to jury-rig this in the identification record.
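The two evidence columns can be sketched like this. The column names are the ones proposed above; the rows and the `samples_supporting` helper are hypothetical.

```python
# Hypothetical material sample rows. Each evidence column holds at most one
# identifier, so a sample can point to only one occurrence and one
# identification, while many samples can point to the same target.
material_samples = [
    {"dcterms:identifier": "ms-1",
     "evidenceForOccurrence": "occ-1", "evidenceForIdentification": "id-1"},
    {"dcterms:identifier": "ms-2",
     "evidenceForOccurrence": "occ-1", "evidenceForIdentification": None},
    {"dcterms:identifier": "ms-3",
     "evidenceForOccurrence": None, "evidenceForIdentification": None},
]


def samples_supporting(target_id, samples, column):
    """Identifiers of samples whose evidence column points at target_id."""
    return [s["dcterms:identifier"] for s in samples if s[column] == target_id]


print(samples_supporting("occ-1", material_samples, "evidenceForOccurrence"))
print(samples_supporting("id-1", material_samples, "evidenceForIdentification"))
```

The single-valued columns are exactly what makes the relationship many-to-one; recording a sample as evidence for several identifications would need a separate link table or a multi-valued field.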

The organism table would have almost no fields other than id (maybe dwc:organismName). That will probably seem weird to a lot of people given that it's the core table, but recall that core doesn't mean "most important". In this case, the organism table is basically serving as a join between the other three tables. The organism table is in the middle of the star because in the end, determinations, occurrences, and material samples are all telling us something about the organism. When we make a determination, it's a determination about the organism a leg or DNA sample came from, not a determination of the leg or DNA sample themselves.

This system is a bit more complicated than what is typical now, but that's offset by the ability to have a relatively simple model that is able to handle otherwise complicated things that we say we want to track like derived material samples, multiple determinations, resampling organisms over time, and linking multiple material samples to a single occurrence. It is also totally buildable with the existing Darwin Core archives/DwC Text Guide specification. It is also extensible because everything that I said about the MaterialSample extension table could also be said about a DigitalMaterial extension table, which could include media files and born digital electronic records. Those digital materials could be linked to occurrences, determinations, and other resources from which they were derived just like the material samples. It just would mean creating a fourth extension table.

The other thing about this model is that it could very simply and easily be used to generate linked data. It's basically the center part of the Darwin-SW model (see http://bit.ly/2dG85b5 Fig. 1 for relational ER diagram or Fig. 2 for a graph diagram) and the fields I used above could be mapped to Darwin-SW object properties to generate RDF.

There are several common kinds of cores that this would not handle (Taxon and Event). But the other proposals don't address them either. By the way, I don't think it's a great idea to conflate occurrence recording events (dwc:Event instances) with subsampling "events" (one material sample being created from another). They would share some properties, but in my mind their roles are very different.

deepreef commented 2 years ago

@baskaufs : This is excellent! I'll spend some time digesting it. This week I will try to make a similar sketch and description for the ResourceRelationship-core approach (which I was thinking extensively about this morning on my way in to work -- must be some kind of psychic connection).

One quick comment, though: the relationship between MaterialSample and Occurrence as "Evidence" is many-to-many; not one-to-many as shown in your diagram.

smrgeoinfo commented 2 years ago

@baskaufs do the rows in the organism table represent individuals or classes of organisms?

baskaufs commented 2 years ago

@smrgeoinfo individuals. But that would be individual organisms as defined by DwC, which can also include taxonomically uniform groups of organisms like packs, clones, etc.

Jegelewicz commented 2 years ago

It isn't clear to me that there needs to be a "core". If we create records of "rdf:type" and relate them as appropriate then why does anything need to be the core?

Jegelewicz commented 2 years ago

Also, isn't the "star schema" a GBIF/Darwin Core Archive thing? Is that something we should be concerning ourselves with? I am worried that we are working on GBIF issues (that we cannot resolve) instead of definitions of Darwin Core terms.

baskaufs commented 2 years ago

@Jegelewicz Darwin Core Archive is a specific implementation, but the "star schema" system with core and extension files is actually laid out in the Darwin Core text guide, which is officially a part of Darwin Core. It's not the only way to use Darwin Core, but it's probably the most common way.

The "core" and "extension" designation is terminology built into normative parts of the text guide specification. See Section 2.1.2 and beyond.

I get your point about

If we create records of "rdf:type" and relate them as appropriate then why does anything need to be the core?

That's really a kind of Linked Data argument and I agree with it totally. The reality is that there are tons of people using a system based on the Text Guide. So how do we move in the direction where we allow as many types of things to be represented in their own table (be distinct types) rather than "flattening" them into fewer tables and therefore losing the ability to create many one-to-many relationships?

My suggestion (put Organism in the middle of the star) was intended to maximize the number of distinct tables that could be handled by the existing star schema design laid out in the text guide. It does not allow for at least two other relatively common designs where events or taxa are put in the middle of the star. What I believe it does fix is the various problems people have with putting occurrence in the middle.

Jegelewicz commented 2 years ago

@baskaufs thanks for that explanation. I'm betting that next to zero collection managers know about or understand this. In fact, I can read it and sorta get what's going on, but I know I could not confidently explain it to anyone.

Given all of that - if we must choose a "core", I think that for museums, the "core" that makes sense is MaterialSample, although now that everyone has been trained to think of occurrence as core, that will be very difficult to change. I say MS for core because we are primarily managing physical objects that may or may not represent an organism (or many organisms) - often the organism is only implied. Having implied data be the "core" seems wrong somehow. I know this argument will not go over well, even among museums, who somehow still think that their mouse skull IS an "organism" and maybe they are right - who am I to say!

Maybe it really doesn't matter what the "core" is? There could be data sets with anything at the core as long as we have well-defined terms and row:types and I should (with a bit of work) be able to mash any of them together (I think?).

My head hurts....

deepreef commented 2 years ago

although now that everyone has been trained to think of occurrence as core, that will be very difficult to change

This is an issue/problem/concern that I've been aware of/trying to address for a long time (many years). The first step was to create the necessary classes in DwC (MaterialSample, Organism; still working towards something like Evidence) to parse out the "meat" (so to speak) of the ubiquitous Occurrence instances, a very large portion of which actually represent "the circumstances when a specimen was extracted from nature", but are often thought of as the specimen itself.

I'm actually very encouraged by the slow but steady progress to solve this, and the existence of this Working Group is a VERY important step in the right direction. I guess my main point of reassurance here is that there is a lot of "inertia" in our very broad community, so fundamental shifts do require time. While this shift has been years in the making, the ship is definitely turning, so I'm actually kind of excited and optimistic that we're on the right track.

There are several general ways we can move forward on this:

  1. Baby Step: We define MaterialSample as another Core for the existing star schema architecture, and represent occurrences associated with those MS instances as an Extension.
  2. Moderate Step: We define Organism as another Core for the existing star schema architecture, as per @baskaufs diagram/description.
  3. Big Step: We define ResourceRelationship as "the" Core for the existing star schema architecture, as I have suggested (and will eventually describe and illustrate).
  4. Giant Leap: We retire the star schema architecture, and develop a new architecture to move structured packages of our data around (to GBIF, iDigBio, and elsewhere).

There are other options in there as well (e.g., introducing an Evidence class and representing that as the Core).

The key question, I think, is how big of a step is our broader community willing to take at this stage? I'm very confident that we're ready to at least take the Baby Step. I'm likewise very skeptical that we're ready to make the Giant Leap. But it's less clear whether the Moderate or Big Steps are practical/realistic.

Obviously, some of this is beyond the scope of this Task Group. But this Task Group is sort of at the epicenter of the larger issue -- which ultimately boils down to finding the right balance between flat/simple data structure vs. highly complex/normalized data structure at the exchange level. The star schema is a compromise between one end of the spectrum (simple, flat table of Occurrences) and the other end (RDF triple store). There are other options representing compromises at different stages along the same spectrum. The trick is to find the sweet spot, then take the steps necessary to help shift the community in the right direction.

Yes, my head hurts too. But my heart also beats (with excitement about the prospects of real progress in a long journey)!

baskaufs commented 2 years ago

@Jegelewicz In response to your recent comment, I just want to emphasize that I think we are using the term "core" in two distinct ways.

In your comment, I believe you are using the term "core" to mean "the table that contains information about the kind of thing we think is most important in a particular community". When I suggest that Organism should be the core table, I intend for "core" to have the technical meaning it is given in Section 2 of the DwC Text Guide: the table that sits at the center of the star in the star schema.

In most current cases, I think that the various available "cores" (occurrence, event, taxon) position the table for what a community considers the "most important thing" ("core" in @Jegelewicz sense) in the "core" position in the star ("core" in @baskaufs sense). What I am advocating is that in the interest of making it possible to document the more complex kinds of relationships people want, we get away from this "center of the universe" thinking (i.e. the "most important table" has to be in the center of the star). If we use Organism as the "core" file in the star, it is actually likely to be the LEAST important table in the star -- it's just the table best positioned to link many of the other ones that people do think are important (occurrences, material samples, media items, and identifications) and still keep the existing star schema system.

baskaufs commented 2 years ago

@deepreef in response to your comment, I think your listing of possible steps based on how big they are is a good statement of the situation. The one thing that I would say in response is that I don't think that it is clear that there is a benefit to having a distinct "evidence" class. Despite what we did in defining a Token (a.k.a. Evidence) class in Darwin-SW, I don't really think that there is anything to be gained from it.

There is a benefit to be gained when a resource of any type (e.g. MaterialSample, image, sound) is identified as evidence through linking it to an occurrence or an identification. In other words, any kind of thing becomes evidence when we assert that it serves as evidence (using yet-to-be-defined property terms). We could create some domain assertion for those properties that automatically entail that the thing used as evidence is an instance of an "Evidence" class, but what would be the benefit? I think that people look at types more to understand what kind of thing something is (material object, image, etc.) rather than what role it plays. If you want to know whether it has an evidentiary role, look to see if it has an "isEvidenceFor" property.

If you are a visual type person (like me) and these words confuse you, look at this picture:

[Figure: Darwin-SW graph diagram]

Imagine that we have a MaterialSample (maybe a museum specimen) in the position of the diagram labeled dsw:Token. If it has a value for the property dsw:evidenceFor, then we know that it's evidence for an Occurrence. If it has a value for the property dsw:isBasisForId, then we know that it's evidence for an Identification. Why do we need to check if it has rdf:type of dsw:Token? That's superfluous and it's much more useful to know that its type is dwc:MaterialSample or even more specifically dwc:PreservedSpecimen.

This is a sort of "Occam's razor" argument -- why create a class if it doesn't add anything to our understanding of the resource.
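The argument can be illustrated with a short sketch in which the evidentiary role is read from properties rather than from a type. The property names dsw:evidenceFor and dsw:isBasisForId are the ones named above; the dictionaries and the `evidentiary_roles` helper are hypothetical.

```python
def evidentiary_roles(resource):
    """List the evidence roles a resource plays, based only on its properties."""
    roles = []
    if resource.get("dsw:evidenceFor"):
        roles.append("evidence for Occurrence " + resource["dsw:evidenceFor"])
    if resource.get("dsw:isBasisForId"):
        roles.append("evidence for Identification " + resource["dsw:isBasisForId"])
    return roles


# Two resources of the same type; only the first has been asserted as evidence.
linked_specimen = {
    "rdf:type": "dwc:PreservedSpecimen",
    "dsw:evidenceFor": "occ-1",
    "dsw:isBasisForId": "id-1",
}
unlinked_specimen = {"rdf:type": "dwc:PreservedSpecimen"}

print(linked_specimen["rdf:type"], evidentiary_roles(linked_specimen))
print(unlinked_specimen["rdf:type"], evidentiary_roles(unlinked_specimen))
```

Both resources keep the informative type (dwc:PreservedSpecimen), and nothing in the check needs a separate Evidence or Token class.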

deepreef commented 2 years ago

The one thing that I would say in response is that I don't think that it is clear that there is a benefit to having a distinct "evidence" class.

I actually agree -- which is why I've not been beating the "Evidence Class" drums very much lately. When we did our implementation of Evidence-stuff, it was clear that each "thing" ("token"?) that functions as evidence (e.g., for an Occurrence, or for an Identification, or possibly for an Event, GeologicContext, MeasurementOrFact, etc.) is not, by its nature, "Evidence". Rather, these things/tokens are entities of their own (e.g., MaterialSample, multimedia item, literature documentation, documented report, etc.), each with their own specific class properties. What makes them function as "Evidence" is the relationships they have with these other things (Occurrence, Identification, etc.). Basically, tangible things serve as "evidence" for abstract instances.

So... the "Evidence-ness" really ought to be represented as instances of the ResourceRelationship (which I am increasingly coming to view as the universal "many-to-many join" for links between -- and even recursively within -- instances of other DwC classes).

I'm not sure if we're basically saying the same thing here, but when I look at the dsw diagram, I think dsw:Token is really sort of a placeholder to represent not "any" class, but rather the subset of entities that have tangible (physical or digital) manifestation. Currently, the only DwC class we have in that space is MaterialSample, but multimedia and (potentially) literature are two others that could fit that scope.

Organisms/Agents (which I see as fundamentally the same thing, with the latter restricted to a single particular taxon) are partly tangible (they have physical manifestation), but partly abstract (an Organism represents the dynamic set of matter and kinetic/chemical action that spans more or less from birth to death/disintegration). The same thing could probably be said for instances of Location (which seem to be at the intersection of tangible and abstract). But I don't see these as things that themselves function as evidence, nor are they really in need of being supported by evidence, so I'm not sure they operate in the same way.

OK, that's enough Sunday-morning armchair philosophizing....

baskaufs commented 2 years ago

@deepreef Yes, it sounds like we are pretty much in agreement.

I'm pretty much agnostic about the mechanism for documenting many-to-many relationships. In the "organism core" scheme I suggested, I treated relationships between "evidence" like MaterialSample instances and Occurrence (or Identification) as many-to-one. The reason for that is because it was easy to fit into the star schema system, not because there weren't many-to-many relationships, which would have to be documented in some other way.

Jegelewicz commented 2 years ago

Organisms/Agents (which I see as fundamentally the same thing, with the latter restricted to a single particular taxon)

YES! I have been arguing this within Arctos for some time now!

If you have loads of free time - this epic issue is interesting and @deepreef contributed there as well, but here are some of the "arguments for organism as agent" highlights:

https://github.com/ArctosDB/arctos/issues/1966#issuecomment-474532230
https://github.com/ArctosDB/arctos/issues/1966#issuecomment-515429487
https://github.com/ArctosDB/arctos/issues/1966#issuecomment-830288611
https://github.com/ArctosDB/arctos/issues/1966#issuecomment-832982254

Here is an example of an "organism" agent:

Kianga in Arctos

And the discussion continues.... https://github.com/ArctosDB/arctos/issues/3765

All living things are agents. We can treat Homo sapiens as special kinds of agents, but almost everything that applies to people also applies to other species. I don't know why I have to argue so hard for using the exact same model for other living things that we use for people and I don't feel like anyone has made a really good argument against it. I think that nobody wants to admit that Homo sapiens are just another species on planet Earth? Also, no one wants to add another table to their data or attempt to manage "mouse12345" just because part of it is somewhere else or it had 5 embryos. This is the social issue that needs to be overcome if we are going to do this well - IMO.

deepreef commented 2 years ago

In the "organism core" scheme I suggested, I treated relationships between "evidence" like MaterialSample instances and Occurrence (or Identification) as many-to-one. The reason for that is because it was easy to fit into the star schema system, not because there weren't many-to-many relationships, which would have to be documented in some other way.

Yes -- that's why I'm captivated by the idea of having ResourceRelationship at the Core of a star schema, because that can handle all one-to-one, one-to-many, and many-to-many relationships between, within, and among instances of all the other classes. I suppose one could even express relationships between instances of ResourceRelationship, but that's a bit meta for this late in the day, and is causing my head to hurt just thinking about it.

deepreef commented 2 years ago

All living things are agents. We can treat Homo sapiens as special kinds of agents, but almost everything that applies to people also applies to other species. I don't know why I have to argue so hard for using the exact same model for other living things that we use for people and I don't feel like anyone has made a really good argument against it. I think that nobody wants to admit that Homo sapiens are just another species on planet Earth? Also, no one wants to add another table to their data or attempt to manage "mouse12345" just because part of it is somewhere else or it had 5 embryos. This is the social issue that needs to be overcome if we are going to do this well - IMO.

This captures my own point of view perfectly! Indeed, we have many instances where individual non-human organisms have names, so it works both ways (i.e., all properties of dwc:Organism apply to individuals identified as H. sapiens, and many/most properties of Agents as expressed in various data models could potentially be applied to non-human organisms). I know it's weird to think about, but from a data modelling perspective, it's practically a no-brainer.

But... we digress from the topic at hand...

dagendresen commented 2 years ago

More of an issue is that in Simple Darwin Core all you get is a row for an Occurrence, an Event, or a Taxon, but that "record" can be "about" lots of things at the same time, and the one I'm interested in publishing might not be the one that interests someone else searching biodiversity databases.

+1 If the basisOfRecord value "I" am thinking of (when publishing a Simple Darwin Core dwc:Occurrence record) is different from the basisOfRecord value "you" will be thinking of when using the record in your research - why then do we need basisOfRecord at all? Why not simply use proper (resolvable) identifiers for the actual things we think of when we publish the records, and acknowledge that when we do not use such identifiers the basisOfRecord will not add much value anyway?

ten thumbs up for deprecating dc:type and dwc:basisOfRecord in favor of rdf:type

+1

MaterialSample as a type of evidence for a species occurrence

What happens if the specimen is not originating from nature at all? Imagine the offspring from an animal kept in captivity (or a plant kept in cultivation) sampled and used as a type specimen as evidence for a new scientific name? If the specimen in a scientific collection lacks spatial or temporal information about the location or date from which it was sampled - is it then still (only) evidence for a species occurrence?

A specimen or MaterialSample modeled (or published) as an instance of Occurrence always reduces the information value of the MaterialSample to be evidence for the occurrence of a taxon. While a specimen can be soooo much more. A specimen can be the evidence for a scientific name, it can be the source for DNA, a donor of genetic diversity for plant breeding, etc, etc. My argument is that specimens may have many other and different roles than only as evidence for a species occurrence! I will thus argue that we might rather need to model an independent MaterialSample class that is NOT reduced to (only) be evidence for a species occurrence.

If modelling collection objects is your thing, please keep them separate from Darwin Core

It looks to me like many of us have a bit of an agenda in how we describe the history, origin, and original purpose of Darwin Core. (Was the primary purpose of Darwin Core to describe evidence for species occurrences or to describe specimens in a collection?) Maybe rather focus on what Darwin Core has become (to many people) and where we want to go from here? Alternatively maybe it is time to start a new TDWG core for natural history specimens? E.g. an Aristotle Core? I hope not.

because it perpetuates the problem we currently have, which is forcing non-flat data about multiple things (e.g., Occurrence+MaterialSample+Identification+Taxon+Event+Location+etc.) into a flat structure, as though all these things exist simply for the purpose of documenting instances where organisms occur in nature

+1

deepreef commented 2 years ago

MaterialSample as a type of evidence for a species occurrence What happens if the specimen is not originating from nature at all? Imagine the offspring from an animal kept in captivity (or a plant kept in cultivation) sampled and used as a type specimen as evidence for a new scientific name?

An occurrence is an occurrence, regardless of whether an organism participated in the associated Event of its own volition, or was somehow assisted in being present at its place and time with the aid of some other organism. You give the example of a captive organism, but the same applies to parasites or stomach contents of birds/fish/mammals/etc., pollen on bees, and countless other examples where the participation of one organism at an event was facilitated by another organism -- regardless of whether that other organism happened to be identified as Homo sapiens.

The point is, if DwC lacks the necessary terms to capture the properties we want to do our analyses, then we need to fill those gaps in DwC. But in any case, an instance of MaterialSample (with its associated metadata) still can serve as evidence of an occurrence.

If the specimen in a scientific collection lacks spatial or temporal information about the location or date from which it was sampled - is it then still (only) evidence for a species occurrence?

The same applies to images that lack corresponding metadata, or any other kind of potential evidence. Maybe instead of describing MaterialSample instances as a form of evidence, we should refer to them as "potential evidence". In any case, without spatial or temporal metadata (aka association with an Event), it's pretty useless in terms of serving as evidence of an occurrence -- regardless of whether it's an organism-based occurrence or a taxon-based occurrence. However, it can still serve as evidence of Identification. And the reverse may also be true -- that a specimen in a Museum serves as evidence of an organism occurrence, but is damaged such that the taxonomically diagnostic character(s) are absent, and therefore functions as weak/useless evidence of Identification. It can still have an Identification -- just not so much evidence supporting it.

A specimen or MaterialSample modeled (or published) as an instance of Occurrence always reduces the information value of the MaterialSample to be evidence for the occurrence of a taxon. While a specimen can be soooo much more.

+1 Exactly! This is why we're having this discussion -- to elevate MS to something much more than simple (potential) evidence of occurrence. That doesn't mean that MS instances cannot function as evidence of occurrences (and identifications) -- it just means that MS instances can have other functions in other contexts.

A specimen can be the evidence for a scientific name,

Well... sort of... only if it's a name-bearing type, and even then, it inherits its evidentiary role through an instance of Identification. So, the pedant in me wants to clarify that a specimen can be evidence for a dwc:Identification instance, rather than for a scientific name.

it can be the source for DNA, a donor of genetic diversity for plant breeding, etc, etc.

...or the subject of an image, or an item sent on loan, or an object monitored over time to study preservation techniques (just to flesh out the "etc." a bit more...)

My argument is that specimens may have many other and different roles than only as evidence for a species occurrence!

+10!!

I will thus argue that we might rather need to model an independent MaterialSample class that is NOT reduced to (only) be evidence for a species occurrence.

Yes, but isn't that exactly why this Task Group exists?

Maybe rather focus on what Darwin Core has become (to many people) and where we want to go from here?

Agreed! The "purpose" of DwC has evolved over time, and it continues to evolve, and that's a good thing in my opinion. From my perspective, it began as a way to exchange specimen data using a mostly-flat structure, then expanded to share information on occurrence records using a mostly-flat structure, then was implemented in a slightly less-flat star schema via GBIF. Now I think we're gradually easing towards more sophisticated structuring because we want to share information more precisely and robustly than we have before. Many/most content providers are not ready to go too far down this path, because of the limitations of software used to manage their data. But there seems to be critical mass among folks who have the capabilities to share information more precisely and robustly than we have before, so we're laying the foundation for the next-generation mechanism of biodiversity data exchange.

These are exciting times for biodiversity data nerds!

smrgeoinfo commented 2 years ago

Here's a UML conceptual diagram of what I'm understanding from this conversation, reduced to the concepts that seem to be central to the use cases.

image

Jegelewicz commented 2 years ago

need to model an independent MaterialSample class that is NOT reduced to (only) be evidence for a species occurrence.

We definitely need this, but I think we have multiple independent task groups working on it now?

https://www.tdwg.org/community/gbwg/enviro/

It really feels like these will overlap significantly if we do take a "giant leap".

dr-shorthair commented 2 years ago

I'm only an occasional DWC observer. I find it interesting to read the perspectives, though I may not fully understand the drivers for the different emphases. However, I do see some tension between the folk who use DWC primarily as a set of tags for data transfer, and the people who would like to conceive it as a model of an information system.

I generally work from the latter perspective. So the sketch by @baskaufs is useful. However, that is still cast in terms of tables, with noise from 'keys' and 'rows', etc. The view from DWC-SW puts it more conceptually. Nevertheless, there is clearly still some confusion.

Just above, @smrgeoinfo proposes a conceptual model that re-frames the discussion in a more general context. I believe this is based on (or at least it is compatible with) the OGC/ISO O&M and W3C SSN/SOSA (which I was involved in developing). This is based on a process-flow-model.

W3C Prov-O formalizes the components of a basic process-flow model:

PROV Starting Point classes

The key idea is to recognise that every Entity - both physical and informational - is the result of an Activity (a process or event) undertaken at a particular time and place, involving particular people and instrument(s), following a particular protocol, and using certain entities and information as inputs. Then the trick is to sort the activities, entities and information into sub-classes that are useful for your application. But in the first place I always find this distinction between Activity and Entity is a critical and very useful basic clarification when getting tangled up in models.

Important activity-types in empirical science are

  1. act-of-observation - the result of which is a piece of information, often a quantity or classification which is a characteristic of the entity-of-interest.
  2. act-of-sampling - the result of which is a special kind of entity, called a sample, which is somehow representative of some larger entity. A sample is expected to be the subject (entity-of-interest) of subsequent acts-of-observation.
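As a rough illustration of the Activity/Entity split and the two activity-types above, here is a minimal Python sketch. The class and function names (`Entity`, `Activity`, `observe`, `sample`, `provenance`) and the example labels are assumptions made up for this sketch; they are not PROV-O or SOSA terms.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """Anything physical or informational; may be the result of an Activity."""
    label: str
    generated_by: Optional["Activity"] = None


@dataclass
class Activity:
    """A process or event that uses some entities as inputs."""
    label: str
    used: List[Entity] = field(default_factory=list)


def observe(entity_of_interest: Entity, result_label: str) -> Entity:
    """Act-of-observation: the result is a piece of information."""
    act = Activity("act-of-observation", used=[entity_of_interest])
    return Entity(result_label, generated_by=act)


def sample(larger_entity: Entity, sample_label: str) -> Entity:
    """Act-of-sampling: the result is a sample of some larger entity."""
    act = Activity("act-of-sampling", used=[larger_entity])
    return Entity(sample_label, generated_by=act)


def provenance(entity: Entity) -> List[str]:
    """Activity labels behind an entity, most recent first.

    Simplification: only the first input of each activity is followed.
    """
    steps = []
    current = entity
    while current.generated_by is not None:
        steps.append(current.generated_by.label)
        if not current.generated_by.used:
            break
        current = current.generated_by.used[0]
    return steps


# Hypothetical example: a sample is taken, then observed.
site = Entity("soil at site L")
soil_sample = sample(site, "soil sample")
result = observe(soil_sample, "taxon classification")
print(provenance(result))
```

The sample here is both the output of one activity and the subject of the next, which is the chaining described in item 2.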

In W3C SSN/SOSA these are modeled in a way that can be interpreted as types of prov:Activity, specifically

Observation-Prov-alignment

Sampling-Prov-alignment

(SOSA calls it a 'feature-of-interest' instead of 'entity-of-interest'.)

There are lots of potential relationships between all of these things, and even more potential pathways that involve two or more steps. A practical information model will realize or implement a convenient subset. Which ones are convenient depends on your needs, in particular which class is central to your application perspective. Collection managers will focus on material samples; biodiversity people will focus on occurrences; ecologists on sites; etc. Each will have a different class at the middle of their star.
The diagrams immediately above put the act-of-observation and act-of-sampling at the centre.

Then the insights from SOSA are to focus on

When samples are involved, then the science-question might also require you to distinguish the proximate and ultimate entity-of-interest. The proximate one is often a sample. In particular, I think the proximate entity-of-interest would be the basis-of-record of the result of the act-of-observation.

The ultimate entity-of-interest may be something else entirely (a population, or an ecosystem, for example). Some scenarios showing both proximate and ultimate versions in the context of both sampling and observations are sketched out here https://www.w3.org/TR/vocab-ssn-ext/#fig-observation-ultimate-foi
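A small Python sketch of the proximate/ultimate distinction, assuming the sampling chain is available as simple "is sample of" links (in the spirit of sosa:isSampleOf). The identifiers and the function name are hypothetical, chosen only for illustration.

```python
from typing import Dict

# Hypothetical "is sample of" links: sample -> the entity it was taken from.
IS_SAMPLE_OF: Dict[str, str] = {
    "dna-extract-1": "soil-sample-1",
    "soil-sample-1": "soil-at-site-L",
}


def ultimate_entity_of_interest(proximate: str, is_sample_of: Dict[str, str]) -> str:
    """Follow the sampling chain upward from the proximate entity-of-interest."""
    seen = {proximate}
    current = proximate
    while current in is_sample_of:
        current = is_sample_of[current]
        if current in seen:
            raise ValueError(f"cycle in sampling chain at {current!r}")
        seen.add(current)
    return current


print(ultimate_entity_of_interest("dna-extract-1", IS_SAMPLE_OF))
```

An observation made on the DNA extract would have the extract as its proximate entity-of-interest, while the function returns what the chain ultimately samples.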


A. The Prov-O Activity-Entity disjuncture matches the fundamental Occurrent-Continuant distinction from BFO, which is the basis of the OBO ontologies, and is called Perdurant-Endurant in some other systems.

B. A small extension allows us to fully align SOSA to OBOE which some of you may be familiar with.

jbstatgen commented 2 years ago

This is an exciting discussion and I am very happy where @Jegelewicz , @baskaufs, @deepreef and everybody else are taking us.

Working on the "Digital and extended Specimen (DES)" concept and an ecological use case, and trying to catch up on the ideas here, I am only joining the exchange now. It is encouraging to see that we are independently attracted to the same/similar solutions.

Below is a diagram summarizing my current understanding. It is based on the discussions within the DES-community, Baskauf & Wells 2016 (@baskaufs Thanks for pointing out the reference!), my subjective highlights of the discussion here and the diagrams added to the wiki by @stanblum .

DarwinSWdiagram_20211013 drawio

A couple of remarks: "Token", "Event" and "Agent/Role" are taken from the recommendations of the RDA/TDWG Attribution Metadata Working Group (https://www.rd-alliance.org/group/rda-tdwg-metadata-standards-attribution-physical-and-digital-collections-stewardship/outcomes), which are based on the W3C-PROV ontology (https://www.w3.org/TR/2013/NOTE-prov-primer-20130430/).

All Planetary and moreover Galactic Beings welcome: I am certainly a named Organism with the taxonomic classification "Homo sapiens" acting in one of the Roles of an Agent here. Thus, the light grey connection looping around on the right side. Thank you all for this step away from a human-centric perspective (-> multispecies ethics). The Agent-box might be better placed between Organism and Token; however, we are here mostly focusing on the relationships surrounding Organisms and Tokens, thus the Agent-box on the outside.

Working on definitions for "Digital Specimen", "Derived Data" and "Associated Data" in the DES context, I originally had equated "Digital Specimen" with "Token". However, a question that Anna Monfils asked for the European Frog Bit use case (cp. https://biss.pensoft.net/article/73814/) showed that it isn't that simple. The question was how to find all the data directly derived from a specific EFB individual, eg. the physical specimen, multiple plant-clippings for tissue samples, DNA-seqs., images, (audio recordings), etc. This is a one-step search within a network; however, it excludes one-step links to data closely associated. Such one-step links of a different "kind/type" are, for example, links to other EFB individuals recorded within the same population, or to other species recorded in vegetation relevés describing the community in which the focal EFB individual was found, etc. Basically, all the data derived directly from the focal individual need to be linked to a shared entity/ID. Often this might be a physical specimen (ie. the focal individual), yet such a physical specimen might not have been collected (cp. also photo and audio/video recording of the same bird that wasn't caught). My solution was/is that a Digital Specimen fundamentally is a "bag of links", very much like the Organism entity discussed here. The EFB use case is the reason why I see Token and Organism as separate entities.
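To make the "bag of links" idea a little more concrete, here is a minimal Python sketch in which every link carries a kind, so a one-step search can separate directly derived data from closely associated data. The link kinds ("derived", "associated"), the identifiers and the function name are assumptions for illustration, not terms from DES or Darwin Core.

```python
from typing import List, NamedTuple, Set


class Link(NamedTuple):
    source: str  # the resource that points at the shared entity
    kind: str    # hypothetical link type: "derived" or "associated"
    target: str  # the shared entity/ID (the "bag")


# Made-up links around one focal EFB individual.
LINKS: List[Link] = [
    Link("physical-specimen", "derived", "efb-individual-1"),
    Link("tissue-clipping", "derived", "efb-individual-1"),
    Link("dna-sequence", "derived", "efb-individual-1"),
    Link("image", "derived", "efb-individual-1"),
    Link("efb-individual-2", "associated", "efb-individual-1"),
    Link("releve-plot-7", "associated", "efb-individual-1"),
]


def one_step(links: List[Link], focal: str, kind: str) -> Set[str]:
    """All resources one step away from the focal entity via links of one kind."""
    return {link.source for link in links if link.target == focal and link.kind == kind}


print(sorted(one_step(LINKS, "efb-individual-1", "derived")))
print(sorted(one_step(LINKS, "efb-individual-1", "associated")))
```

Note that the focal identifier works as the shared entity whether or not a physical specimen was ever collected; the specimen is just one more link in the bag.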

In the context of "MaterialSample" discussed here, the Organism class/entity seems to be the core entity to which star-like all the other resources (Tokens, ie. different MaterialSamples, types) are linked. However, considering what @Jegelewicz asked above, my understanding is that all/most entities can be cores, depending on the question asked and user context. From a Plazi point of view a Publication which is a derived Token for us here, might be the core of their data model and thus basis of their star-diagram. Also in their use case, a core as bag of links might make sense: a publication has a DOI or ISBN etc., however, there are online versions and a host of hardcopy instances in libraries all over the world. There doesn't seem to be The parent hardcopy entity that can be defined as the origin of all other copies, derived data and links to associated information.

A Token in my concept is an abstract entity standing in for all the "types" that can provide evidence about an organism (observations, LivingSpecimens, seedbank lot, ...), an event, an identification, an agent, etc.

Tokens understood as evidence are closely related to the question of how reliable (cp. RA Fisher's statistical concept) this evidence is. At least at first sight, a physical specimen curated in a collection seems to provide more support/have more weight as evidence than a gal chatting you up in a bar (you might find out later that she is the world expert on a group of Peridinea and the anecdote she told you is quite solid). Traditional (ecological) Knowledge about species occurrences, habitats or distributions might never be recorded by physical specimens, publications, etc., however, being the accumulated (oral) knowledge and history over generations it might be solid scientific information and evidence. Also, in the point-line-intercept example provided by @albenson-usgs above, that data might not be easily reproducible (another boot trip to the GPS-coordinates soon after?), however the data are far from an anecdotal tourist snapshot. These digital-only, maybe also digitally-born datasets were consciously designed and follow standardized procedures of recording, identification, assessment, etc. On the other hand, a physical specimen with incorrect location information or identification, is unreliable and introduces error into an analysis.

The "types" certainly provide a general idea about reliability. A further step is to provide users with default options, e.g. "fields" with which they can explicitly record their assessments or knowledge of reliability for all Tokens (several assessments per Token, e.g. by different users, are possible). I understand support or reliability as Bayesian priors (independent of how they are recorded by users, e.g. for them it might be a choice of red, yellow or green). These Bayesian priors can be the result of calculations in hierarchical Bayesian models. An example is to calculate the support provided by a Token based on the number of links the Token has in the "digital extended network". This is similar to or leads to Bayesian network approaches (e.g. https://en.wikipedia.org/wiki/Bayesian_network) for calculating posterior probabilities/likelihood support.

W3C PROV was also just introduced by @dr-shorthair while I was writing this contribution. I am looking forward to our discussion tonight.

jbstatgen commented 2 years ago

The most powerful solution would be to have ResourceRelationship as the primary core, serving as a "fat" triple-store of relationships among records (i.e., the ultimate join table), and then have each of the other DwC classes represented as extensions (effectively flattened cached values for the properties in each class where parity of property value to object is 1:1). This would get us a step closer to a future world where everything is serialized as RDF (I'm just term-dropping here -- I don't really understand what I'm saying). However, this is probably a bridge too far at this stage in the history of DwC and IPT; so it might be asking too much to make that leap right now (although if anyone is interested in exploring it, I'll gladly generate some sample datasets in this form).

@deepreef Would you generate an example of such a dataset in RDF? I think I have a (vague) idea what this will look like, though I might be completely off.

cboelling commented 2 years ago

Trying to put the many facets of this discussion into perspective. Here is my current understanding of the conceptual concerns - which I find helpful to separate from implementation and pragmatic issues.

I understand that the main intention and use case for dwc:basisOfRecord is to act as one element in a structure for representing the state of affairs expressed by the following two sentences:

(1) One or more individuals of taxon X were present (somewhere) within geographical location L (at some time) during time interval T.

(2) Assertion (1) is based* on an artefact that, at least in part, is derived from or depends on said individuals of taxon X.

*possibly among other things

Sentence (1) describes (and asserts) an occurrence. The belief that what sentence (1) asserts for a given X, L, and T is actually the case is, in the context of biodiversity science in general and natural history collections in particular, based on inference chains that can be extremely varied and that can, as one important ingredient, have different kinds of artefacts as their starting points.

Examples of such kinds of artefacts are preserved biological specimens, photographs of biological specimens (taken at a particular place and time), or textual records of an observation, among many other categories that are useful to distinguish.

The term "derived from" in (2) is used to denote the relationship, however involved, between a biological organism that participated in the occurrence and an artefact that retains some of its physical substance, possibly altered by physical or chemical processes. Here are some examples. "at (L,T)" is a placeholder to denote a particular location and time interval in each case.

| organism participating in occurrence | artefact |
| --- | --- |
| a beetle crawling through the Namib Desert at (L,T) | its sun-dried carcass collected from the sand some time later and put into a collection |
| a fish swimming in Lake Michigan at (L,T) | a flash-frozen tissue sample taken from it before releasing the fish again, stored in a tissue collection |
| a fish swimming in a certain location in the Baltic Sea at (L,T) | its preparation in ethanol in a jar in a collection |
| a bird inhabiting the area now known as "Grube Messel" at (L,T) | the part of its fossil remains derived from its skull, 48 million years later |
| a worm in a certain portion of soil at (L,T) | the nucleic acid extract from a soil sample taken at location L containing some of its DNA |

The term "depends on" in (2) is used to denote the relationship, however involved, between a biological organism that participated in the occurrence and an artefact that existentially depends on it - without necessarily retaining any of the physical substance of that organism. Here are some examples:

| organism participating in occurrence | artefact |
| --- | --- |
| a rhinoceros at (L,T) | a photograph of that rhinoceros taken at (L,T) by a human observer |
| a nightingale at (L,T) | the sound recording produced by an unmanned monitoring station in the vicinity of L for the entire night containing a recording of its singing from 11:00 pm to 11:10 pm |
| a coral of type X at (L,T) | a written note stating that a coral of type X was present at (L,T) |
| a gall wasp at (L,T) | the dried-up gall that was inflicted on the host plant by the wasp, which is added to a collection |

A "derived from" relation in fact implies a "depends on" relation, but I think it is useful to distinguish the two because physical vouchers of biological organisms are the hallmark of natural history collections.
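The implication between the two relations can be sketched in a few lines of Python. The `IMPLIES` table and the function name are assumptions for illustration; only the two relation labels come from the text.

```python
from typing import Dict, Set

# Assumed encoding: each relation maps to the relations it directly implies.
IMPLIES: Dict[str, Set[str]] = {
    "derived from": {"depends on"},
}


def entailed_relations(relation: str) -> Set[str]:
    """Return the relation itself plus everything it implies, transitively."""
    result = {relation}
    stack = [relation]
    while stack:
        for implied in IMPLIES.get(stack.pop(), set()):
            if implied not in result:
                result.add(implied)
                stack.append(implied)
    return result


print(sorted(entailed_relations("derived from")))
print(sorted(entailed_relations("depends on")))
```

So a tissue sample recorded as "derived from" an organism would also answer a query for "depends on", while a photograph recorded as "depends on" would not show up among physical vouchers.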

Categorizing a (naturally) dried gall as an artefact may seem odd, but for the purpose at hand I find it helpful (and justified) to think of artefacts as anything that has been subject to human intervention - even minimal interventions like picking up the dried gall and putting it in a collection.

Clearly, knowing the particular artefact (and what kind of thing it is, e.g., preserved specimen, photograph, field book entry) and the characteristics of the inference chain that results in asserting an occurrence is desirable as it is informative for determining how reliable that assertion is, and to draw all sorts of interesting conclusions about the occurrence itself, how its representation in an information system (such as GBIF) came to be, and to provide information on the artefacts involved, e.g., where to find them or digital representations of them.

Oftentimes, only a shorthand categorization of the artefact (e.g., that it is a preserved specimen, a photograph, a text document) is needed, rather than every detail of its relationship with the organism that forms part of the occurrence in question. This is so, because the chains of inference for artefact/occurrence pairs for a given category of artefact, while they will differ in their details, are similar in those characteristics that matter to users. That is, knowing that the artefact is a photograph rather than a preserved specimen gives a big chunk of the information that is relevant for a number of use cases.

This then, I understand, is the intended function of dwc:basisOfRecord: to provide a shorthand categorization of the artefact that is used to infer the happening of a particular occurrence by virtue of the actual relationship of that artefact with an organism that participated in the occurrence. Another way to put this is that dwc:basisOfRecord captures, in a shorthand way, the type of evidence for an occurrence.

Specifically, in such a representation dwc:basisOfRecord would be used to link a pointer to an occurrence (such as an occurrence ID or a description like (1)) to a pointer denoting a category of artefacts. The meaning of this linkage would be: My (the representation author's) belief that the stated occurrence happened is based on an artefact of the stated type.

If this is the intended role of dwc:basisOfRecord then this could be captured by defining:

dwc:basisOfRecord =def: A property the values of which indicate the type of artefact that the assertion of an occurrence is based on.
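Read this way, a basisOfRecord value can be turned back into the statement it stands for. The Python sketch below assumes a record is a plain dict with `occurrenceID` and `basisOfRecord` keys; which of the existing Darwin Core example values count as artefact types is also an assumption of this sketch, as are the function name and the sample identifiers.

```python
# Assumption: the subset of current basisOfRecord example values that could be
# read as categories of artefact under the proposed definition.
ARTEFACT_TYPES = {
    "PreservedSpecimen",
    "FossilSpecimen",
    "LivingSpecimen",
    "MaterialSample",
    "HumanObservation",
    "MachineObservation",
    "MaterialCitation",
}


def evidence_statement(record: dict) -> str:
    """Spell out what a basisOfRecord value asserts under the proposed definition."""
    basis = record.get("basisOfRecord")
    if basis not in ARTEFACT_TYPES:
        raise ValueError(f"{basis!r} does not denote a category of artefact")
    return (
        f"The assertion of occurrence {record['occurrenceID']} "
        f"is based on an artefact of type {basis}."
    )


# Hypothetical record for illustration.
print(evidence_statement({"occurrenceID": "occ:1", "basisOfRecord": "PreservedSpecimen"}))
```

Under this narrower reading, values such as Taxon or Event would be rejected, since they do not name a kind of artefact that an occurrence assertion could be based on.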

Some of the ensuing questions leading into the implementation and pragmatic concerns, and that have been raised in this thread, would perhaps be:

  1. Is the current practice of data sharing via Darwin Core Archives and the IPT tool doing justice to this intended meaning of dwc:basisOfRecord? If not, what needs to change?
  2. Should dwc:basisOfRecord be deprecated in favour of another structure that can capture these semantics in a better way (whatever constitutes this being better)?
  3. Is the shorthand categorization that can be provided by dwc:basisOfRecord as a single attribute with a single value that denotes a category of artefacts adequate or is a more elaborate structure required that enables capturing properties of the artefact and characteristics of its linkage with the occurrence in more detail?
  4. Which vocabulary for denoting categories of artefacts should be used?

wouteraddink commented 2 years ago

@jbstatgen regarding the digital specimen in the figure: this could be any tangible curated object, not only biological objects. Tangible because you can look at it, feel it etc., in contrast with an observation, which could include a tangible object like a photo but may be just text. In the case of a biological material sample, the object could include a number of individuals of different or the same species, one individual or part of an individual. Biological material samples could also be tangible objects that are not species material but can be linked to a species, like bird nests, spore prints, footprints.

deepreef commented 2 years ago

@deepreef Would you generate an example of such a dataset in RDF? I think I have a (vague) idea what this will look like, though I might be completely off.

Sure, you could generate the whole thing in RDF, which I imagine would be the ultimate solution. However, I was thinking of an approach that's one step shy of that, which still keeps us in the realm of "relational" thinking with instances of classes that have properties (i.e., maintains the "tables and fields" approach to modelling the data). But I think of ResourceRelationship as a "fat" triple-store (I guess it would be sort of an octuple-store). But the idea would be to take all of the inter-class and intra-class relationships and represent them as instances of ResourceRelationship, as the "core" in a star schema representation, with all the other classes as extensions.
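Pending a real example dataset, here is a very rough Python sketch of the idea, using only three of the ResourceRelationship terms (dwc:resourceID, dwc:relationshipOfResource, dwc:relatedResourceID) as a subject/predicate/object core. The identifiers, the relationship labels and the helper names are made up for illustration, and the extension tables for the other classes are left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class ResourceRelationship:
    """One row of the hypothetical core table."""
    resource_id: str               # cf. dwc:resourceID (subject)
    relationship_of_resource: str  # cf. dwc:relationshipOfResource (predicate)
    related_resource_id: str       # cf. dwc:relatedResourceID (object)


# Made-up rows: two MaterialSamples, one Occurrence, one Identification.
CORE = [
    ResourceRelationship("ms:1", "evidence for", "occ:1"),
    ResourceRelationship("ms:1", "evidence for", "id:1"),
    ResourceRelationship("ms:2", "evidence for", "occ:1"),
    ResourceRelationship("ms:2", "derived from", "ms:1"),  # intra-class link
]

Index = Dict[Tuple[str, str], Set[str]]


def index_by_subject(rows: Iterable[ResourceRelationship]) -> Index:
    """(subject, relationship) -> set of related resources."""
    index: Index = defaultdict(set)
    for row in rows:
        index[(row.resource_id, row.relationship_of_resource)].add(row.related_resource_id)
    return index


def index_by_object(rows: Iterable[ResourceRelationship]) -> Index:
    """(object, relationship) -> set of resources pointing at it."""
    index: Index = defaultdict(set)
    for row in rows:
        index[(row.related_resource_id, row.relationship_of_resource)].add(row.resource_id)
    return index


print(sorted(index_by_subject(CORE)[("ms:1", "evidence for")]))
print(sorted(index_by_object(CORE)[("occ:1", "evidence for")]))
```

The same flat table carries one-to-many (ms:1 to occ:1 and id:1), many-to-one (ms:1 and ms:2 to occ:1) and intra-class (ms:2 to ms:1) relationships without any change of structure.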

I've been flat-out overcommitted in recent months/weeks, but as soon as I catch a breather, I'll put together an example dataset.

Jegelewicz commented 2 years ago

Task Group has decided that BasisOfRecord is leading us away from our goal and that this issue should be closed.