tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

Change term - MaterialSample #314

Closed Jegelewicz closed 1 year ago

Jegelewicz commented 3 years ago

Change term

From https://dwc.tdwg.org/terms/#materialsample

MaterialSample info
Definition A physical result of a sampling (or subsampling) event. In biological collections, the material sample is typically collected, and either preserved or destructively processed.
Examples A whole organism preserved in a collection. A part of an organism isolated for some purpose. A soil sample. A marine microbial sample.

From https://dwc.tdwg.org/terms/#preservedspecimen

PreservedSpecimen info
Definition A specimen that has been preserved.
Comments  
Examples A plant on an herbarium sheet. A cataloged lot of fish in a jar.

Given the above, we propose that MaterialSample should be more specific to something less than what might be considered a "voucher" in order to delineate it from PreservedSpecimen.

Proposed new attributes of the term:

Note: all of the above is my interpretation of the Arctos Working Group conversation.

mjy commented 3 years ago

Second side note, I agree this is where the problem rears:

I believe that one important source of problems comes from the requirement to shoehorn all these things (MaterialSample, PreservedSpecimen, HumanObservation, etc) into an Occurrence to enable publication in GBIF.

As has been noted elsewhere, the other problem is "what is an identifier"? When we multiply these two problems, the combinatorics of the issues (A = shoehorn X B = what kind of identifier) get ugly.

One thing that might, maybe, help, is to have an ontology of Identifier types, and to indicate what identifier type your identifier is. We do this in TaxonWorks, but we don't have the field in DwC to express this (I think). This would allow aggregators or others to act accordingly. Is this a "global" identifier, then I can build some functionality with certain assumptions. Is this a "local" identifier, well I better be more cautious. Is this a "physical" identifier (e.g. paper label on a specimen), I can infer some other things? Is this a "digital" identifier, oh, then I shouldn't look for it on a piece of paper (or should I?). If this is developed, we might mitigate, slightly, the issues at hand?
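A minimal sketch of what a typed identifier might look like, assuming the four categories named above. The names `IdentifierType`, `TypedIdentifier`, and `handling_hint` are hypothetical; they are not Darwin Core terms and not the TaxonWorks implementation, and the hints are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum


class IdentifierType(Enum):
    """Hypothetical identifier categories, taken from the examples in the comment."""
    GLOBAL = "global"
    LOCAL = "local"
    PHYSICAL = "physical"
    DIGITAL = "digital"


@dataclass(frozen=True)
class TypedIdentifier:
    value: str
    kind: IdentifierType


def handling_hint(identifier: TypedIdentifier) -> str:
    """Return an illustrative hint for how a consumer might treat the identifier."""
    hints = {
        IdentifierType.GLOBAL: "assume global uniqueness; safe to merge across datasets",
        IdentifierType.LOCAL: "unique only within its source; qualify before merging",
        IdentifierType.PHYSICAL: "expect it on a label attached to the object",
        IdentifierType.DIGITAL: "exists only in a data system; do not expect a label",
    }
    return hints[identifier.kind]
```

In practice the categories would probably not be mutually exclusive (an identifier can be both global and digital), so a real vocabulary might need more than one axis.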

baskaufs commented 3 years ago

Wow. I had a busy day yesterday and could tell from the flood of notifications that this conversation was going on. I have many comments/responses to what has been said in this thread, but don't have time to write them all given that this is the start of another work day for me that is mostly unrelated to TDWG business. But I will record a few thoughts.

Despite the disagreement over minutiae in this thread, it is really exciting to me that there is such a high level of agreement on how people are viewing the relationships among biodiversity-related entities. To put this in historical perspective, my interest in this topic dates back to 2010, when I published this paper making the case (somewhat heretical at the time) that organisms should have a place in the biodiversity knowledge graph and that we should be linking derived resources to them. Soon after that, we had the marathon tdwg-content email thread summarized here that reminds me a lot of this conversation. Fortunately, this conversation will be better documented, thanks to GitHub.

That conversation had three main consequences:

  1. The establishment of the dwc:Organism class.
  2. Deprecation of the dwctype: namespace and revising the previous (somewhat circular) class definitions into their current definitions (as of the 2014-12-23 version of the dwc: Darwin Core term list).
  3. Creation of Darwin-SW as an attempt to establish a graph model as a concrete expression of the relationships that had been hammered out in that discussion and incorporated into the new class definitions. (As Rich noted, the original manifestation was his cool sort of ASCIIgram in this email.)

I feel that the Darwin-SW graph model is a pretty good starting point at representing the relationships between classes. I've found it to be pretty congruent with other models as diverse as the 1993 ASC model and the ABCD ontology model. Two features that Darwin-SW includes that aren't as explicitly included in those models are generic, transitive hasDerivative/derivedFrom properties and hasEvidence/evidenceFor properties. That is to a large extent what is at the core of this current discussion.

If we want this discussion to result in real action towards solving the problems raised here, we need someone with sufficient time, commitment, organizational skills, and stamina to scope and convene a task group, badger important stakeholders to participate, organize regular calls, and keep good records. Without that, this will just be another stimulating conversation. I don't have the bandwidth right now to be that person, but I would happily participate as a core member.

I recall a session at the TDWG 2013 conference (I think it was the report-out of the VoMaG group) where the topic of creating a "hasEvidence" property came up. I really thought there might have been the impetus for it to happen then, but it fizzled out. At that point, I don't think that people were taking seriously the idea of Linked Data as a real thing and I think most people thought that the existing system worked well enough to handle data about preserved specimens, which were clearly at the center of the biodiversity informatics universe at that time. However, since that time there have been a number of serious attempts to link data (if not actually to use "Linked Data", i.e. RDF). There has also been a proliferation of derived resources (tissues, DNA, sequences), camera trap images, machine observations, iNaturalist and eBird observations, etc. that have made it clear that museum specimens do not have to be the center of the biodiversity universe. So I think the time may be ripe to try again for a Darwin Core "hasEvidence" or "isEvidenceFor" property as well as some more standardized way to indicate the relationship among derived resources.

Jegelewicz commented 3 years ago

If we want this discussion to result in real action towards solving the problems raised here, we need someone with sufficient time, commitment, organizational skills, and stamina to scope and convene a task group, badger important stakeholders to participate, organize regular calls, and keep good records. Without that, this will just be another stimulating conversation.

This. I don't think I have the bandwidth right now either, but I would definitely participate and help as much as my bandwidth allows.

Jegelewicz commented 3 years ago

One thing that I feel is missing from all of this is - what gets a catalog number? We struggle with this every day as some of the comments above demonstrate. It is this that made me bring up the issue in the first place. Collection managers and curators have been numbering stuff for centuries, but do those numbers function for the purposes of today's science? If not, what schema would be better? Philosophical discussions are indeed stimulating, but we also need concrete methods for others to follow. Let's not lose sight of that!

baskaufs commented 3 years ago

Before I put this aside for the day, I wanted to make an additional comment about the mechanism for documenting the relationships between resources. I think that the current "fixes": associatedMedia, associatedOccurrences, associatedWhatever are all just Band-Aids that we are using to fix a gaping wound. We are forced to use them because we are stuck with cramming normalized relationships into flat spreadsheets due to the limitations of the star schema system required by DwC-Archives.

I see the current efforts to "fix" the ResourceRelationship class as going a long way towards correcting this deficiency if we can figure out how to use it effectively and in a standardized way. One reason why I'm excited about this "fix" is that it seems possible to define a process by which ResourceRelationship spreadsheet data could be transformed into bona fide Linked Data (preferably in JSON-LD) that could then be pushed into a triple store and queried in an efficient way. That is, of course, contingent on people actually being able to mint and track IRI identifiers for things, which is a difficult nut to crack.

Given the existence of actual Linked Data (i.e. RDF) representations of the ResourceRelationship relationships, it would then be possible to perform a "dumb-down" operation that would replace the ResourceRelationship instances with a single linking property that would directly connect the resources involved. A model of this kind of process can be found in the SKOS model for handling labels. Section 4.3 of the SKOS Primer describes a process by which label instances (described in SKOS-XL) that can have their own provenance and metadata can be transformed into simple SKOS property links (skos:prefLabel, skos:altLabel, skos:hiddenLabel) that can be used to directly link concepts to their labels. (You can see this process in action by examining any of the Getty Thesaurus of Geographic Names item RDF dumps, for example this one.) The analogous situation for us would be to collapse ResourceRelationship instances with their own provenance and metadata into simple linking properties like isDerivedFrom or hasDerivative. As @camwebb and I describe in Section 3.3.2 of our paper, it then becomes a trivial query to discover all derived resources using the * SPARQL property path operator, if the hasDerivative property is transitive.
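The "dumb-down" step and the transitive lookup could be sketched roughly as follows, in plain Python rather than RDF/SPARQL. The row layout, the resource names, and the function names `dumb_down` and `all_derivatives` are all assumptions made for illustration; nothing here is prescribed by Darwin Core or Darwin-SW.

```python
from collections import defaultdict

# Hypothetical ResourceRelationship rows: (relationshipID, subject, predicate, object).
relationships = [
    ("rr-1", "skin-1", "isDerivedFrom", "bird-1"),
    ("rr-2", "tissue-1", "isDerivedFrom", "bird-1"),
    ("rr-3", "dna-1", "isDerivedFrom", "tissue-1"),
    ("rr-4", "seq-1", "isDerivedFrom", "dna-1"),
]


def dumb_down(rows):
    """Drop each relationship's own identity and keep one direct link per row."""
    return {(subject, predicate, obj) for _, subject, predicate, obj in rows}


def all_derivatives(triples, resource):
    """Follow hasDerivative links (the inverse of isDerivedFrom) transitively."""
    children = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "isDerivedFrom":
            children[obj].add(subject)
    found, stack = set(), [resource]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found
```

With these rows, `all_derivatives(dumb_down(relationships), "bird-1")` returns the skin, the tissue, the DNA, and the sequence. Note that this traversal excludes the starting resource, which corresponds to the `+` (one or more) property path in SPARQL 1.1; the `*` operator also matches zero steps and so would include the starting resource itself.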

Jegelewicz commented 3 years ago

@deepreef will you be sharing your thoughts and processes?

Over the next 4 months, I will be updating the core data model behind our collections data, and one of the specific issues that our CMs need to "fix" is the way we track physical objects in our collections -- i.e., as instances of MaterialSample.

Because we are all struggling with this....

mjy commented 3 years ago

@Jegelewicz I would be interested in aligning with your process but for TaxonWorks. We'll be extending as well (FieldOccurrence), and we just added Extract classes. Maybe a simple toy ontology of classes would help in this regard.

@baskaufs

That is, of course, contingent on people actually being able to mint and track IRI identifiers for things, which is a difficult nut to crack.

It might be getting closer. I think we can handle much of this in TaxonWorks (though I have some doubts about the reification process), because we can stack as many identifiers as needed, including the requisite UUIDs, on our instances. If Arctos is going through the same machinations we might have targets from multiple real applications to play with (very) soon.

baskaufs commented 3 years ago

@mjy Cool! Real data and real applications are always good.

On the subject of UUIDs, in the imaginary process I described of turning ResourceRelationship relationships into bona fide Linked Data, it would not necessarily be required that the identifiers used for the resource relationship IDs be HTTP IRIs. If they were UUIDs, as a part of the mapping/transformation process one could just slap "urn:uuid:" in front of them in accordance with RFC 4122 and voila! they would be valid to use in RDF triples. They would not be dereferenceable, but who cares? In the process I described, they would just be dumped into a graph database for querying and not really exposed to the web anyway. That should make @deepreef happy, since he has traditionally had issues with requiring (potentially non-persistent) HTTP IRIs as globally unique identifiers.
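The "slap urn:uuid: in front" step can be sketched with Python's standard `uuid` module, which validates the string and exposes the URN form through the `urn` attribute. The helper name `to_urn` is hypothetical, and the sample value in the usage note is only an example.

```python
import uuid


def to_urn(raw: str) -> str:
    """Validate a UUID string and return it as a urn:uuid: IRI.

    Raises ValueError if the input is not a well-formed UUID.
    """
    return uuid.UUID(raw).urn
```

For example, `to_urn("F81D4FAE-7DEC-11D0-A765-00A0C91E6BF6")` returns `"urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6"`. Parsing before prefixing also normalizes case, which matters if the resulting IRIs are compared as plain strings in a triple store.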

wouteraddink commented 3 years ago

It is interesting to observe that people en masse try to provide their specimens data through a standard for occurrences (DwC) while TDWG actually has a standard for biological collections data (ABCD). Not that this would solve all issues though. As @dagendresen mentioned already, many problems come from the requirement to shoehorn all things (MaterialSample, PreservedSpecimen, HumanObservation, etc) into an Occurrence. The Digital Extended Specimen concept and openDS specification under construction seem to provide an easy solution to some of these problems by having different objects for e.g. specimens, multimedia, measurements, identifications, gathering events, each with their own PID and link these. This separation in classes seems also the direction GBIF want to take in the next few years, and has its roots in the earlier idea to create a TDWG ontology and the vision of Donald Hobern.

mjy commented 3 years ago

Having a standard and implementing the standard are two very different things (that need to come much closer together IMO). For example, nowhere do I see ABCD here https://www.gbif.org/dataset-classes, why? In some ways raising ABCD and future standards together illustrates exactly how the frustrations here emerged, I suspect. If ABCD addresses needs why is it not more ubiquitous? Unless the proposed standards that are upcoming work much closer with the development of the applications/APIs that will use them I see nothing but similar problems coming with them as well.

deepreef commented 3 years ago

@baskaufs : THANK YOU for jumping in! I went a bit nuts yesterday and got too frothy in the mouth with my evangelism, but you very nicely brought it back to a practical trajectory (as you always do!)

If we want this discussion to result in real action towards solving the problems raised here, we need someone with sufficient time, commitment, organizational skills, and stamina to scope and convene a task group, badger important stakeholders to participate, organize regular calls, and keep good records. Without that, this will just be another stimulating conversation. I don't have the bandwidth right now to be that person, but I would happily participate as a core member.

Yeah, same here. Obviously, count me among the enthusiastic participants. I may be a little too close to things to take on the prime role. Besides, I am woefully inadequate in the "organizational skills" department. However, if no one else steps up to lead this effort, I would be willing to take it on, starting in a month or so from now.

@Jegelewicz :

will you be sharing your thoughts and processes?

Yes, absolutely! Is this the right place to do it? Most of the "heat" centers on MaterialSample, so there would be some logic to continuing this discussion under the banner of this issue. Or, perhaps if a task group comes to fruition, that would be the better forum of discussion.

One thing that I feel is missing from all of this is - what gets a catalog number?

Yup, we're struggling with this too. From the perspective of most of our CMs, it's not a "thing" without some human-friendly number slapped on to it. Long ago I came to realize that a catalog number should be treated just like any other property of the "thing", not the "thing" itself. For all kinds of reasons, catalog numbers make for bad primary keys on data tables, and even worse as persistent identifiers. In the case of specimens, they make a lot of sense as useful tags because there's no other easy way to refer to a specimen object semi-uniquely in a human friendly way (e.g., "The fish identified as Aus bus collected by John Smith in the Maldives in October of 1975"). (As an aside, this is analogous to what scientific names of organisms were like before Linnaeus came along and gave us a much more convenient/consistent system of labelling taxa.) So it's not that I think catalog numbers are a "bad" thing -- I think they're great! I just think their utility as unique identifiers is limited, and we shouldn't slap them on things "just because". But this is one of the areas we'll be exploring in the coming months as we forge ahead with our data remodelling effort.

Jegelewicz commented 3 years ago

before Linnaeus came along and gave us a much more convenient/consistent system of labelling taxa

ROFL. Do you work with taxonomy?

I just think their utility as unique identifiers is limited, and we shouldn't slap them on things "just because"

I sort of agree, but sometimes it is the thing exposed "just because" someone slapped a catalog number on it that leads to really interesting research...

deepreef commented 3 years ago

That should make @deepreef happy, since he has traditionally had issues with requiring (potentially non-persistent) HTTP IRIs as globally unique identifiers.

Yeah.... so, if you think my posts on MaterialSample are too long, you don't want to get me started on identifiers...

But yeah -- as much as I understand and sympathize with the TBL LOD idea of committing to HTTP IRIs as the common identifier (largely because they are inherently "actionable"), the fundamental concern I have is that they combine dereferencing metadata and identification in the same string. There are lots of reasons why this is (or at least often can be) a "fragile" state of affairs. I won't dive into this here, but if anyone is interested, most of what I wrote here still represents my current thinking.

In any case, I agree with @mjy and @baskaufs (and others) that identifiers are lurking behind these discussions, because they represent the proxies of the conceptual objects we're deliberating here. The first and most important step in minting an identifier for something, is understanding what that "something" actually "is". I think @mjy nailed it with his earlier post about the need to be careful about deprecating classes rather than "changing" their meaning. I think the fundamental problem we have with DwC is that we don't have a clear enough understanding of what each of the main classes means to even know if we're changing them. So perhaps the first step is to lock down more robust definitions. Occurrence is arguably the most important class in DwC, yet its current definition hinges on the definition of Organism, and per the cougar example above, we're not clear on whether the trace blood collected downstream from where the cougar ate its fishy lunch constitutes part of the organism, or merely evidence of the organism (and, hence, we're not sure how many Occurrences we need to mint to capture the information we want to capture).

deepreef commented 3 years ago

ROFL. Do you work with taxonomy?

OK, I guess I set myself up for that one! :-) But, to be fair, before Linnaeus came up with his system, the "names" that naturalists used for taxa were along the lines of:

and

If you think modern taxonomy/nomenclature is difficult to capture in information systems, imagine trying to keep track of taxa in a structured way if you had to use names like those instead of genera and species. As complex as it is, the fact that the same system of scientific nomenclature has endured for more than a quarter of a millennium has to say something about its utility...

Jegelewicz commented 3 years ago

It actually isn't the system that is the problem, it is that we don't document anything well enough.....

deepreef commented 3 years ago

It actually isn't the system that is the problem, it is that we don't document anything well enough.....

YES!! VERY well said!!!

deepreef commented 3 years ago

BTW, in case anyone thinks I made up those pre-Linnean names, in fact they were both on the same page of the same publication.

deepreef commented 3 years ago

@wouteraddink :

The Digital Extended Specimen concept and openDS specification under construction seem to provide an easy solution to some of these problems by having different objects for e.g. specimens, multimedia, measurements, identifications, gathering events, each with their own PID and link these.

Is the DiSSCo GitHub the best place to participate in that discussion? Or is there another forum or email list or something where the main discussion is happening?

wouteraddink commented 3 years ago

@deepreef yes, on https://github.com/DiSSCo/openDS for participation in openDS discussion (still in early development). At the core of openDS is MIDS (minimum information about a digital specimen, being discussed here: https://github.com/tdwg/mids), and the Digital Extended Specimen concept (convergence between digital and extended specimen concepts) has been discussed in the global consultation: https://discourse.gbif.org/t/converging-digital-specimens-and-extended-specimens-towards-a-global-specification-for-data-integration/2394 and is also discussed in regular meetings organised by BCON with participation of DiSSCo, iDigBio, GBIF.

thomasstjerne commented 3 years ago
  1. Related to this, was the whole bird in the freezer an instance of MaterialSample, serving as a "parent" of the three derived MaterialSample instances (Skin, Tissue, Skeleton)? (perhaps suggesting the need for a new term parentMaterialSampleID?)

parentMaterialSampleID has been suggested to GBIF for both the splitting of, say, a bird (skin, bones, etc..) but also for subdivision of environmental samples (soil, water, gut content).

RogerBurkhalter commented 3 years ago

From a CM point of view....Within my CMS I deal with clusters of fossils on a rock slab and other forms of multiple specimens on or attached to a single "holder" (microscope or micropaleo slides, i.e. forams, ostracods, conodonts, etc.). They are "parent" objects that receive a UUID but no catalog number. The children are catalogued specimens (each with UUIDs). The parent is a "loanable object"; you cannot loan a single specimen from the parent without all of its children. The "slab of rock" has a UUID because it has its own characteristics that can be recorded in ABCD-EFG extensions (geochemical, physical properties) which relate back to each child. It gets complex; this is a huge rabbit hole. That I also record derivative specimens (coal ball peels, serial thin sections through a coral) in similar ways is what I am working on now. These are also parent/child relationships similar to other derivatives (histological or skeletal preps), but probably need a separate use case, i.e. they can be loaned separate from the parent.

deepreef commented 3 years ago

Thanks, @RogerBurkhalter -- we have very similar situations (both parent aggregate instances of MaterialSample, and child derived instances).

Perhaps it's time to submit a new issue proposing a new term parentMaterialSampleID within the MaterialSample class?

Edit: Note: the link provided by @thomasstjerne to the discussion on GBIF, where @timrobertson100 suggests proposing this term within DwC (which I strongly support, and will submit unless someone else would prefer to submit it).

deepreef commented 3 years ago

Since we're talking about MaterialSample as a hierarchy, it feels like the right time to toss another "grenade" (firecracker?) into this discussion. OK, that's overly dramatic: more like a practical question to see how others deal with the problem I'm about to describe.

Some of our collections assign catalog numbers to whole specimens 1:1 (one number for one specimen), whereas others assign catalog numbers to "lots". With a hierarchical MaterialSample, this is pretty easy to deal with, because the multiple specimens in a lot (each representing a separate instance of MaterialSample) can link (via parentMaterialSampleID) to another instance of MaterialSample that represents the lot. The catalog number can be attached to the "lot" instance, and the specimens then inherit the catalog number.
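A minimal sketch of the inheritance described above, assuming each MaterialSample record carries an optional parent link and an optional catalog number. `parent_material_sample_id` stands in for the proposed (not yet adopted) parentMaterialSampleID term; the class, the function `effective_catalog_number`, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MaterialSample:
    material_sample_id: str
    parent_material_sample_id: Optional[str] = None  # proposed term, not yet in DwC
    catalog_number: Optional[str] = None


def effective_catalog_number(sample_id: str,
                             samples: Dict[str, MaterialSample]) -> Optional[str]:
    """Walk up the parent chain until a catalog number is found."""
    seen = set()  # guards against accidental cycles in the parent links
    current = samples.get(sample_id)
    while current is not None and current.material_sample_id not in seen:
        if current.catalog_number is not None:
            return current.catalog_number
        seen.add(current.material_sample_id)
        current = samples.get(current.parent_material_sample_id)
    return None


# A lot holding one fish: the number lives on the lot, and the fish inherits it.
lot = MaterialSample("lot-1", catalog_number="FISH 12345")
fish = MaterialSample("fish-1", parent_material_sample_id="lot-1")
samples = {s.material_sample_id: s for s in (lot, fish)}
print(effective_catalog_number("fish-1", samples))  # → FISH 12345
```

The same lookup works whether the number sits on the specimen itself or on a parent lot, so a consumer would not need to know which of the two cataloging conventions a collection follows.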

Where things get weird (for me, at least) is how to deal with all the lots in our lot-based collections that consist of only a single whole specimen. Specifically, should we assign the catalog number to instances of MaterialSample where there is only one specimen in a lot to the "specimen", and assign the catalog number to the lot in cases of multi-specimen lots? Or, do we normalize on assigning catalog numbers to "lots", and generate two MaterialSample instances for each single-specimen lot (one representing the lot, and another representing the single child specimen)?

Of course, this question presumes that we assign some sort of "type" to each MaterialSample instance (perhaps we need to propose another term for materialSampleType?) We do this, but maybe that's an artificial classification that isn't really needed. If we don't have a materialSampleType property, then we obviously would not want to generate two separate MaterialSample instances. However, I have to believe that people will want to be able to distinguish "lots" from "whole organisms" from "organism parts", from "tissue samples" (etc.). Or, maybe that information is best captured in preparations?

My question to those following this thread/issue is: How do you deal with MaterialSample instances when you have lot-based collections, in terms of managing lots consisting of a single specimen?

I hope this makes at least some sense...

Note: I see that @timrobertson100 has encouraged me to propose parentMaterialSampleID (which I will do tomorrow unless someone else wants to). Would this group also support proposing materialSampleType?

campmlc commented 3 years ago

We have both of these scenarios in our collections: "should we assign the catalog number to instances of MaterialSample where there is only one specimen in a lot to the "specimen", and assign the catalog number to the lot in cases of multi-specimen lots? Or, do we normalize on assigning catalog numbers to "lots", and generate two MaterialSample instances for each single-specimen lot (one representing the lot, and another representing the single child specimen)?"

In our fish collection, the catalog number is assigned to the lot, and each fish within the lot is a part (= MaterialSample?) of the lot. It can be difficult to then track child samples (= tissues, for example) of each fish in the lot, and derivative DNA sequences of each fish, back to the source individual organism as a subcomponent of the lot. However, in our genomics collection, a single fish from a lot is split out and given a separate catalog number, linked to the original lot by a "same lot as" relationship to a cataloged lot url. The single cataloged fish then has multiple tissue types = MaterialSamples, e.g. separate vials with fin clip, muscle sample etc, associated with the single catalog item = specimen = organism in this context. Then each part of the fish = MaterialSample can be subsampled for loans, creating child material samples, which then link to sequence data and publications etc. This is more manageable. If we have both these scenarios, then other collections will also. We need the flexibility of working with either.

mjy commented 3 years ago

We (TaxonWorks) use something that more or less maps 1:1 to materialSampleType that in part reflects the nature of the enumeration (count, as asserted by the curator) of the number of whole organisms, so we would use it. In our case our types are Specimen (count = 1), Lot (count > 1), and RangedLot (count is curator definable into categories with min/max). We have various physical (or once physical) entities (e.g. Extract, Sequence) that can be derived from each other and these classes, these types would also be map-able to materialSampleType.
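The count-based typing described above might be sketched like this. It is not the TaxonWorks implementation: the function `sample_type` and its parameters are hypothetical, and only the three type names and their count rules come from the comment.

```python
from typing import Optional, Tuple


def sample_type(count: Optional[int] = None,
                count_range: Optional[Tuple[int, int]] = None) -> str:
    """Map a curator-asserted count (or min/max range) to a type name."""
    if count_range is not None:
        low, high = count_range
        if low < 0 or high < low:
            raise ValueError("count_range must satisfy 0 <= min <= max")
        return "RangedLot"
    if count is None or count < 1:
        raise ValueError("count must be a positive integer")
    return "Specimen" if count == 1 else "Lot"
```

Under this rule a single-specimen lot would be typed "Specimen" purely from its count, which is one possible answer to the single-specimen-lot question raised earlier in the thread, though not necessarily the one a lot-based collection would want.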

Our OriginRelationship lets us define the parentMaterialSampleID in a generic way, so we could use that too. [All terms sensu the definitions in TaxonWorks, perhaps not as generally used.]

RogerBurkhalter commented 3 years ago

In my CMS (SQL Server, custom), we now use Lots rarely; when I started 23 years ago, much of the collection was cataloged as Lots. I now use them for bulk samples, residues, or other objects that lack a determination. These are for internal use and are not currently shared via the IPT, so they have not been mapped. The ParentSample (MaterialSample) has several use cases defined by fixed vocabulary for each of three groupings: Single object, Natural groups, or Derived groups.

Single objects are simply a single specimen (UUID the same as the specimen).

Natural groups include: Clusters (fossil-bearing rock rich in abundance, with a UUID); those (fixed vocabulary terms) include:

  • Death assemblage
  • Reefs
  • Transport/deposition, taphonomic clusters
  • Condensed bed
  • Coquina or bonebed
  • Coal balls
  • multiple fossils in Amber
  • Parts and counterparts
  • Articulated vertebrate remains
  • Epibionts

Derived groups (usually based on preparations, with a UUID) include (fixed vocabulary terms):

  • Palynomorph slides
  • Diatom slides
  • SEM stubs with:
      • Multiples of a Single taxon from a single Locality
      • Multiples of a Single taxon from multiple Localities
      • Multiple taxa from a single Locality
      • Single Locality from a single Locality
  • Microfossil cavity slides or gridded cavity slides with:
      • Multiples of a Single taxon from a single Locality
      • Multiples of a Single taxon from multiple Localities
      • Multiple taxa from a single Locality
      • Multiple taxa from multiple Localities
  • Coal Ball Peels
  • Microfossil thin sections
  • Serial thin sections of an individual fossil

These are most of the combinations we have come up with, for now. I participated in an iDigBio Paleo Digitization Happy Hour last summer where these terms were put forth. We used the term "Artificial" instead of derived, but derived is a much better term. Lately, I have been looking at who (what) to attribute identifications to when those are made by machine via deep learning AI/CNN. So much to do.

RogerBurkhalter commented 3 years ago

I need to mention that specimens in the Natural groups receive individual catalog numbers and UUIDs where possible (specimens may be stacked or poorly exposed), while many of the derived group microfossil specimens also receive individual catalog numbers; palynomorphs may not. Serial sections have been cataloged with the Parent catalog number, appended with a letter or decimal number, as they represent one individual.

tucotuco commented 3 years ago

Related issues are Issue #1, Issue #3, Issue #24 (reopened because of renewed interest), Issue #332, Issue #344, Issue #345, Issue #346, and Issue #347.

dshorthouse commented 3 years ago

At the risk of yet more cans of worms (though it was raised to an extent by @mjy), where then do we associate identifications (and their histories) in this discussion of MaterialSample vs Occurrence? Can both of these have associated identifications? If so, then the way we construct Darwin Core Archives may unravel.

Take for example a pinned bee as a MaterialSample with pollen in its corbiculae. If you scrape the pollen off and mount it on a slide, you now have two MaterialSamples, each with different identifications (and subsequent identification histories). We could argue that there is still the one Occurrence, but we might also argue that there were two (or more) – the bee was the collector of the pollen. Nonetheless, there are divergent determinations for different parts of the MaterialSamples and we're incapacitated by the star schema of the Darwin Core archive.

deepreef commented 3 years ago

@dshorthouse :

At the risk of yet more cans of worms (though it was raised to an extent by @mjy), where then do we associate identifications (and their histories) in this discussion of MaterialSample vs Occurrence? Can both of these have associated identifications? If so, then the way we construct Darwin Core Archives may unravel.

In my mind, the only DwC class to which Identifications should apply is Organism. Alas, this is not something that most CMS systems accommodate, but at least logically that's the "thing" to which an assertion about taxonomic identity applies.

I suppose that as long as we have this simple/flat way of sharing all data as instances of Occurrence, and in most cases the ratio of Occurrence:MaterialSample:Organism is 1:1:1, then it doesn't matter from an implementation perspective. So, as with the MeasurementOrFact or ResourceRelationship classes in DwC, perhaps the short-term solution is that the subject of an Identification instance can be an instance of any one of several different DwC classes (Organism, MaterialSample, Occurrence)?

But for implementation builders, I would strongly encourage that Identification instances link directly to Organism instances, as represented in the DSW graph.

Take for example a pinned bee as a MaterialSample with pollen in its corbiculae. If you scrape the pollen off and mount it on a slide, you now have two MaterialSamples, each with different identifications (and subsequent identification histories). We could argue that there is still the one Occurrence, but we might also argue that there were two (or more) – the bee was the collector of the pollen. Nonetheless, there are divergent determinations for different parts of the MaterialSamples and we're incapacitated by the star schema of the Darwin Core archive.

So... the "right" way to handle this, I think, is that the bee and the pollen represent two different Organism instances, each with their own taxonomic identity. That means two separate Occurrence instances as well. But a MaterialSample can consist of multiple organisms/taxa, so the bee+pollen could be one MaterialSample as an aggregate of the bee+pollen+any other parasites/symbionts that the bee happens to carry with it.

And yeah, I'd definitely be down with crediting the bee as the collector of the pollen!

campmlc commented 3 years ago

@deepreef I agree with the bee and the pollen as different organism instances, along with all the bee's multiple ecto- and endoparasites and viruses, each with their own taxonomic identity. But I'm still confused by how Occurrence and MaterialSample are applied in this example. If Occurrence is place+time+organism, I guess it makes sense to have two occurrences. But the place+time is shared by the bee and the pollen (and parasites)- this is a very important piece of data that seems to get lost in your mapping. In our system, this linking place+time is the collecting event. I'm still learning the dwc terminology - how would that be captured?

As for the MaterialSample, I can see the bee+pollen+parasites being mapped to a single MaterialSample while the bee is on a pin in an insect collection. But the minute someone scrapes off the pollen, puts it on a slide, gives it an ID and perhaps a new catalog number, and puts the slide in a slide box, that becomes another MaterialSample, correct? Ditto for someone pulling off the mite under the bee's wing, sending it to another researcher on loan, who gives it an ID and uses it to generate a DNA sequence? So these would all be additional MaterialSamples ("child parts") of the original bee record, or they could be new MaterialSamples that are their own parent samples to further "children". Am I understanding this correctly?

All of these categories are going to split and shift over time into different categories of a tree schema. Which is why we really need to have some sort of overarching "parent" , which is really the place+time+collection object , which may only initially include a single taxon but which in reality, if you include parasites and pollen and viruses which may or may not be split off and identified, includes multiple taxa. "ParentMaterialSample" seems like the wrong word. Maybe "Occurrence" is correct if it can allow for multiple taxa?

deepreef commented 3 years ago

@campmlc :

But I'm still confused by how Occurrence and MaterialSample are applied in this example. If Occurrence is place+time+organism, I guess it makes sense to have two occurrences. But the place+time is shared by the bee and the pollen (and parasites)- this is a very important piece of data that seems to get lost in your mapping.

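place+time = Event. So when I collect a bee that has pollen and three parasites, I would:

* Create one `Event` instance (place+time)
* Create five `Organism` instances (one bee, one plant/pollen, three different parasites)
* Create five `Occurrence` instances (one for each of the `Event`+`Organism` pairings)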

How MaterialSample fits into it depends on how we precisely define the boundary between Organism and MaterialSample (see my super-long rantings above). The question I think you're asking, which is the same question I am ultimately trying to answer, is: How do we link MaterialSample instances to Events? The obvious answer is "via the relevant Occurrence instance(s)" But the problem is, as you note, if Occurrence = [Event] + [Organism], how do we actually connect a single MaterialSample (aggregate bee+pollen+3 parasites) to a single Event? At face value, it would need to pass through five Occurrence instances. But that seems unnecessarily cumbersome. And that is the crux of what I'm trying to wrap my head around: what is the actual relationship between Organism and MaterialSample?
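To make the linking problem concrete, here is a minimal Python sketch of the bee example. It is not a DwC implementation: the class and field names, the IDs, and the place and date values are all hypothetical, and it assumes a MaterialSample reaches its Event only by way of Occurrence instances, which is exactly the arrangement being questioned here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str
    place: str  # hypothetical value, for illustration only
    time: str   # hypothetical value, for illustration only


@dataclass(frozen=True)
class Organism:
    organism_id: str
    label: str


@dataclass(frozen=True)
class Occurrence:
    # An Occurrence pairs exactly one Event with exactly one Organism.
    occurrence_id: str
    event: Event
    organism: Organism


@dataclass
class MaterialSample:
    material_sample_id: str
    occurrences: list = field(default_factory=list)

    def events(self):
        """Distinct Events reachable through this sample's Occurrences."""
        return {occ.event for occ in self.occurrences}


event = Event("E1", "hypothetical meadow", "2021-06-01")
labels = ["bee", "pollen plant", "parasite 1", "parasite 2", "parasite 3"]
organisms = [Organism(f"O{i}", label) for i, label in enumerate(labels, start=1)]
occurrences = [Occurrence(f"OCC{i}", event, org)
               for i, org in enumerate(organisms, start=1)]

# The aggregate sample (pinned bee plus everything it carries) has to pass
# through all five Occurrences to reach what is, in the end, a single Event.
pinned_bee = MaterialSample("MS1", occurrences)
```

In this sketch `pinned_bee.events()` collapses to one Event even though five Occurrence records sit in between, which illustrates why routing the link through Occurrence feels cumbersome.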

I still feel the answer lies in treating MaterialSample as one of several examples of "Token", as represented in the DSW diagram I keep referring to. But this gets complicated when you have an aggregate MaterialSample extracted from nature in a single Event, but there is an unknown number of Organism instances represented within the MaterialSample.

I have some ideas on this, but more discussion is definitely needed.

As for the MaterialSample, I can see the bee+pollen+parasites being mapped to a single MaterialSample while the bee is on a pin in an insect collection. But the minute someone scrapes off the pollen, puts it on a slide, gives it an ID and perhaps a new catalog number, and puts the slide in a slide box, that becomes another MaterialSample, correct?

Yes, that's how I imagine it.

Ditto for someone pulling off the mite under the bee's wing, sending it to another researcher on loan, who gives it an ID and uses it to generate a DNA sequence? So these would all be additional MaterialSamples ("child parts") of the original bee record, or they could be new MaterialSamples that are their own parent samples to further "children".

Yes -- there can be n-number of "generations" in a MaterialSample parent-child lineage (i.e., fleas upon fleas upon fleas, etc.)
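A parent-child lineage of arbitrary depth can be sketched as a simple walk up parent links. This is only an illustration: `parent_id` is a hypothetical stand-in for something like the parentMaterialSampleID mentioned earlier in the thread, and the sample IDs are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaterialSample:
    material_sample_id: str
    parent_id: Optional[str] = None  # hypothetical parent link; None for a root


def lineage(sample_id, samples):
    """Return the IDs from the given sample up to its root ancestor."""
    chain = []
    seen = set()
    current = sample_id
    while current is not None:
        if current in seen:
            # Guard against malformed data where parent links form a loop.
            raise ValueError(f"cycle in parent links at {current}")
        seen.add(current)
        chain.append(current)
        current = samples[current].parent_id
    return chain


samples = {s.material_sample_id: s for s in [
    MaterialSample("bee-pin"),
    MaterialSample("mite", parent_id="bee-pin"),
    MaterialSample("mite-dna-extract", parent_id="mite"),
]}

print(lineage("mite-dna-extract", samples))
```

The walk makes no assumption about how many generations exist, so a mite pulled from a bee and a DNA extract taken from that mite are handled the same way as any deeper chain.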

Which is why we really need to have some sort of overarching "parent" , which is really the place+time+collection object ,

Yes -- I think that part is manageable. As @dshorthouse mentioned in a related context, there is subjectivity in the edge cases for splitting up the various MaterialSample instances (and assigning them to a materialSampleType), but for the most part I don't see a problem with n-tier partitioning and/or aggregating. The tricky part (as discussed above) is how the Event data get linked to the MaterialSample instances.

Jegelewicz commented 3 years ago

fleas upon fleas upon fleas, etc

Ah, reminds me of campfires with my dad and his guitar....

There's a flea on the fly on the wart on the frog on the knot on the log in the hole in the bottom of the sea....

deepreef commented 3 years ago

There's a flea on the fly on the wart on the frog on the knot on the log in the hole in the bottom of the sea....

Ha! I remember that one as well! I used to LOVE it as a kid (still do, but that's because I'm still a kid in most respects). It should become the anthem for MaterialSample.

Jegelewicz commented 3 years ago

@tucotuco What's the standard for kicking off a Task Group?

tucotuco commented 3 years ago

The process is outlined in the Task Groups section of the TDWG Process document. The first task is to create a charter for the group. An example of a charter for one Task Group with a Darwin Core vocabulary enhancement that has just successfully achieved its goals is that for the Chronometric Age Extension. Two more for currently active vocabulary enhancements for Darwin Core are Humboldt Core and OSR - How Did It Die?. Task Group charters are linked at the bottom of the parent Interest Group page, such as that for the Observations & Specimens Interest Group and for the Earth Sciences and Paleobiology.

tucotuco commented 3 years ago

A Task Group on this subject should take a serious look at the Semantic Sensor Network Ontology, and the sosa:Sample in particular.

matdillen commented 3 years ago

I've been reading through this thread and it has been a lot to digest. Still, it got me thinking on what the relationship actually is between the physical biological specimens we curate and the occurrences of organisms they represent. One element that I seem to be missing in these discussions is the Observation.

There has been a lot of discussion about the Event of an Organism occurring at a certain time and place. This Occurrence is what we try to connect to our Specimens. But it seems to me that there is a key node in between: the Observation of this Occurrence. A physical Specimen can not possibly be connected to an Occurrence without an Observation taking place. A physical Specimen implies a record in some shape or format of this Observation of an Occurrence, be that record the whole organism dried and stuck on a sheet of paper, a blood sample of the organism or even a drawing of it. Observations in this sense can be made by human agents, but also by drones or automated sampling machines.

An Observation of an Occurrence does not have to coincide in space and time with that Occurrence. For instance, one may observe an animal footprint and deduce the occurrence of that animal earlier. Also, one may observe a fossil and deduce the occurrence of that organism a long time ago. One may observe a drowned rare bird and deduce its occurrence earlier in another less wet location.

This solves some of the ambiguity problems, as multiple Observations can record the same Occurrence of an Organism. Different Specimens can connect to a single Observation of an Occurrence, and constitute evidence for this Observation. Specimens can be samples or duplicates of other Specimens. A single Observation can record multiple occurring Organisms.
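The cardinalities described in the paragraph above can be sketched in Python as follows. The class names, field names and IDs are hypothetical and only meant to show that several Observations may point at one Occurrence while one Observation may carry several specimens.

```python
from dataclasses import dataclass, field


@dataclass
class Occurrence:
    occurrence_id: str


@dataclass
class Observation:
    observation_id: str
    occurrences: list = field(default_factory=list)  # Occurrences recorded
    specimen_ids: list = field(default_factory=list)  # evidence for this Observation


def observations_of(occurrence_id, observations):
    """IDs of every Observation that records the given Occurrence."""
    return [
        obs.observation_id
        for obs in observations
        if any(occ.occurrence_id == occurrence_id for occ in obs.occurrences)
    ]


bee = Occurrence("occ-bee")
pollen = Occurrence("occ-pollen")

observations = [
    # One Observation recording two occurring Organisms, backed by two specimens.
    Observation("obs-1", [bee, pollen], ["pinned-bee", "pollen-slide"]),
    # A second, independent Observation of the same bee Occurrence.
    Observation("obs-2", [bee], ["field-photo"]),
]

print(observations_of("occ-bee", observations))
```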

Specimens can then be connected to an Observation in various ways. That is, the Specimen constitutes

I'm not sure about the distinction for 'significant modification'. This is in part the difference between living and nonliving (preserved), but it's more complicated than that in practice. Is a piece of fur, a shark tooth or some birch sap living or preserved? An extra distinction between organism parts and organism products may be helpful here, but is a bit of a can of worms itself.

Applying this to the example of a pinned bee with 3 parasites and pollen, we get:

If we construe the bee collecting the pollen as an observation event itself, then we have a material sample that connects to multiple observation events. The observation in this case is not the bee collecting the pollen, but the observed pollen attached to the bee providing evidence for the pollen being collected by the bee earlier. This can also happen if we sample a plant damaged by deer or a whale with squid scars. In the same way, a fossil sample represents both the observation of the occurrence of a fossilized organism and the observation of the occurrence of an organism a long time ago.

The relationship between the observation event and the sample is direct: the sample is a product of the event. There is also some ambiguity with regards to specimen vs material sample. I like the definition of specimen being directly tied to curation, whereas a material sample is any physical object that is the result from an observation or the mutation of another sample. Hence, a specimen is a material sample, but a material sample may not be a specimen. This is particularly relevant when considering digital specimens: an observation may have as a material sample only the sensor output from a digital camera. This output is almost immediately digitized and otherwise lost. The digital recording may be curated as the recording of an observation (and hence evidence for it), in which case it is a digital specimen.

I know I've added another wall of text to an already extremely long discussion and I apologize for that, but I felt it important to get my thoughts somewhat in order and do a sanity check of whether this could be helpful.

deepreef commented 3 years ago

@matdillen :

A physical Specimen can not possibly be connected to an Occurrence without an Observation taking place.

I've thought about this a lot as well, and somewhere recently (not sure if in a post on this issue, or somewhere else), I made the point that many collected specimens are observed before they are collected. In our case, we often observe them first, then capture an in-situ image of them, then collect the specimen. I see these as three separate pieces of "evidence" to support the Occurrence, but of course it's only one Occurrence (one Organism, one Event).

However, there are plenty of cases where organisms are collected without first being observed. Think trawls and plankton tows, and insect traps, etc.

A physical Specimen implies a record in some shape or format of this Observation of an Occurrence, be that record the whole organism dried and stuck on a sheet of paper, a blood sample of the organism or even a drawing of it.

Agreed! Hence my frequent references to "Evidence" as a "thing" in our data universe. Technically, though, the physical specimen itself does not represent evidence of the Occurrence. It can certainly serve as evidence of taxonomic Identification; but the actual "evidence" of the occurrence is the data label containing information about the circumstances of how the specimen was extracted from nature. This might seem like splitting hairs, but consider the circumstance when labels of two different specimens of the same taxon accidentally get switched (it happens -- researchers working on a species sometimes return fish specimens to the wrong jar, for example). I suppose in some cases properties of the specimen itself could be used to corroborate the time and/or location of collection, but I imagine that's the exception, rather than the rule.

An Observation of an Occurrence does not have to coincide in space and time with that Occurrence. For instance, one may observe an animal footprint and deduce the occurrence of that animal earlier.

I think this is a really good point, and relates to that earlier example from @dshorthouse with the cougar blood being collected in a water sample downstream.

Specimens can be samples or duplicates of other Specimens

This reminds me of something else I meant to point out earlier. My understanding of "duplicates" is "more than one MaterialSample derived from the same Organism". I think this concept is used mostly in botanical circles, but I wonder whether it's consistently used to mean the more explicit, "more than one MaterialSample derived from the same Organism from the same Event" (i.e., multiple pieces of evidence for the same Occurrence)?

Specimens can then be connected to an Observation in various ways. That is, the Specimen constitutes

The bullet list you provide is, I think, very helpful. I went through each example and imagined how I would capture the information with respect to Events, Organisms, Occurrences, and MaterialSamples -- but I wonder if everyone would arrive at the same conclusions for how to do that.

I guess a lot of how we slice this depends on how we define "observation". For example, if I drag a plankton net through the ocean, then dump the contents into alcohol and eventually get around to examining them months later back at the lab, was there ever an "Observation" to serve as evidence in support of an Occurrence? In my view, no. I think of "Observations" as more direct humans (eyes, ears, potentially smell, taste?) or lenses or microphones or whatever directly "observing" the Organism at the moment of an Occurrence. In cases where I first observe, then photograph, then collect an Organism, I generally don't bother adding a separate record of the "observation" as evidence, figuring that it is superseded by the image and the MaterialSample/Specimen. Generally, I track Observations only when there are no MaterialSample or recorded media available to support the Occurrence.

If we construe the bee collecting the pollen as an observation event itself, then we have a material sample that connects to multiple observation events.

This is another excellent point, and one I'm going to need to digest a bit more (in the shower/stuck in traffic/staring at my ceiling at night). Certainly an in-situ image can serve as evidence of multiple Occurrences, so I can see the same for MaterialSamples as well. The most obvious/common example in my world would be stomach contents. This also raises the issue of non-human organisms as "collectors", and hence "Agents", and hence indirectly supporting the parity of "Agent" and "Organism" (as discussed elsewhere).

I like the definition of specimen being directly tied to curation, whereas a material sample is any physical object that is the result from an observation or the mutation of another sample.

I still don't favor this distinction. I think MaterialSample necessarily involves an element of curation -- even if the "curation" is limited to the original act of collection. That leaves open the question of whether an observed (but untouched) skull in-situ is itself an instance of MaterialSample, or Organism, or something else. This, of course, comes back to my initial question of: What is the distinction between an Organism instance and a MaterialSample instance. I think "curation" definitely has something to do with it, but we still need to define that word. I'm not so sure I'm willing to recognize a distinction between "Specimen" and "MaterialSample". There are so many non-congruent definitions for "Specimen" that I feel there is little to be gained by acknowledging it as something distinct in some way from MaterialSample.

an observation may have as a material sample only the sensor output from a digital camera

I wouldn't go there (i.e., regarding a pattern of 1s and 0s as a MaterialSample in the DwC sense). It feels to me like "here be dragons".

Lots of good food for thought!

Jegelewicz commented 3 years ago

I think of "Observations" as more direct humans (eyes, ears, potentially smell, taste?) or lenses or microphones or whatever directly "observing" the Organism at the moment of an Occurrence.

Why does "observing" have to be limited to the senses? How do physicists observe a quark? Couldn't the net be the method by which we observe?

I'm not so sure I'm willing to recognize a distinction between "Specimen" and "MaterialSample". There are so many non-congruent definitions for "Specimen" that I feel there is little to be gained by acknowledging it as something distinct in some way from MaterialSample.

Agree. Also as we catalog objects for art, ethnology and historical collections, "specimen" is something we try to avoid. Your grandmother's hair in a locket probably should not be referred to as a "specimen".

an observation may have as a material sample only the sensor output from a digital camera

I wouldn't go there (i.e., regarding a pattern of 1s and 0s as a MaterialSample in the DwC sense). It feels to me like "here be dragons".

Also agree, sort of. See https://github.com/ArctosDB/arctos/issues/2118 I think this is a little murky, BUT thinking about EVIDENCE instead of MaterialSample might make it less so?

deepreef commented 3 years ago

Why does "observing" have to be limited to the senses? How do physicists observe a quark? Couldn't the net be the method by which we observe?

Well... isn't that the line between HumanObservation and MachineObservation? [That's what I was intending to imply with "lenses or microphones or whatever"] If not, then where is that line? Do photons passing through a lens into human eyeballs (e.g., microscope, binoculars, telescope) count as HumanObservation, or MachineObservation? Perhaps that distinction does not need to be maintained?

Also as we catalog objects for art, ethnology and historical collections, "specimen" is something we try to avoid

Same here. We can treat a cultural object exactly the same (informatically) as a biological specimen; and I prefer the term MaterialSample for both.

BUT thinking about EVIDENCE instead of MaterialSample might make it less so?

Yes, my thinking on this is catching up to where @baskaufs was a while ago, which is that "Evidence" represents the relationship between a "token" (MaterialSample, MaterialCitation, media recording, observation, etc.) and an "assertion" (e.g., Occurrence, Identification). I had previously thought of the "Evidence" as the token itself; but now I see it more as a role than an object. (If that makes any sense?)
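One way to picture "Evidence as a role rather than an object" is as a link record that sits between a token and an assertion. The Python sketch below is only an illustration under that reading; the class names, field names and IDs are hypothetical and not part of DwC or DSW.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    token_id: str
    kind: str  # e.g. "MaterialSample", "media recording", "observation"


@dataclass(frozen=True)
class Assertion:
    assertion_id: str
    kind: str  # e.g. "Occurrence", "Identification"


@dataclass(frozen=True)
class Evidence:
    # Evidence is the relationship itself, not the token.
    token: Token
    assertion: Assertion


def tokens_supporting(assertion, evidence_links):
    """Tokens that play the evidence role for the given assertion."""
    return [link.token for link in evidence_links if link.assertion == assertion]


specimen = Token("T1", "MaterialSample")
image = Token("T2", "media recording")
occurrence = Assertion("A1", "Occurrence")
identification = Assertion("A2", "Identification")

links = [
    Evidence(specimen, occurrence),
    Evidence(image, occurrence),
    Evidence(specimen, identification),  # the same token in a second role
]

print([t.token_id for t in tokens_supporting(occurrence, links)])
```

Because the role lives in the link, the same specimen can support both an Occurrence and an Identification without being duplicated.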

matdillen commented 3 years ago

I've thought about this a lot as well, and somewhere recently (not sure if in a post on this issue, or somewhere else), I made the point that many collected specimens are observed before they are collected. In our case, we often observe them first, then capture an in-situ image of them, then collect the specimen. I see these as three separate pieces of "evidence" to support the Occurrence, but of course it's only one Occurrence (one Organism, one Event).

However, there are plenty of cases where organisms are collected without first being observed. Think trawls and plankton tows, and insect traps, etc.

The reason I think the Observation is so key is that it marks the point where any information related to the Occurrence was somehow logged, so that it can be (re-)assessed later. Hence, the Observation of an insect in a trap happens when the insect is taken from that trap or seen inside it. If an insect dies in a trap, is eaten and digested by another insect fallen into the trap and never observed by any entity logging its Occurrence, then there was no Observation event.

Agreed! Hence my frequent references to "Evidence" as a "thing" in our data universe. Technically, though, the physical specimen itself does not represent evidence of the Occurrence. It can certainly serve as evidence of taxonomic Identification; but the actual "evidence" of the occurrence is the data label containing information about the circumstances of how the specimen was extracted from nature. This might seem like splitting hairs, but consider the circumstance when labels of two different specimens of the same taxon accidentally get switched (it happens -- researchers working on a species sometimes return fish specimens to the wrong jar, for example). I suppose in some cases properties of the specimen itself could be used to corroborate the time and/or location of collection, but I imagine that's the exception, rather than the rule.

The data label can be considered part of the specimen, or an additional specimen. This depends on how the objects were created and how they are being curated. However, as you say, information may be mixed up or connected incorrectly at any node of this model.

This reminds me of something else I meant to point out earlier. My understanding of "duplicates" is "more than one MaterialSample derived from the same Organism". I think this concept is used mostly in botanical circles, but I wonder whether it's consistently used to mean the more explicit, "more than one MaterialSample derived from the same Organism from the same Event" (i.e., multiple pieces of evidence for the same Occurrence)?

The methodology is not always clear. The definition of a single Organism may also not always be clear (e.g. rhizomatous plants, clonal tree groves or massive fungal networks). The most common usage, I think, would be samples collected during the same gathering event and from the same organism - or at least a very similar one. But a single gathering event might also take hours, days or even weeks.

I guess a lot of how we slice this depends on how we define "observation". For example, if I drag a plankton net through the ocean, then dump the contents into alcohol and eventually get around to examining them months later back at the lab, was there ever an "Observation" to serve as evidence in support of an Occurrence? In my view, no. I think of "Observations" as more direct humans (eyes, ears, potentially smell, taste?) or lenses or microphones or whatever directly "observing" the Organism at the moment of an Occurrence. In cases where I first observe, then photograph, then collect an Organism, I generally don't bother adding a separate record of the "observation" as evidence, figuring that it is superseded by the image and the MaterialSample/Specimen. Generally, I track Observations only when there are no MaterialSample or recorded media available to support the Occurrence.

I think of an Observation as an event where data on an Occurrence gets logged. This can get really tedious and you could divide everything up into countless mini-observations. If this is meaningful to what you are researching and a feasible thing to do, you could log your data that way. But, as you say, people will regularly simplify this model as many complications are unnecessary. In particular, many mini-observations may be redundant.

In practice, many observations will get merged this way. For instance, if you observe, photograph and collect an organism, you may later remember something peculiar about its behavior that is not apparent from its preserved body nor the photograph. You note this additional information on a label or in a publication which covers this Occurrence. Hence, it becomes de facto a part of a larger material sample related to this Occurrence and the distinction of this separate Observation gets lost in time or is considered irrelevant by everyone ever working with this Occurrence.

This is another excellent point, and one I'm going to need to digest a bit more (in the shower/stuck in traffic/staring at my ceiling at night). Certainly an in-situ image can serve as evidence of multiple Occurrences, so I can see the same for MaterialSamples as well. The most obvious/common example in my world would be stomach contents. This also raises the issue of non-human organisms as "collectors", and hence "Agents", and hence indirectly supporting the parity of "Agent" and "Organism" (as discussed elsewhere).

Non-human animals definitely observe Occurrences, but the question is how they can log that information. If we can communicate with them like we communicate among humans or with machines, then that model would work.

I wouldn't go there (i.e., regarding a pattern of 1s and 0s as a MaterialSample in the DwC sense). It feels to me like "here be dragons".

And it's said that where there be dragons, there be treasure. I agree that we have enough going on not to open this discussion, but fundamentally to me there is no difference between digital data about an Occurrence and physical data. There are (currently) limitations to how we can represent physical data digitally (and vice versa), but this is not a theoretical hard distinction. A bit stream is 'simply' a very versatile, easily manageable and easily replicable representation of anything physical.

Lots of good food for thought!

Thank you!

cboelling commented 3 years ago

What this thread shows to me is that representational primitives in a schema don't function in isolation and how important it is to match expectations of what a given element of a schema represents with the formal definition and the designated label for that element (the term itself).

@campmlc :

But I'm still confused by how Occurrence and MaterialSample are applied in this example. If Occurrence is place+time+organism, I guess it makes sense to have two occurrences. But the place+time is shared by the bee and the pollen (and parasites)- this is a very important piece of data that seems to get lost in your mapping.

place+time = Event. So when I collect a bee that has pollen and three parasites, I would:

* Create one `Event` instance (place+time)

* Create five `Organism` instances (one bee, one plant/pollen, three different parasites)

* Create five `Occurrence` instances (one for each of the `Event`+`Organism` pairings)

Let's say the parasites are 3 mites (from one or more different species).

What gets lost in this representation goes, in my opinion, even one step further than @campmlc pointed out above: the fact that the bee, pollen and mites formed, when first observed, a physically connected object. And the actual nature of this physical connectedness, as it was observed, leads us (from a large body of related observations) to conclude that there are certain functional relations between the bee and the pollen (the bee actively collected the pollen) and the bee and the mites (the mites are parasites of the bee). This is especially interesting if, for example, that kind of pollen or that kind of mite is observed for the first time on that kind of bee (or, in the case of collection specimens, one of them has gone extinct in the meantime).

That an occurrence, in the above sense, of a particular organism is implicated in a particular Event follows logically from the fact that the object that the organism was part of is implicated in the Event. If this is all that is of interest then this representation might be adequate. But I would argue that it is insufficient as it fails to capture findings about the world that are of interest for a multitude of purposes.

Also, depending on the actual definition of Organism (the application of which might present its own set of problems, I agree on this with @matdillen) the individual pollen grains might account for individual instances of Organism in this example.

I would also question the concept of Event as a place and time - I rather see events as processes which unfold in a particular spatio-temporal region and which have various participants (the bee, the collector, the malaise trap) and which can have other processes as proper parts.

My bottom-line is this: samples collected in the field, sub-samples, collection specimens may all in actuality contain innumerable individual organisms or parts of them. While sometimes the assembly isn't of interest, it is important in others (or may become important - we started analyzing pollen on bumblebees 150 years after these were collected). One way to represent this is to acknowledge that generally we deal with physical entities of some sort (specimens, samples, material samples - whatever distinctions need to be made and whether that's in the field or in a collection) a part of which can be identified as (part or whole) of a particular organism. This is what @dshorthouse also alluded to earlier.

Regarding the relation between MaterialSample and Organism and the great thought experiment @deepreef put forward I think that sentences like "This is a dead parrot." indicate that organisms continue to exist after they're dead :-)

I would argue that some physical entities (Organisms), at a given point in time, are alive (or can have living parts - possibly of more than one organism). Other physical entities are clearly not alive. In each of these cases, I can capture that quality, if need be. Biological organisms (similar cases could be made for the collection of non-living material, e.g. fossils or bird nests) are collected and their physical substance, through a succession of processes after initial collection, is transformed into something referred to as Specimen or MaterialSample (possibly many, possibly in subsequent stages, physical entities nonetheless). At some point it may become meaningless to consider that entity an Organism anymore. There may be numerous cases where the decision if alive or not is difficult, but I'm not sure I have a use case at hand where that distinction must be made in every case in order to achieve an informative representation. If it must be made in DwC then this could, from my perspective, point to the need to revise these concepts and/or the design patterns in which they are jointly used.

Jegelewicz commented 3 years ago

Well... isn't that the line between HumanObservation and MachineObservation? [...] Perhaps that distinction does not need to be maintained?

It probably doesn't - all observations (that we are talking about) are human eventually as we are interpreting whatever the "machine" observed and have no way of knowing what the "machine" itself observed.

The reason I think the Observation is so key is that it marks the point where any information related to the Occurrence was somehow logged, so that it can be (re-)assessed later. Hence, the Observation of an insect in a trap happens when the insect is taken from that trap or seen inside it. If an insect dies in a trap, is eaten and digested by another insect fallen into the trap and never observed by any entity logging its Occurrence, then there was no Observation event.

Schrödinger's cat anyone? But yes, however....

Non-human animals definitely observe Occurrences, but the question is how they can log that information. If we can communicate with them like we communicate among humans or with machines, then that model would work.

I kinda have an issue with the implied definition of "communication". In the example provided (stomach contents), the animal "logs" the observation with the evidence collected in its stomach. Writing stuff down or speaking are not the only methods of communication.

deepreef commented 3 years ago

@matdillen :

The reason I think the Observation is so key is that it marks the point where any information related to the Occurrence was somehow logged, so that it can be (re-)assessed later.

I understand where you're coming from -- but I still am reluctant to treat everything as an Observation in the sense of DwC (HumanObservation, MachineObservation). When I sort through that plankton sample back at the lab, I don't want to anchor its Occurrence at a depth of several meters along a transect-line out in the ocean on the day that the plankton sample was extracted from nature to an "Observation", because I didn't observe it several meters deep out in the ocean. I can only infer that it occurred at that depth, on that transect. I want to anchor it to a gathering event that did not include any observed organisms at the time and place of interest.

Similarly for the pollen on the collected bee, I don't want to create an Occurrence representing the bee's observing the pollen at the time it was extracted from the flower; but I do want to infer the presence of that species of flower within some radius and time-frame associated with the Event where the bee was extracted from nature. Factually, I can only say that a derivative of the flower (i.e., the pollen) was present at the Event where the bee was collected. As with the case of the cougar blood collected in the stream, does that mean that the plant Organism from which the pollen was collected was simultaneously out in the field where the flower is and also on the bee tens of meters away (i.e., two separate Events at the same time)? Or would I create a separate Event (with larger coordinateUncertaintyInMeters) to represent the likely place/time where the flower was when the bee gathered the pollen? This gets right to the heart of my question about when an Organism becomes a MaterialSample. In my current thinking, the plant Organism was simultaneously both where the flower was and where the pollen was at the time the bee was collected (in the same way that the cougar, as an Organism, was on the river bank eating a fish and was also present downstream as blood when the water sample was collected). But maybe that's the wrong way to look at it?

@cboelling :

Let's say the parasites are 3 mites (from one or more different species).

In my example you quoted, I specifically intended the three parasites to be three different species (otherwise they could be collapsed into a single Organism instance). But it doesn't really matter.

What gets lost in this representation goes in my opinion even one step further than @campmlc pointed out above: the fact that the bee, pollen and mites formed, when first observed, a physically connected object. And the actual nature of this physical connectedness, as it was observed, leads us (from a large body of related observations) to conclude that there are certain functional relations between the bee and the pollen (the bee actively collected the pollen) and the bee and the mites (the mites are parasites of the bee).

I agree this is super important and useful information, but when recording these connections, are they represented as relationships among dwc:Organism instances, dwc:MaterialSample instances, or dwc:Occurrence instances? It seems to me that what makes the relationships interesting are with respect to the Organisms; but probably the most explicit way to represent these relationships is as among the associated Occurrences (capturing not just the relationships among the Organisms, but the context in terms of place and time of those relationships). This is another class of information that MaterialSamples can serve as evidence to support. In other words, a particular MaterialSample not only can serve as evidence of the existence of an Occurrence, and the taxonomic identity of an Organism, but also the relationship (beyond just co-occurrence in space and time) among a set of multiple Organisms. This will not always be the case, as the nature of the connectedness of the different Organisms in this example has different implications than other multi-organism MaterialSample instances (e.g., water samples, or a "lot" of specimens, which tells you little more about the associations among the organisms than co-occurrence in space and time).

In any case, I wouldn't say this kind of information is "lost" in this example; rather the discussion was primarily about parsing out instances of Occurrence, Organism and MaterialSample. Once that is sorted out, we're then able to start adding relationships among these instances to capture the interesting inferences about connectedness among the represented instances.
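As a rough sketch of what "adding relationships among these instances" could look like: the keys `resourceID`, `relationshipOfResource` and `relatedResourceID` are borrowed from the DwC ResourceRelationship class, while the `evidence` key (pointing at a MaterialSample), all identifier values and the relationship wording are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical occurrence and material sample identifiers, for illustration only.
relationships = [
    {"resourceID": "occ-bee", "relationshipOfResource": "collected pollen of",
     "relatedResourceID": "occ-plant", "evidence": "ms-bee-1"},
    {"resourceID": "occ-mite-a", "relationshipOfResource": "parasite of",
     "relatedResourceID": "occ-bee", "evidence": "ms-bee-1"},
    {"resourceID": "occ-mite-b", "relationshipOfResource": "parasite of",
     "relatedResourceID": "occ-bee", "evidence": "ms-bee-1"},
    {"resourceID": "occ-mite-c", "relationshipOfResource": "parasite of",
     "relatedResourceID": "occ-bee", "evidence": "ms-bee-1"},
]


def partners(occurrence_id, rels):
    """Return the Occurrences related to occurrence_id, in either direction."""
    found = set()
    for rel in rels:
        if rel["resourceID"] == occurrence_id:
            found.add(rel["relatedResourceID"])
        elif rel["relatedResourceID"] == occurrence_id:
            found.add(rel["resourceID"])
    return sorted(found)


def relationships_by_evidence(rels):
    """Group (subject, relationship, object) triples by the sample that supports them."""
    grouped = defaultdict(list)
    for rel in rels:
        triple = (rel["resourceID"], rel["relationshipOfResource"], rel["relatedResourceID"])
        grouped[rel["evidence"]].append(triple)
    return dict(grouped)
```

Here the relationships are expressed among Occurrences, and the MaterialSample only appears as the evidence supporting each of them. A water sample or a "lot" would simply have no such rows beyond co-occurrence.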

Also, depending on the actual definition of Organism (the application of which might present its own set of problems, I agree on this with @matdillen) the individual pollen grains might account for individual instances of Organism in this example.

Yes -- assuming the aggregated pollen came from more than one plant "whole organism". But part of the reason dwc:Organism is defined to accommodate multiple individuals of the same taxon is to avoid forcing the parsing of individuals (especially when the boundaries between individuals are unclear). Of course, if more than one species of plant is represented among the pollen, then it would be necessary to establish at least one instance of Organism for each species of plant.

I would also question the concept of Event as a place and time - I rather see events as processes which unfold in a particular spatio-temporal region and which have various participants (the bee, the collector, the malaise trap) and which can have other processes as proper parts.

I agree. I use the equation "Event=Place+Time" as short-hand; but in reality it also involves other properties as well, and incorporates a process (e.g., samplingProtocol, etc.).

I think that sentences like "This is a dead parrot." indicate that organisms continue to exist after they're dead :-)

I tend to agree -- the Organism doesn't cease to exist when it dies. We already must accommodate the contemporaneous existence of Organisms and their derived MaterialSamples (e.g., a living tree that is resampled multiple times while it continues to live) -- so I see no reason why this can't extend beyond death. In other words, the parrot can continue to persist as an Organism, even if its only non-disintegrated manifestation is as a preserved skin in a Museum.

Where I'm still a little fuzzy is in dis-associated components derived from the same Organism (i.e., the cougar on the river bank and its blood downstream; the flower in the field and its pollen on a bee; a basking shark swimming through the ocean and its DNA picked up in a water sample; the dinosaur in the forest, and its fossilized remains collected millions of years later, etc.) In other words: We know that a single Organism can participate in multiple Events (i.e., multiple Occurrences) at different times, but can a single Organism participate in multiple Occurrences at different places at the same time? I think the answer will not come from philosophical thought experiments, but rather from practical need.

On a final note, I realize that many people will find this discussion excessive in length/volume, but I am definitely benefitting from it, and VERY MUCH appreciate that I am not the only one who has wrestled with these questions!

matdillen commented 3 years ago

@Jegelewicz

I kinda have an issue with the implied definition of "communication". In the example provided (stomach contents), the animal "logs" the observation with the evidence collected in its stomach. Writing stuff down or speaking are not the only methods of communication.

My point is not to have a strict definition of communication in general. Non-human animals do communicate and some can definitely communicate occurrences of other organisms to each other. The problem is that Darwin Core is a standard designed by and to be used by humans. The data in it will inevitably be human interpretations of the biological world. Hence, the occurrences that can be inferred by observing the stomach contents of another animal are human interpretations of these samples. A human observes past occurrences using parts of the studied animal as a proxy. You could put an observation event in between, which is the encounter of the animal that led to samples ending up in its stomach, but that information too will be a human interpretation and often part of the observation where a human looked at the stomach content.

@deepreef

I understand where you're coming from -- but I still am reluctant to treat everything as an Observation in the sense of DwC (HumanObservation, MachineObservation). When I sort through that plankton sample back at the lab, I don't want to anchor its Occurrence at a depth of several meters along a transect-line out in the ocean on the day that the plankton sample was extracted from nature to an "Observation", because I didn't observe it several meters deep out in the ocean. I can only infer that it occurred at that depth, on that transect. I want to anchor it to a gathering event that did not include any observed organisms at the time and place of interest.

Why is this inference not an observation? Many scientific disciplines use complex, indirect methods to observe what is happening.

The gathering event is, essentially, a method or protocol, or an instance of its implementation. These can be modeled separately, but the key unit that we are interested in from this Darwin Core perspective is the Occurrence. The differentiation with Observations that I'm thinking of is a method to address Occurrence ambiguity, in particular connecting Occurrences to evidence for them.

Similarly for the pollen on the collected bee, I don't want to create an Occurrence representing the bee's observing the pollen at the time it was extracted from the flower; but I do want to infer the presence of that species of flower within some radius and time-frame associated with the Event where the bee was extracted from nature.

Yes. We have an Occurrence of a bee and an Occurrence of pollen. Both are tied to the same Material Sample, through the Observations that are the collecting of this sample and/or the study of it.

As with the case of the cougar blood collected in the stream, does that mean that the plant Organism from which the pollen was collected was simultaneously out in the field where the flower is and also on the bee tens of meters away (i.e., two separate Events at the same time)? Or would I create a separate Event (with larger coordinateUncertaintyInMeters) to represent the likely place/time where the flower was when the bee gathered the pollen? This gets right to the heart of my question about when an Organism becomes a MaterialSample. In my current thinking, the plant Organism was simultaneously both where the flower was and where the pollen was at the time the bee was collected (in the same way that the cougar, as an Organism, was on the river bank eating a fish and was also present downstream as blood when the water sample was collected). But maybe that's the wrong way to look at it?

It depends on how you define a single Organism (pollen vs flower) and how well you can disambiguate Occurring Organisms with the data you possess. Regardless of how we decide to model it, it will always be tricky to assess whether different Observations were made of the same, single Organism.

deepreef commented 3 years ago

@matdillen :

Why is this inference not an observation?

Geez... now you've forced me to actually think about what I think about this! :) OK, short answer, I guess, is "Because that's how I've always defined the term 'observation' in my own mind." I don't think physicists (as humans) have ever observed subatomic particles; they just infer their existence from lots of data. Most of those data come from what I guess many people would classify as MachineObservation -- which of course is still "observation", so I still don't have a good counterpoint to your main point here. More showers/traffic jams/lying awake at night needed for me here, I think.

It depends on how you define a single Organism (pollen vs flower) and how well you can disambiguate Occurring Organisms with the data you possess.

Yes! Exactly! This gets right to the heart of my struggles with the boundary between Organism and MaterialSample; and the "scope" of Organism -- both in terms of lifespan, and in terms of whether the pollen/blood is within scope of the Organism instance of the flower/cougar, or represents a distinct Organism(???), or represents some sort of derivative of the Organism (in a MaterialSample sense???). These may be edge cases in the universe of biodiversity informatics, but they do matter (and probably will matter increasingly going forward).

Regardless of how we decide to model it, it will always be tricky to assess whether different Observations were made of the same, single Organism.

Again, 100% agreement. Maybe there is no solution, but I still feel like we can improve our collective understanding of this stuff. Perhaps we can at least come to some consensus on the limits of where consensus can be achieved.

Jegelewicz commented 3 years ago

It depends on how you define a single Organism (pollen vs flower) and how well you can disambiguate Occurring Organisms with the data you possess.

Yes! Exactly! This gets right to the heart of my struggles with the boundary between Organism and MaterialSample; and the "scope" of Organism -- both in terms of lifespan, and in terms of whether the pollen/blood is within scope of the Organism instance of the flower/cougar, or represents a distinct Organism(???), or represents some sort of derivative of the Organism (in a MaterialSample sense???). These may be edge cases in the universe of biodiversity informatics, but they do matter (and probably will matter increasingly going forward).

I agree that we need a definition of organism. https://en.wikipedia.org/wiki/Organism might be a place to start

In biology, an organism (from Greek: ὀργανισμός, organismos) is an entity capable of carrying on life functions.

I would say that for the cougar example, the "things" that are cataloged (blood in the water, image of cougar) are evidence of an occurrence of an organism. Maybe they are the same organism, maybe they are not. As for the pollen, under the definition above, I would say it is a material sample or part of an organism. Unless the pollen can obtain nutrients, create waste products and reproduce on its own, it doesn't fit the definition above. Other definitions may be better, but we have to start somewhere.

Regardless of how we decide to model it, it will always be tricky to assess whether different Observations were made of the same, single Organism.

Again, 100% agreement. Maybe there is no solution, but I still feel like we can improve our collective understanding of this stuff. Perhaps we can at least come to some consensus on the limits of where consensus can be achieved.

Not always! Periodic sampling of blood from wolves in a breeding program or zoo animals makes it much more certain that you are sampling a single, individual organism (human error notwithstanding).

deepreef commented 3 years ago

I agree that we need a definition of organism. https://en.wikipedia.org/wiki/Organism might be a place to start

I would prefer to start here: "A particular organism or defined group of organisms considered to be taxonomically homogeneous."

I would say that for the cougar example, the "things" that are cataloged (blood in the water, image of cougar) are evidence of an occurrence of an organism. Maybe they are the same organism, maybe they are not. As for the pollen, under the definition above, I would say it is a material sample or part of an organism.

I agree, but are there any cases where we want to track parts of Organisms that are not cataloged/curated (e.g., in nature)? I guess more generally, do MaterialSample instances participate in Occurrences only via a representation of an Organism instance? For example, we mint a new organismID for each taxonomic entity identified within a water sample, and then have that Organism instance participate in the Occurrence associated with the water sample collecting Event (even if the presence of the Organism at the Event was only some DNA material in the water)? That, to me, seems like the best balance of "ideal data model" and "practical implementation", but I may not be the best person to judge that balance.
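To make the water sample idea concrete, here is a minimal sketch, assuming that one Organism (and one Occurrence) is minted per distinct taxon detected in the sample. The dictionary keys reuse DwC term names (`organismID`, `occurrenceID`, `eventID`, `materialSampleID`, `scientificName`); the function name, the UUID-based identifier scheme and the input values are assumptions for the example only.

```python
import uuid


def occurrences_from_sample(material_sample_id, event_id, detected_taxa):
    """Mint one Organism and one Occurrence per distinct taxon found in a sample.

    Every Occurrence points at the collecting Event of the sample and keeps
    the sample itself as the evidence for the Organism's presence.
    """
    records = []
    for taxon in sorted(set(detected_taxa)):
        records.append({
            "organismID": f"urn:uuid:{uuid.uuid4()}",    # newly minted per taxon
            "occurrenceID": f"urn:uuid:{uuid.uuid4()}",
            "eventID": event_id,
            "materialSampleID": material_sample_id,
            "scientificName": taxon,
        })
    return records


# Hypothetical identifiers and detections, including one duplicate detection.
records = occurrences_from_sample(
    "ms-water-1",
    "evt-water-1",
    ["Puma concolor", "Cetorhinus maximus", "Puma concolor"],
)
```

Whether two DNA detections of the same taxon really belong to one Organism instance is exactly the open question in this thread; the sketch just collapses them, in line with Organism being allowed to cover a taxonomically homogeneous group.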

Jegelewicz commented 3 years ago

do MaterialSample instances participate in Occurrences only via a representation of an Organism instance?

I was just going to add that ALL of the "things" we have in collections are MaterialSample(s) of Organisms - we NEVER have the whole thing because Organisms have a life over time and capturing the entirety of that is not possible for mere humans.