Closed Jegelewicz closed 1 year ago
Second side note, I agree this is where the problem rears:
I believe that one important source of problems comes from the requirement to shoehorn all these things (MaterialSample, PreservedSpecimen, HumanObservation, etc) into an Occurrence to enable publication in GBIF.
As has been noted elsewhere, the other problem is "what is an identifier"? When we multiply these two problems, the combinatorics of the issues (A = shoehorn × B = what kind of identifier) get ugly.
One thing that might, maybe, help, is to have an ontology of Identifier types, and to indicate what identifier type your identifier is. We do this in TaxonWorks, but we don't have the field in DwC to express this (I think). This would allow aggregators or others to act accordingly. Is this a "global" identifier, then I can build some functionality with certain assumptions. Is this a "local" identifier, well I better be more cautious. Is this a "physical" identifier (e.g. paper label on a specimen), I can infer some other things? Is this a "digital" identifier, oh, then I shouldn't look for it on a piece of paper (or should I?). If this is developed, we might mitigate, slightly, the issues at hand?
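To make the idea of typed identifiers concrete, here is a minimal sketch. The four categories come straight from the comment above; the class names, field names, and the trust rule are hypothetical and are not part of TaxonWorks or of any DwC vocabulary.

```python
from dataclasses import dataclass
from enum import Enum


class IdentifierType(Enum):
    # Hypothetical categories taken from the discussion; not a DwC vocabulary.
    GLOBAL = "global"
    LOCAL = "local"
    PHYSICAL = "physical"
    DIGITAL = "digital"


@dataclass(frozen=True)
class TypedIdentifier:
    value: str
    kind: IdentifierType


def can_assume_globally_unique(identifier: TypedIdentifier) -> bool:
    """An aggregator might only build on identifiers declared as global."""
    return identifier.kind is IdentifierType.GLOBAL


identifiers = [
    TypedIdentifier("urn:uuid:123e4567-e89b-12d3-a456-426614174000", IdentifierType.GLOBAL),
    TypedIdentifier("FISH-00042", IdentifierType.LOCAL),
]

# Keep only the identifiers an aggregator could treat as globally unique.
trusted = [i.value for i in identifiers if can_assume_globally_unique(i)]
```

An aggregator could branch on `kind` in the same way for the "physical" and "digital" cases; what it should do in each branch is exactly the open question.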
Wow. I had a busy day yesterday and could tell from the flood of notifications that this conversation was going on. I have many comments/responses to what has been said in this thread, but don't have time to write them all given that this is the start of another work day for me that is mostly unrelated to TDWG business. But I will record a few thoughts.
Despite the disagreement over minutiae in this thread, it is really exciting to me that there is such a high level of agreement on how people are viewing the relationships among biodiversity-related entities. To put this in historical perspective, my interest in this topic dates back to 2010, when I published this paper making the case (somewhat heretical at the time) that organisms should have a place in the biodiversity knowledge graph and that we should be linking derived resources to them. Soon after that, we had the marathon tdwg-content email thread summarized here that reminds me a lot of this conversation. Fortunately, this conversation will be better documented, thanks to GitHub.
That conversation had three main consequences:
- dwc:Organism class.
- dwctype: namespace and revising the previous (somewhat circular) class definitions into their current definitions (as of the 2014-12-23 version of the dwc: Darwin Core term list).
I feel that the Darwin-SW graph model is a pretty good starting point at representing the relationships between classes. I've found it to be pretty congruent with other models as diverse as the 1993 ASC model and the ABCD ontology model. Two features that Darwin-SW includes that aren't as explicitly included in those models are generic, transitive hasDerivative/derivedFrom properties and hasEvidence/evidenceFor properties. That is to a large extent what is at the core of this current discussion.
If we want this discussion to result in real action towards solving the problems raised here, we need someone with sufficient time, commitment, organizational skills, and stamina to scope and convene a task group, badger important stakeholders to participate, organize regular calls, and keep good records. Without that, this will just be another stimulating conversation. I don't have the bandwidth right now to be that person, but I would happily participate as a core member.
I recall a session at the TDWG 2013 conference (I think it was the report-out of the VoMaG group) where the topic of creating a "hasEvidence" property came up. I really thought there might have been the impetus for it to happen then, but it fizzled out. At that point, I don't think that people were taking seriously the idea of Linked Data as a real thing and I think most people thought that the existing system worked well enough to handle data about preserved specimens, which were clearly at the center of the biodiversity informatics universe at that time. However, since that time there have been a number of serious attempts to link data (if not actually to use "Linked Data", i.e. RDF). There has also been a proliferation of derived resources (tissues, DNA, sequences), camera trap images, machine observations, iNaturalist and eBird observations, etc. that have made it clear that museum specimens do not have to be the center of the biodiversity universe. So I think the time may be ripe to try again for a Darwin Core "hasEvidence" or "isEvidenceFor" property as well as some more standardized way to indicate the relationship among derived resources.
If we want this discussion to result in real action towards solving the problems raised here, we need someone with sufficient time, commitment, organizational skills, and stamina to scope and convene a task group, badger important stakeholders to participate, organize regular calls, and keep good records. Without that, this will just be another stimulating conversation.
This. I don't think I have the bandwidth right now either, but I would definitely participate and help as much as my bandwidth allows.
One thing that I feel is missing from all of this is - what gets a catalog number? We struggle with this every day as some of the comments above demonstrate. It is this that made me bring up the issue in the first place. Collection managers and curators have been numbering stuff for centuries, but do those numbers function for the purposes of today's science? If not, what schema would be better? Philosophical discussions are indeed stimulating, but we also need concrete methods for others to follow. Let's not lose sight of that!
Before I put this aside for the day, I wanted to make an additional comment about the mechanism for documenting the relationships between resources. I think that the current "fixes": associatedMedia, associatedOccurrences, associatedWhatever are all just Band-Aids that we are using to fix a gaping wound. We are forced to use them because we are stuck with cramming normalized relationships into flat spreadsheets due to the limitations of the star schema system required by DwC-Archives.
I see the current efforts to "fix" the ResourceRelationship class as going a long way towards correcting this deficiency if we can figure out how to use it effectively and in a standardized way. One reason why I'm excited about this "fix" is that it seems possible to define a process by which ResourceRelationship spreadsheet data could be transformed into bona fide Linked Data (preferably in JSON-LD) that could then be pushed into a triple store and queried in an efficient way. That is, of course, contingent on people actually being able to mint and track IRI identifiers for things, which is a difficult nut to crack.
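As a rough illustration of that transformation, the sketch below turns one ResourceRelationship spreadsheet row into a JSON-LD object using only the standard library. The column names are the Darwin Core ResourceRelationship terms; the example.org IRIs, the row values, and the choice to treat `resourceID` and `relatedResourceID` values as IRIs are assumptions made for illustration, and a real mapping would need to follow whatever conventions the community settles on.

```python
import json

# Darwin Core terms namespace.
DWC = "http://rs.tdwg.org/dwc/terms/"

# One spreadsheet row; all values are made-up example data.
row = {
    "resourceRelationshipID": "http://example.org/relationship/1",
    "resourceID": "http://example.org/tissue/1",
    "relationshipOfResource": "derived from",
    "relatedResourceID": "http://example.org/specimen/1",
}


def row_to_jsonld(record: dict) -> dict:
    """Map a flat ResourceRelationship row to a JSON-LD node."""
    return {
        "@context": {"dwc": DWC},
        "@id": record["resourceRelationshipID"],
        "@type": "dwc:ResourceRelationship",
        # Assumption: the ID columns hold IRIs, so they become node references.
        "dwc:resourceID": {"@id": record["resourceID"]},
        "dwc:relatedResourceID": {"@id": record["relatedResourceID"]},
        "dwc:relationshipOfResource": record["relationshipOfResource"],
    }


doc = row_to_jsonld(row)
serialized = json.dumps(doc, indent=2)
```

The resulting string could be loaded by any JSON-LD-aware tool; the hard part remains minting and tracking the identifiers that go into the `@id` slots.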
Given the existence of actual Linked Data (i.e. RDF) representations of the ResourceRelationship relationships, it would then be possible to perform a "dumb-down" operation that would replace the ResourceRelationship instances with a single linking property that would directly connect the resources involved. A model of this kind of process can be found in the SKOS model for handling labels. Section 4.3 of the SKOS Primer describes a process by which label instances (described in SKOS-XL) that can have their own provenance and metadata can be transformed into simple SKOS property links (skos:prefLabel, skos:altLabel, skos:hiddenLabel) that can be used to directly link concepts to their labels. (You can see this process in action by examining any of the Getty Thesaurus of Geographic Names item RDF dumps, for example this one.) The analogous situation for us would be to collapse ResourceRelationship instances with their own provenance and metadata into simple linking properties like isDerivedFrom or hasDerivative. As @camwebb and I describe in Section 3.3.2 of our paper, it then becomes a trivial query to discover all derived resources using the * SPARQL property path operator, if the hasDerivative property is transitive.
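The "dumb-down" step and the transitive query can be sketched without a triple store. The Python below is a stand-in for the SPARQL property path, not the query from the paper: it collapses relationship records into direct hasDerivative links and then walks them. The record layout and the identifiers are hypothetical.

```python
from collections import defaultdict

# (subject, relationship, object) records standing in for collapsed
# ResourceRelationship instances; identifiers are made up.
relationships = [
    ("specimen:1", "hasDerivative", "tissue:1"),
    ("tissue:1", "hasDerivative", "dna:1"),
    ("dna:1", "hasDerivative", "sequence:1"),
    ("specimen:1", "hasEvidence", "image:1"),
]


def build_links(records, predicate="hasDerivative"):
    """Keep only the chosen predicate as direct subject -> objects links."""
    links = defaultdict(set)
    for subject, relationship, obj in records:
        if relationship == predicate:
            links[subject].add(obj)
    return links


def all_derivatives(links, start):
    """Follow links transitively; the `seen` set also guards against cycles."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in links.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


links = build_links(relationships)
derived = all_derivatives(links, "specimen:1")
```

Unlike the `*` operator, which also matches the zero-length path, this function leaves the starting resource out of the result (closer to `+`).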
@deepreef will you be sharing your thoughts and processes?
Over the next 4 months, I will be updating the core data model behind our collections data, and one of the specific issues that our CMs need to "fix" is the way we track physical objects in our collections -- i.e., as instances of MaterialSample.
Because we are all struggling with this....
@Jegelewicz I would be interested in aligning with your process but for TaxonWorks. We'll be extending as well (FieldOccurrence), and we just added Extract classes. Maybe a simple toy ontology of classes would help in this regard.
@baskaufs
That is, of course, contingent on people actually being able to mint and track IRI identifiers for things, which is a difficult nut to crack.
It might be getting closer. I think we can handle much of this in TaxonWorks (though I have some doubts about the re-ification process), this because we can stack as many identifiers as needed, including the requisite UUIDs on our instances. If Arctos is going through the same machinations we might have targets from multiple real applications to play with (very) soon.
@mjy Cool! Real data and real applications are always good.
On the subject of UUIDs, in the imaginary process I described of turning ResourceRelationship relationships into bona fide Linked Data, it would not necessarily be required that the identifiers used for the resource relationship IDs be HTTP IRIs. If they were UUIDs, as a part of the mapping/transformation process one could just slap "urn:uuid:" in front of them in accordance with RFC 4122 and voila! they would be valid to use in RDF triples. They would not be dereferenceable, but who cares? In the process I described, they would just be dumped into a graph database for querying and not really exposed to the web anyway. That should make @deepreef happy, since he has traditionally had issues with requiring (potentially non-persistent) HTTP IRIs as globally unique identifiers.
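A minimal sketch of that prefixing step, using Python's standard `uuid` module. The helper name and the sample UUID are made up; the module validates the string and produces the RFC 4122 URN form.

```python
import uuid


def uuid_to_urn(raw: str) -> str:
    """Validate a UUID string and return its urn:uuid: form.

    Raises ValueError if `raw` is not a well-formed UUID.
    """
    return uuid.UUID(raw).urn


# Made-up example value; the result is normalised to lowercase.
urn = uuid_to_urn("123E4567-E89B-12D3-A456-426614174000")
```

Going through `uuid.UUID` rather than plain string concatenation means malformed values are rejected before they reach the graph.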
It is interesting to observe that people en masse try to provide their specimens data through a standard for occurrences (DwC) while TDWG actually has a standard for biological collections data (ABCD). Not that this would solve all issues though. As @dagendresen mentioned already, many problems come from the requirement to shoehorn all things (MaterialSample, PreservedSpecimen, HumanObservation, etc) into an Occurrence. The Digital Extended Specimen concept and openDS specification under construction seem to provide an easy solution to some of these problems by having different objects for e.g. specimens, multimedia, measurements, identifications, gathering events, each with their own PID and link these. This separation in classes seems also the direction GBIF want to take in the next few years, and has its roots in the earlier idea to create a TDWG ontology and the vision of Donald Hobern.
Having a standard and implementing the standard are two very different things (that need to come much closer together IMO). For example, nowhere do I see ABCD here https://www.gbif.org/dataset-classes, why? In some ways raising ABCD and future standards together illustrates exactly how the frustrations here emerged, I suspect. If ABCD addresses needs why is it not more ubiquitous? Unless the proposed standards that are upcoming work much closer with the development of the applications/APIs that will use them I see nothing but similar problems coming with them as well.
@baskaufs : THANK YOU for jumping in! I went a bit nuts yesterday and got too frothy in the mouth with my evangelism, but you very nicely brought it back to a practical trajectory (as you always do!)
If we want this discussion to result in real action towards solving the problems raised here, we need someone with sufficient time, commitment, organizational skills, and stamina to scope and convene a task group, badger important stakeholders to participate, organize regular calls, and keep good records. Without that, this will just be another stimulating conversation. I don't have the bandwidth right now to be that person, but I would happily participate as a core member.
Yeah, same here. Obviously, count me among the enthusiastic participants. I may be a little too close to things to take on the prime role. Besides, I am woefully inadequate in the "organizational skills" department. However, if no one else steps up to lead this effort, I would be willing to take it on, starting in a month or so from now.
@Jegelewicz :
will you be sharing your thoughts and processes?
Yes, absolutely! Is this the right place to do it? Most of the "heat" centers on MaterialSample, so there would be some logic to continuing this discussion under the banner of this issue. Or, perhaps if a task group comes to fruition, that would be the better forum of discussion.
One thing that I feel is missing from all of this is - what gets a catalog number?
Yup, we're struggling with this too. From the perspective of most of our CMs, it's not a "thing" without some human-friendly number slapped on to it. Long ago I came to realize that a catalog number should be treated just like any other property of the "thing", not the "thing" itself. For all kinds of reasons, catalog numbers make for bad primary keys on data tables, and even worse as persistent identifiers. In the case of specimens, they make a lot of sense as useful tags because there's no other easy way to refer to a specimen object semi-uniquely in a human friendly way (e.g., "The fish identified as Aus bus collected by John Smith in the Maldives in October of 1975"). (As an aside, this is analogous to what scientific names of organisms were like before Linnaeus came along and gave us a much more convenient/consistent system of labelling taxa.) So it's not that I think catalog numbers are a "bad" thing -- I think they're great! I just think their utility as unique identifiers is limited, and we shouldn't slap them on things "just because". But this is one of the areas we'll be exploring in the coming months as we forge ahead with our data remodelling effort.
before Linnaeus came along and gave us a much more convenient/consistent system of labelling taxa
ROFL. Do you work with taxonomy?
I just think their utility as unique identifiers is limited, and we shouldn't slap them on things "just because"
I sort of agree, but sometimes it is the thing exposed "just because" someone slapped a catalog number on it that leads to really interesting research...
That should make @deepreef happy, since he has traditionally had issues with requiring (potentially non-persistent) HTTP IRIs as globally unique identifiers.
Yeah.... so, if you think my posts on MaterialSample are too long, you don't want to get me started on identifiers...
But yeah -- as much as I understand and sympathize with the TBL LOD idea of committing to HTTP IRIs as the common identifier (largely because they are inherently "actionable"), the fundamental concern I have is that they combine dereferencing metadata and identification in the same string. There are lots of reasons why this is (or at least often can be) a "fragile" state of affairs. I won't dive into this here, but if anyone is interested, most of what I wrote here still represents my current thinking.
In any case, I agree with @mjy and @baskaufs (and others) that identifiers are lurking behind these discussions, because they represent the proxies of the conceptual objects we're deliberating here. The first and most important step in minting an identifier for something, is understanding what that "something" actually "is". I think @mjy nailed it with his earlier post about the need to be careful about deprecating classes rather than "changing" their meaning. I think the fundamental problem we have with DwC is that we don't have a clear enough understanding of what each of the main classes means to even know if we're changing them. So perhaps the first step is to lock down more robust definitions. Occurrence is arguably the most important class in DwC, yet its current definition hinges on the definition of Organism, and per the cougar example above, we're not clear on whether the trace blood collected downstream from where the cougar ate its fishy lunch constitutes part of the organism, or merely evidence of the organism (and, hence, we're not sure how many Occurrences we need to mint to capture the information we want to capture).
ROFL. Do you work with taxonomy?
OK, I guess I set myself up for that one! :-) But, to be fair, before Linnaeus came up with his system, the "names" that naturalists used for taxa were along the lines of:
and
If you think modern taxonomy/nomenclature is difficult to capture in information systems, imagine trying to keep track of taxa in a structured way if you had to use names like those instead of genera and species. As complex as it is, the fact that the same system of scientific nomenclature has endured for more than a quarter of a millennium has to say something about its utility...
It actually isn't the system that is the problem, it is that we don't document anything well enough.....
It actually isn't the system that is the problem, it is that we don't document anything well enough.....
YES!! VERY well said!!!
BTW, in case anyone thinks I made up those pre-Linnean names, in fact they were both on the same page of the same publication.
@wouteraddink :
The Digital Extended Specimen concept and openDS specification under construction seem to provide an easy solution to some of these problems by having different objects for e.g. specimens, multimedia, measurements, identifications, gathering events, each with their own PID and link these.
Is the DiSSCo GitHub the best place to participate in that discussion? Or is there another forum or email list or something where the main discussion is happening?
@deepreef yes, on https://github.com/DiSSCo/openDS for participation in openDS discussion (still in early development). At the core of openDS is MIDS (minimum information about a digital specimen, being discussed here: https://github.com/tdwg/mids), and the Digital Extended Specimen concept (convergence between digital and extended specimen concepts) has been discussed in the global consultation: https://discourse.gbif.org/t/converging-digital-specimens-and-extended-specimens-towards-a-global-specification-for-data-integration/2394 and is also discussed in regular meetings organised by BCON with participation of DiSSCo, iDigBio, GBIF.
- Related to this, was the whole bird in the freezer an instance of MaterialSample, serving as a "parent" of the three derived MaterialSample instances (Skin, Tissue, Skeleton)? (perhaps suggesting the need for a new term parentMaterialSampleID?)
parentMaterialSampleID has been suggested to GBIF for both the splitting of, say, a bird (skin, bones, etc.) but also for subdivision of environmental samples (soil, water, gut content).
From a CM point of view... Within my CMS I deal with clusters of fossils on a rock slab and other forms of multiple specimens on or attached to a single "holder" (microscope or micropaleo slides, i.e. forams, ostracods, conodonts, etc.). They are "parent" objects that receive a UUID but no catalog number. The children are catalogued specimens (each with a UUID). The parent is a "loanable object": you cannot loan a single specimen from the parent without all of its children. The "slab of rock" has a UUID because it has its own characteristics that can be recorded in ABCD-EFG extensions (geochemical, physical properties) which relate back to each child. It gets complex; this is a huge rabbit hole. I also record derivative specimens (coal ball peels, serial thin sections through a coral) in similar ways, which is what I am working on now. These are also parent/child relationships similar to other derivatives (histological or skeletal preps), but probably need a separate use case, i.e. they can be loaned separately from the parent.
Thanks, @RogerBurkhalter -- we have very similar situations (both parent aggregate instances of MaterialSample, and child derived instances).
Perhaps it's time to submit a new issue proposing a new term parentMaterialSampleID within the MaterialSample class?
Edit: Note: the link provided by @thomasstjerne to the discussion on GBIF, where @timrobertson100 suggests proposing this term within DwC (which I strongly support, and will submit unless someone else would prefer to submit it).
Since we're talking about MaterialSample as a hierarchy, it feels like the right time to toss another "grenade" (firecracker?) into this discussion. OK, that's overly dramatic: more like a practical question to see how others deal with the problem I'm about to describe.
Some of our collections assign catalog numbers to whole specimens 1:1 (one number for one specimen), whereas others assign catalog numbers to "lots". With a hierarchical MaterialSample, this is pretty easy to deal with, because the multiple specimens in a lot (each representing a separate instance of MaterialSample) can link (via parentMaterialSampleID) to another instance of MaterialSample that represents the lot. The catalog number can be attached to the "lot" instance, and the specimens then inherit the catalog number.
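A minimal sketch of that inheritance, assuming a hypothetical in-memory model: `parent_material_sample_id` mirrors the proposed parentMaterialSampleID term, and the identifiers and catalog number are made up. A specimen without its own catalog number takes the nearest one found up the parent chain.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MaterialSample:
    material_sample_id: str
    parent_material_sample_id: Optional[str] = None  # mirrors the proposed term
    catalog_number: Optional[str] = None


def effective_catalog_number(
    sample_id: str, samples: Dict[str, MaterialSample]
) -> Optional[str]:
    """Return the sample's own catalog number, or the nearest ancestor's."""
    visited = set()
    current = samples.get(sample_id)
    while current is not None and current.material_sample_id not in visited:
        if current.catalog_number:
            return current.catalog_number
        visited.add(current.material_sample_id)  # guard against cyclic links
        parent_id = current.parent_material_sample_id
        current = samples.get(parent_id) if parent_id else None
    return None


# A lot carrying the catalog number, with two uncataloged child specimens.
samples = {
    s.material_sample_id: s
    for s in [
        MaterialSample("lot-1", catalog_number="FISH-00042"),
        MaterialSample("fish-1", parent_material_sample_id="lot-1"),
        MaterialSample("fish-2", parent_material_sample_id="lot-1"),
    ]
}

inherited = effective_catalog_number("fish-1", samples)
```

The same lookup works unchanged whether the number sits on the lot or directly on a specimen, which is part of why the single-specimen-lot case is a modelling choice rather than a technical constraint.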
Where things get weird (for me, at least) is how to deal with all the lots in our lot-based collections that consist of only a single whole specimen. Specifically, should we assign the catalog number to instances of MaterialSample where there is only one specimen in a lot to the "specimen", and assign the catalog number to the lot in cases of multi-specimen lots? Or, do we normalize on assigning catalog numbers to "lots", and generate two MaterialSample instances for each single-specimen lot (one representing the lot, and another representing the single child specimen)?
Of course, this question presumes that we assign some sort of "type" to each MaterialSample instance (perhaps we need to propose another term for materialSampleType?) We do this, but maybe that's an artificial classification that isn't really needed. If we don't have a materialSampleType property, then we obviously would not want to generate two separate MaterialSample instances. However, I have to believe that people will want to be able to distinguish "lots" from "whole organisms" from "organism parts", from "tissue samples" (etc.). Or, maybe that information is best captured in preparations?
My question to those following this thread/issue is: How do you deal with MaterialSample instances when you have lot-based collections, in terms of managing lots consisting of a single specimen?
I hope this makes at least some sense...
Note: I see that @timrobertson100 has encouraged me to propose parentMaterialSampleID (which I will do tomorrow unless someone else wants to). Would this group also support proposing materialSampleType?
We have both of these scenarios in our collections: "should we assign the catalog number to instances of MaterialSample where there is only one specimen in a lot to the "specimen", and assign the catalog number to the lot in cases of multi-specimen lots? Or, do we normalize on assigning catalog numbers to "lots", and generate two MaterialSample instances for each single-specimen lot (one representing the lot, and another representing the single child specimen)?"
In our fish collection, the catalog number is assigned to the lot, and each fish within the lot is a part (= MaterialSample?) of the lot. It can be difficult to then track child samples (tissues, for example) of each fish in the lot, and derivative DNA sequences of each fish, back to the source individual organism as a subcomponent of the lot. However, in our genomics collection, a single fish from a lot is split out and given a separate catalog number, linked to the original lot by a "same lot as" relationship to a cataloged lot URL. The single cataloged fish then has multiple tissue types (= MaterialSamples), e.g. separate vials with fin clip, muscle sample etc., associated with the single catalog item (= specimen = organism in this context). Then each part of the fish (= MaterialSample) can be subsampled for loans, creating child material samples, which then link to sequence data, publications, etc. This is more manageable. If we have both these scenarios, then other collections will also. We need the flexibility of working with either.
We (TaxonWorks) use something that more or less maps 1:1 to materialSampleType that in part reflects the nature of the enumeration (count, as asserted by the curator) of the number of whole organisms, so we would use it. In our case our types are Specimen (count = 1), Lot (count > 1), and RangedLot (count is curator definable into categories with min/max). We have various physical (or once physical) entities (e.g. Extract, Sequence) that can be derived from each other and from these classes; these types would also be mappable to materialSampleType.
Our OriginRelationship lets us define the parentMaterialSampleID in a generic way, so we could use that too. [All terms sensu the definitions in TaxonWorks, perhaps not as generally used.]
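The count-based typing described above can be sketched as a small function. This is not TaxonWorks code: the function name, the parameters, and the validation rules are hypothetical; only the three type names and their count conditions come from the comment.

```python
from typing import Optional, Tuple


def classify_by_count(
    count: Optional[int] = None,
    count_range: Optional[Tuple[int, int]] = None,
) -> str:
    """Pick a type name from a curator-asserted count or a (min, max) range."""
    if count_range is not None:
        low, high = count_range
        if low < 0 or high < low:
            raise ValueError("range must satisfy 0 <= min <= max")
        return "RangedLot"
    if count == 1:
        return "Specimen"
    if count is not None and count > 1:
        return "Lot"
    raise ValueError("need a positive count or a (min, max) range")


# Values that could populate a materialSampleType-like field.
types = [
    classify_by_count(count=1),
    classify_by_count(count=12),
    classify_by_count(count_range=(10, 50)),
]
```

Under this rule a single-specimen lot would be typed "Specimen", so the type alone does not say whether the catalog number was assigned to a lot or to an individual.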
In my CMS (SQL Server, custom), we now use Lots rarely; when I started 23 years ago much of the collection was cataloged as Lots. I now use that for bulk samples, residues, or otherwise objects that lack a determination. These are for internal use and are not currently shared via the IPT, so have not been mapped. The ParentSample (MaterialSample) has several use cases defined by fixed vocabulary for each of three groupings: Single object, Natural groups or Derived groups.
Single objects are simply a single specimen (UUID the same as the specimen).
Natural groups include clusters (fossil-bearing rock rich in abundance, with a UUID); the fixed vocabulary terms include:
- Death assemblage
- Reefs
- Transport/deposition, taphonomic clusters
- Condensed bed
- Coquina or bonebed
- Coal balls
- multiple fossils in Amber
- Parts and counterparts
- Articulated vertebrate remains
- Epibionts
Derived groups (usually based on preparations, with a UUID) include (fixed vocabulary terms):
- Palynomorph slides
- Diatom slides
- SEM stubs with:
  - Multiples of a Single taxon from a single Locality
  - Multiples of a Single taxon from multiple Localities
  - Multiple taxa from a single Locality
  - Single Locality from a single Locality
- Microfossil cavity slides or gridded cavity slides with:
  - Multiples of a Single taxon from a single Locality
  - Multiples of a Single taxon from multiple Localities
  - Multiple taxa from a single Locality
  - Multiple taxa from multiple Localities
- Coal Ball Peels
- Microfossil thin sections
- Serial thin sections of an individual fossil
These are most of the combinations we have come up with, for now. I participated in an iDigBio Paleo Digitization Happy Hour last summer where these terms were put forth. We used the term "Artificial" instead of derived, but derived is a much better term. Lately, I have been looking at who (or what) to attribute identifications to when those are made by machine via deep learning AI/CNN. So much to do.
I need to mention that specimens in the Natural groups receive individual catalog numbers and UUID's, where possible (specimens may be stacked or poorly exposed), while many of the derived group microfossil specimens also receive individual catalog numbers, palynomorphs may not. Serial sections have been cataloged with the Parent catalog number as they represent one individual, appended with a letter or decimal number.
Related issues are Issue #1, Issue #3, Issue #24 (reopened because of renewed interest), Issue #332, Issue #344, Issue #345, Issue #346, and Issue #347.
At the risk of yet more cans of worms (though it was raised to an extent by @mjy), where then do we associate identifications (and their histories) in this discussion of MaterialSample vs Occurrence? Can both of these have associated identifications? If so, then the way we construct Darwin Core Archives may unravel.
Take for example a pinned bee as a MaterialSample with pollen in its corbiculae. If you scrape the pollen off and mount it on a slide, you now have two MaterialSamples, each with different identifications (and subsequent identification histories). We could argue that there is still the one Occurrence, but we might also argue that there were two (or more) – the bee was the collector of the pollen. Nonetheless, there are divergent determinations for different parts of the MaterialSamples and we're incapacitated by the star schema of the Darwin Core archive.
@dshorthouse :
At the risk of yet more cans of worms (though it was raised to an extent by @mjy), where then do we associate identifications (and their histories) in this discussion of MaterialSample vs Occurrence? Can both of these have associated identifications? If so, then the way we construct Darwin Core Archives may unravel.
In my mind, the only DwC class to which Identifications should apply is Organism. Alas, this is not something that most CMS systems accommodate, but at least logically that's the "thing" to which an assertion about taxonomic identity applies.
I suppose that as long as we have this simple/flat way of sharing all data as instances of Occurrence, and in most cases the ratio of Occurrence:MaterialSample:Organism is 1:1:1, then it doesn't matter from an implementation perspective. So, as with the MeasurementOrFact or ResourceRelationship classes in DwC, perhaps the short-term solution is that the subject of an Identification instance can be an instance of any one of several different DwC classes (Organism, MaterialSample, Occurrence)?
But for implementation builders, I would strongly encourage that `Identification` instances link directly to `Organism` instances, as represented in the DSW graph.
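A rough sketch of the short-term option, where the subject of an `Identification` may be any one of several classes. All class and field names here are hypothetical stand-ins rather than Darwin Core term definitions, and the names and identifiers in the example are invented for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Organism:
    organism_id: str


@dataclass(frozen=True)
class MaterialSample:
    material_sample_id: str


@dataclass(frozen=True)
class Occurrence:
    occurrence_id: str


# Short-term option: the subject may be any one of these classes.
IdentificationSubject = Union[Organism, MaterialSample, Occurrence]


@dataclass(frozen=True)
class Identification:
    subject: IdentificationSubject
    scientific_name: str


def identifications_about(subject, identifications):
    """Return the identification history recorded against one subject."""
    return [i for i in identifications if i.subject == subject]


bee = Organism("org-bee")
legacy_record = Occurrence("occ-1")
history = [
    Identification(bee, "Apidae"),               # earlier determination (illustrative)
    Identification(bee, "Bombus sp."),           # later determination (illustrative)
    Identification(legacy_record, "Bombus sp."),  # flat record that only knows Occurrence
]
bee_history = identifications_about(bee, history)
```

Linking to `Organism` where possible keeps the determination history in one place, while the union type still admits flat records that only have an `Occurrence` to hang an identification on.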
Take for example a pinned bee as a MaterialSample with pollen in its corbiculae. If you scrape the pollen off and mount it on a slide, you now have two MaterialSamples, each with different identifications (and subsequent identification histories). We could argue that there is still the one Occurrence, but we might also argue that there were two (or more) – the bee was the collector of the pollen. Nonetheless, there are divergent determinations for different parts of the MaterialSamples and we're incapacitated by the star schema of the Darwin Core archive.
So... the "right" way to handle this, I think, is that the bee and the pollen represent two different `Organism` instances, each with their own taxonomic identity. That means two separate `Occurrence` instances as well. But a `MaterialSample` can consist of multiple organisms/taxa, so the bee+pollen could be one `MaterialSample` as an aggregate of the bee+pollen+any other parasites/symbionts that the bee happens to carry with it.
And yeah, I'd definitely be down with crediting the bee as the collector of the pollen!
@deepreef I agree with the bee and the pollen as different organism instances, along with all the bee's multiple ecto- and endoparasites and viruses, each with their own taxonomic identity. But I'm still confused by how Occurrence and MaterialSample are applied in this example. If Occurrence is place+time+organism, I guess it makes sense to have two occurrences. But the place+time is shared by the bee and the pollen (and parasites)- this is a very important piece of data that seems to get lost in your mapping. In our system, this linking place+time is the collecting event. I'm still learning the dwc terminology - how would that be captured?
As for the MaterialSample, I can see the bee+pollen+parasites being mapped to a single MaterialSample while the bee is on a pin in an insect collection. But the minute someone scrapes off the pollen, puts it on a slide, gives it an ID and perhaps a new catalog number, and puts the slide in a slide box, that becomes another MaterialSample, correct? Ditto for someone pulling off the mite under the bee's wing, sending it to another researcher on loan, who gives it an ID and uses it to generate a DNA sequence? So these would all be additional MaterialSamples ("child parts") of the original bee record, or they could be new MaterialSamples that are their own parent samples to further "children". Am I understanding this correctly?
All of these categories are going to split and shift over time into different categories of a tree schema. Which is why we really need to have some sort of overarching "parent", which is really the place+time+collection object, which may only initially include a single taxon but which in reality, if you include parasites and pollen and viruses which may or may not be split off and identified, includes multiple taxa. "ParentMaterialSample" seems like the wrong word. Maybe "Occurrence" is correct if it can allow for multiple taxa?
@campmlc :
But I'm still confused by how Occurrence and MaterialSample are applied in this example. If Occurrence is place+time+organism, I guess it makes sense to have two occurrences. But the place+time is shared by the bee and the pollen (and parasites)- this is a very important piece of data that seems to get lost in your mapping.
place+time = `Event`. So when I collect a bee that has pollen and three parasites, I would:
* Create one `Event` instance (place+time)
* Create five `Organism` instances (one bee, one plant/pollen, three different parasites)
* Create five `Occurrence` instances (one for each of the `Event`+`Organism` pairings)
How `MaterialSample` fits into it depends on how we precisely define the boundary between `Organism` and `MaterialSample` (see my super-long rantings above). The question I think you're asking, which is the same question I am ultimately trying to answer, is: How do we link `MaterialSample` instances to `Events`? The obvious answer is "via the relevant `Occurrence` instance(s)". But the problem is, as you note, if `Occurrence` = [`Event`] + [`Organism`], how do we actually connect a single `MaterialSample` (aggregate bee+pollen+3 parasites) to a single `Event`? At face value, it would need to pass through five `Occurrence` instances. But that seems unnecessarily cumbersome. And that is the crux of what I'm trying to wrap my head around: what is the actual relationship between `Organism` and `MaterialSample`?
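The mapping above can be sketched roughly as follows. This is only an illustration under assumed, hypothetical class and field names (none of them are Darwin Core term definitions); it shows one `Event`, five `Organism` instances, five `Occurrence` instances, and an aggregate `MaterialSample` that reaches its `Event` only by passing through all five occurrences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    place: str
    date: str


@dataclass(frozen=True)
class Organism:
    organism_id: str
    label: str


@dataclass(frozen=True)
class Occurrence:
    event: Event
    organism: Organism


@dataclass
class MaterialSample:
    sample_id: str
    organisms: list  # every Organism contained in the aggregate sample


def events_for_sample(sample, occurrences):
    """Resolve a sample's Events by passing through each Organism's Occurrence."""
    organism_ids = {o.organism_id for o in sample.organisms}
    return {occ.event for occ in occurrences if occ.organism.organism_id in organism_ids}


# Hypothetical collecting event and its contents.
event = Event("E1", "hypothetical meadow", "2020-06-01")
labels = ["bee", "pollen", "parasite 1", "parasite 2", "parasite 3"]
organisms = [Organism(f"O{i}", label) for i, label in enumerate(labels, start=1)]
occurrences = [Occurrence(event, organism) for organism in organisms]

pinned_bee = MaterialSample("MS1", organisms)
linked_events = events_for_sample(pinned_bee, occurrences)
```

The lookup works, but it has to touch five `Occurrence` records to recover one `Event`, which is the cumbersome part.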
I still feel the answer lies in treating `MaterialSample` as one of several examples of "Token", as represented in the DSW diagram I keep referring to. But this gets complicated when you have an aggregate `MaterialSample` extracted from nature in a single `Event`, but there is an unknown number of `Organism` instances represented within the `MaterialSample`.
I have some ideas on this, but more discussion is definitely needed.
As for the MaterialSample, I can see the bee+pollen+parasites being mapped to a single MaterialSample while the bee is on a pin in an insect collection. But the minute someone scrapes off the pollen, puts it on a slide, gives it an ID and perhaps a new catalog number, and puts the slide in a slide box, that becomes another MaterialSample, correct?
Yes, that's how I imagine it.
Ditto for someone pulling off the mite under the bee's wing, sending it to another researcher on loan, who gives it an ID and uses it to generate a DNA sequence? So these would all be additional MaterialSamples ("child parts") of the original bee record, or they could be new MaterialSamples that are their own parent samples to further "children".
Yes -- there can be n-number of "generations" in a `MaterialSample` parent-child lineage (i.e., fleas upon fleas upon fleas, etc.)
Which is why we really need to have some sort of overarching "parent", which is really the place+time+collection object,
Yes -- I think that part is manageable. As @dshorthouse mentioned in a related context, there is subjectivity in the edge cases for splitting up the various `MaterialSample` instances (and assigning them to a materialSampleType), but for the most part I don't see a problem with n-tier partitioning and/or aggregating. The tricky part (as discussed above) is how the `Event` data get linked to the `MaterialSample` instances.
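As a hedged illustration of the n-tier lineage, the sketch below models each `MaterialSample` with an optional parent and lets derived samples inherit the `Event` from whichever ancestor was taken directly from nature. Inheriting the event this way is just one possible answer to the linking question, not something settled in this thread, and the field names `parent` and `event_id` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaterialSample:
    sample_id: str
    parent: Optional["MaterialSample"] = None
    event_id: Optional[str] = None  # set only on the sample extracted from nature


def lineage(sample):
    """Return the sample followed by each ancestor, up to the root."""
    chain = [sample]
    while chain[-1].parent is not None:
        chain.append(chain[-1].parent)
    return chain


def inherited_event_id(sample):
    """Find the nearest event_id on the sample or any of its ancestors."""
    for ancestor in lineage(sample):
        if ancestor.event_id is not None:
            return ancestor.event_id
    return None


# Hypothetical three-generation lineage.
bee = MaterialSample("bee-on-pin", event_id="E1")
pollen_slide = MaterialSample("pollen-slide", parent=bee)
mite = MaterialSample("mite-vial", parent=bee)
mite_dna = MaterialSample("mite-dna-extract", parent=mite)
```

Here `inherited_event_id(mite_dna)` walks two generations up to the pinned bee, so any number of child samples can share the one collecting event without storing it repeatedly.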
fleas upon fleas upon fleas, etc
Ah, reminds me of campfires with my dad and his guitar....
There's a flea on the fly on the wart on the frog on the knot on the log in the hole in the bottom of the sea....
There's a flea on the fly on the wart on the frog on the knot on the log in the hole in the bottom of the sea....
Ha! I remember that one as well! I used to LOVE it as a kid (still do, but that's because I'm still a kid in most respects). It should become the anthem for `MaterialSample`.
@tucotuco What's the standard for kicking off a Task Group?
The process is outlined in the Task Groups section of the TDWG Process document. The first task is to create a charter for the group. An example of a charter for one Task Group with a Darwin Core vocabulary enhancement that has just successfully achieved its goals is that for the Chronometric Age Extension. Two more for currently active vocabulary enhancements for Darwin Core are Humboldt Core and OSR - How Did It Die?. Task Group charters are linked at the bottom of the parent Interest Group page, such as that for the Observations & Specimens Interest Group and for the Earth Sciences and Paleobiology.
A Task Group on this subject should take a serious look at the Semantic Sensor Network Ontology, and the sosa:Sample in particular.
I've been reading through this thread and it has been a lot to digest. Still, it got me thinking on what the relationship actually is between the physical biological specimens we curate and the occurrences of organisms they represent. One element that I seem to be missing in these discussions is the `Observation`.
There has been a lot of discussion about the `Event` of an `Organism` occurring at a certain time and place. This `Occurrence` is what we try to connect to our `Specimens`. But it seems to me that there is a key node in between: the `Observation` of this `Occurrence`. A physical `Specimen` can not possibly be connected to an `Occurrence` without an `Observation` taking place. A physical `Specimen` implies a record in some shape or format of this `Observation` of an `Occurrence`, be that record the whole organism dried and stuck on a sheet of paper, a blood sample of the organism or even a drawing of it. `Observations` in this sense can be made by human agents, but also by drones or automated sampling machines.
An `Observation` of an `Occurrence` does not have to coincide in space and time with that `Occurrence`. For instance, one may observe an animal footprint and deduce the occurrence of that animal earlier. Also, one may observe a fossil and deduce the occurrence of that organism a long time ago. One may observe a drowned rare bird and deduce its occurrence earlier in another less wet location.
This solves some of the ambiguity problems, as multiple Observations can record the same Occurrence of an Organism. Different Specimens can connect to a single Observation of an Occurrence, and constitute evidence for this Observation. Specimens can be samples or duplicates of other Specimens. A single Observation can record multiple occurring Organisms.
Specimens can then be connected to an Observation in various ways. That is, the Specimen constitutes
I'm not sure about the distinction for 'significant modification'. This is in part the difference between living and nonliving (preserved), but it's more complicated than that in practice. Is a piece of fur, a shark tooth or some birch sap living or preserved? An extra distinction between organism parts and organism products may be helpful here, but is a bit of a can of worms itself.
Applying this to the example of a pinned bee with 3 parasites and pollen, we get:
If we construe the bee collecting the pollen as an observation event itself, then we have a material sample that connects to multiple observation events. The observation in this case is not the bee collecting the pollen, but the observed pollen attached to the bee providing evidence for the pollen being collected by the bee earlier. This can also happen if we sample a plant damaged by deer or a whale with squid scars. In the same way, a fossil sample represents both the observation of the occurrence of a fossilized organism and the observation of the occurrence of an organism a long time ago.
The relationship between the observation event and the sample is direct: the sample is a product of the event. There is also some ambiguity with regards to specimen vs material sample. I like the definition of specimen being directly tied to curation, whereas a material sample is any physical object that is the result from an observation or the mutation of another sample. Hence, a specimen is a material sample, but a material sample may not be a specimen. This is particularly relevant when considering digital specimens: an observation may have as a material sample only the sensor output from a digital camera. This output is almost immediately digitized and otherwise lost. The digital recording may be curated as the recording of an observation (and hence evidence for it), in which case it is a digital specimen.
I know I've added another wall of text to an already extremely long discussion and I apologize for that, but I felt it important to get my thoughts somewhat in order and do a sanity check of whether this could be helpful.
@matdillen :
A physical Specimen can not possibly be connected to an Occurrence without an Observation taking place.
I've thought about this a lot as well, and somewhere recently (not sure if in a post on this issue, or somewhere else), I made the point that many collected specimens are observed before they are collected. In our case, we often observe them first, then capture an in-situ image of them, then collect the specimen. I see these as three separate pieces of "evidence" to support the `Occurrence`, but of course it's only one `Occurrence` (one `Organism`, one `Event`).
However, there are plenty of cases where organisms are collected without first being observed. Think trawls and plankton tows, and insect traps, etc.
A physical Specimen implies a record in some shape or format of this Observation of an Occurrence, be that record the whole organism dried and stuck on a sheet of paper, a blood sample of the organism or even a drawing of it.
Agreed! Hence my frequent references to "Evidence" as a "thing" in our data universe. Technically, though, the physical specimen itself does not represent evidence of the `Occurrence`. It can certainly serve as evidence of taxonomic `Identification`; but the actual "evidence" of the occurrence is the data label containing information about the circumstances of how the specimen was extracted from nature. This might seem like splitting hairs, but consider the circumstance when labels of two different specimens of the same taxon accidentally get switched (it happens -- researchers working on a species sometimes return fish specimens to the wrong jar, for example). I suppose in some cases properties of the specimen itself could be used to corroborate the time and/or location of collection, but I imagine that's the exception, rather than the rule.
An Observation of an Occurrence does not have to coincide in space and time with that Occurrence. For instance, one may observe an animal footprint and deduce the occurrence of that animal earlier.
I think this is a really good point, and relates to that earlier example from @dshorthouse with the cougar blood being collected in a water sample downstream.
Specimens can be samples or duplicates of other Specimens
This reminds me of something else I meant to point out earlier. My understanding of "duplicates" is "more than one `MaterialSample` derived from the same `Organism`". I think this concept is used mostly in botanical circles, but I wonder whether it's consistently used to mean the more explicit, "more than one `MaterialSample` derived from the same `Organism` from the same `Event`" (i.e., multiple pieces of evidence for the same `Occurrence`)?
Specimens can then be connected to an Observation in various ways. That is, the Specimen constitutes
The bullet list you provide is, I think, very helpful. I went through each example and imagined how I would capture the information with respect to `Events`, `Organisms`, `Occurrences`, and `MaterialSamples` -- but I wonder if everyone would arrive at the same conclusions for how to do that.
I guess a lot of how we slice this depends on how we define "observation". For example, if I drag a plankton net through the ocean, then dump the contents into alcohol and eventually get around to examining them months later back at the lab, was there ever an "Observation" to serve as evidence in support of an `Occurrence`? In my view, no. I think of "Observations" as more direct humans (eyes, ears, potentially smell, taste?) or lenses or microphones or whatever directly "observing" the `Organism` at the moment of an `Occurrence`. In cases where I first observe, then photograph, then collect an `Organism`, I generally don't bother adding a separate record of the "observation" as evidence, figuring that it is superseded by the image and the MaterialSample/Specimen. Generally, I track Observations only when there are no MaterialSample or recorded media available to support the `Occurrence`.
If we construe the bee collecting the pollen as an observation event itself, then we have a material sample that connects to multiple observation events.
This is another excellent point, and one I'm going to need to digest a bit more (in the shower/stuck in traffic/staring at my ceiling at night). Certainly an in-situ image can serve as evidence of multiple `Occurrences`, so I can see the same for `MaterialSamples` as well. The most obvious/common example in my world would be stomach contents. This also raises the issue of non-human organisms as "collectors", and hence "Agents", and hence indirectly supporting the parity of "Agent" and "Organism" (as discussed elsewhere).
I like the definition of specimen being directly tied to curation, whereas a material sample is any physical object that is the result from an observation or the mutation of another sample.
I still don't favor this distinction. I think `MaterialSample` necessarily involves an element of curation -- even if the "curation" is limited to the original act of collection. That leaves open the question of whether an observed (but untouched) skull in-situ is itself an instance of `MaterialSample`, or `Organism`, or something else. This, of course, comes back to my initial question of: What is the distinction between an `Organism` instance and a `MaterialSample` instance? I think "curation" definitely has something to do with it, but we still need to define that word. I'm not so sure I'm willing to recognize a distinction between "Specimen" and "MaterialSample". There are so many non-congruent definitions for "Specimen" that I feel there is little to be gained by acknowledging it as something distinct in some way from `MaterialSample`.
an observation may have as a material sample only the sensor output from a digital camera
I wouldn't go there (i.e., regarding a pattern of 1s and 0s as a `MaterialSample` in the DwC sense). It feels to me like "here be dragons".
Lots of good food for thought!
I think of "Observations" as more direct humans (eyes, ears, potentially smell, taste?) or lenses or microphones or whatever directly "observing" the Organism at the moment of an Occurrence.
Why does "observing" have to be limited to the senses? How do physicists observe a quark? Couldn't the net be the method by which we observe?
I'm not so sure I'm willing to recognize a distinction between "Specimen" and "MaterialSample". There are so many non-congruent definitions for "Specimen" that I feel there is little to be gained by acknowledging it as something distinct in some way from MaterialSample.
Agree. Also as we catalog objects for art, ethnology and historical collections, "specimen" is something we try to avoid. Your grandmother's hair in a locket probably should not be referred to as a "specimen".
an observation may have as a material sample only the sensor output from a digital camera
I wouldn't go there (i.e., regarding a pattern of 1s and 0s as a MaterialSample in the DwC sense). It feels to me like "here be dragons".
Also agree, sort of. See https://github.com/ArctosDB/arctos/issues/2118 I think this is a little murky, BUT thinking about EVIDENCE instead of MaterialSample might make it less so?
Why does "observing" have to be limited to the senses? How do physicists observe a quark? Couldn't the net be the method by which we observe?
Well... isn't that the line between `HumanObservation` and `MachineObservation`? [That's what I was intending to imply with "lenses or microphones or whatever"] If not, then where is that line? Do photons passing through a lens into human eyeballs (e.g., microscope, binoculars, telescope) count as `HumanObservation`, or `MachineObservation`? Perhaps that distinction does not need to be maintained?
Also as we catalog objects for art, ethnology and historical collections, "specimen" is something we try to avoid
Same here. We can treat a cultural object exactly the same (informatically) as a biological specimen; and I prefer the term `MaterialSample` for both.
BUT thinking about EVIDENCE instead of MaterialSample might make it less so?
Yes, my thinking on this is catching up to where @baskaufs was a while ago, which is that "Evidence" represents the relationship between a "token" (`MaterialSample`, `MaterialCitation`, media recording, observation, etc.) and an "assertion" (e.g., `Occurrence`, `Identification`). I had previously thought of the "Evidence" as the token itself; but now I see it more as a role than an object. (If that makes any sense?)
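To make the "role, not object" idea concrete, here is a small sketch in which `Evidence` is nothing more than a link between a token and an assertion. The class names `Token`, `Assertion`, and `Evidence`, and the helper functions, are hypothetical and chosen only to mirror the wording above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    token_id: str
    kind: str  # e.g. "MaterialSample", "media recording", "observation"


@dataclass(frozen=True)
class Assertion:
    assertion_id: str
    kind: str  # e.g. "Occurrence", "Identification"


@dataclass(frozen=True)
class Evidence:
    """A relationship: this token plays the evidence role for this assertion."""
    token: Token
    assertion: Assertion


def tokens_supporting(assertion, links):
    """Return every token that serves as evidence for one assertion."""
    return [link.token for link in links if link.assertion == assertion]


def assertions_supported_by(token, links):
    """Return every assertion for which one token serves as evidence."""
    return [link.assertion for link in links if link.token == token]


occurrence = Assertion("A1", "Occurrence")
identification = Assertion("A2", "Identification")
sighting = Token("T1", "observation")
photo = Token("T2", "media recording")
specimen = Token("T3", "MaterialSample")

links = [
    Evidence(sighting, occurrence),
    Evidence(photo, occurrence),
    Evidence(specimen, occurrence),
    Evidence(specimen, identification),
]
```

Because the role lives on the link, the same specimen can be evidence for both an `Occurrence` and an `Identification` without being duplicated, and one `Occurrence` can be supported by several tokens at once.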
I've thought about this a lot as well, and somewhere recently (not sure if in a post on this issue, or somewhere else), I made the point that many collected specimens are observed before they are collected. In our case, we often observe them first, then capture an in-situ image of them, then collect the specimen. I see these as three separate pieces of "evidence" to support the `Occurrence`, but of course it's only one `Occurrence` (one `Organism`, one `Event`).
However, there are plenty of cases where organisms are collected without first being observed. Think trawls and plankton tows, and insect traps, etc.
The reason I think the Observation is so key is that it marks the point where any information related to the Occurrence was somehow logged, so that it can be (re-)assessed later. Hence, the Observation of an insect in a trap happens when the insect is taken from that trap or seen inside it. If an insect dies in a trap, is eaten and digested by another insect fallen into the trap and never observed by any entity logging its Occurrence, then there was no Observation event.
Agreed! Hence my frequent references to "Evidence" as a "thing" in our data universe. Technically, though, the physical specimen itself does not represent evidence of the `Occurrence`. It can certainly serve as evidence of taxonomic `Identification`; but the actual "evidence" of the occurrence is the data label containing information about the circumstances of how the specimen was extracted from nature. This might seem like splitting hairs, but consider the circumstance when labels of two different specimens of the same taxon accidentally get switched (it happens -- researchers working on a species sometimes return fish specimens to the wrong jar, for example). I suppose in some cases properties of the specimen itself could be used to corroborate the time and/or location of collection, but I imagine that's the exception, rather than the rule.
The data label can be considered part of the specimen, or an additional specimen. This depends on how the objects were created and how they are being curated. However, as you say, information may be mixed up or connected incorrectly at any node of this model.
This reminds me of something else I meant to point out earlier. My understanding of "duplicates" is "more than one `MaterialSample` derived from the same `Organism`". I think this concept is used mostly in botanical circles, but I wonder whether it's consistently used to mean the more explicit, "more than one `MaterialSample` derived from the same `Organism` from the same `Event`" (i.e., multiple pieces of evidence for the same `Occurrence`)?
The methodology is not always clear. The definition of a single Organism may also not always be clear (e.g. rhizomatous plants, clonal tree groves or massive fungal networks). The most common usage, I think, would be samples collected during the same gathering event and from the same organism - or at least a very similar one. But a single gathering event might also take hours, days or even weeks.
I guess a lot of how we slice this depends on how we define "observation". For example, if I drag a plankton net through the ocean, then dump the contents into alcohol and eventually get around to examining them months later back at the lab, was there ever an "Observation" to serve as evidence in support of an `Occurrence`? In my view, no. I think of "Observations" as more direct humans (eyes, ears, potentially smell, taste?) or lenses or microphones or whatever directly "observing" the `Organism` at the moment of an `Occurrence`. In cases where I first observe, then photograph, then collect an `Organism`, I generally don't bother adding a separate record of the "observation" as evidence, figuring that it is superseded by the image and the MaterialSample/Specimen. Generally, I track Observations only when there are no MaterialSample or recorded media available to support the `Occurrence`.
I think of an Observation as an event where data on an Occurrence gets logged. This can get really tedious and you could divide everything up into countless mini-observations. If this is meaningful to what you are researching and a feasible thing to do, you could log your data that way. But, as you say, people will regularly simplify this model as many complications are unnecessary. In particular, many mini-observations may be redundant.
In practice, many observations will get merged this way. For instance, if you observe, photograph and collect an organism, you may later remember something peculiar about its behavior that is not apparent from its preserved body nor the photograph. You note this additional information on a label or in a publication which covers this Occurrence. Hence, it becomes de facto a part of a larger material sample related to this Occurrence and the distinction of this separate Observation gets lost in time or is considered irrelevant by everyone ever working with this Occurrence.
This is another excellent point, and one I'm going to need to digest a bit more (in the shower/stuck in traffic/staring at my ceiling at night). Certainly an in-situ image can serve as evidence of multiple `Occurrences`, so I can see the same for `MaterialSamples` as well. The most obvious/common example in my world would be stomach contents. This also raises the issue of non-human organisms as "collectors", and hence "Agents", and hence indirectly supporting the parity of "Agent" and "Organism" (as discussed elsewhere).
Non-human animals definitely observe Occurrences, but the question is how they can log that information. If we can communicate with them like we communicate among humans or with machines, then that model would work.
I wouldn't go there (i.e., regarding a pattern of 1s and 0s as a `MaterialSample` in the DwC sense). It feels to me like "here be dragons".
And it's said that where there be dragons, there be treasure. I agree that we have enough going on not to open this discussion, but fundamentally to me there is no difference between digital data about an Occurrence and physical data. There are (currently) limitations to how we can represent physical data digitally (and vice versa), but this is not a theoretical hard distinction. A bit stream is 'simply' a very versatile, easily manageable and easily replicable representation of anything physical.
Lots of good food for thought!
Thank you!
What this thread shows to me is that representational primitives in a schema don't function in isolation and how important it is to match expectations of what a given element of a schema represents with the formal definition and the designated label for that element (the term itself).
@campmlc :
But I'm still confused by how Occurrence and MaterialSample are applied in this example. If Occurrence is place+time+organism, I guess it makes sense to have two occurrences. But the place+time is shared by the bee and the pollen (and parasites)- this is a very important piece of data that seems to get lost in your mapping.
place+time = `Event`. So when I collect a bee that has pollen and three parasites, I would:
* Create one `Event` instance (place+time)
* Create five `Organism` instances (one bee, one plant/pollen, three different parasites)
* Create five `Occurrence` instances (one for each of the `Event`+`Organism` pairings)
Let's say the parasites are 3 mites (from one or more different species).
What gets lost in this representation goes in my opinion even one step further than @campmlc pointed out above: the fact that the bee, pollen and mites formed, when first observed, a physically connected object. And the actual nature of this physical connectedness, as it was observed, leads us (from a large body of related observations) to conclude that there are certain functional relations between the bee and the pollen (the bee actively collected the pollen) and the bee and the mites (the mites are parasites of the bee). This is especially interesting if, for example, that kind of pollen or that kind of mite is observed for the first time on that kind of bee (or, in the case of collection specimen, one of them has gone extinct in the meantime).
The existence of an `occurrence` in the above sense, i.e. that a particular `organism` is implicated in a particular `Event`, follows logically from the fact that the object that the organism was part of is implicated in the `Event`. If this is all that is of interest then this representation might be adequate. But I would argue that it is insufficient as it fails to capture findings about the world that are of interest for a multitude of purposes.
Also, depending on the actual definition of `Organism` (the application of which might present its own set of problems, I agree on this with @matdillen) the individual pollen grains might account for individual instances of `Organism` in this example.
I would also question the concept of `Event` as a place and time - I rather see events as processes which unfold in a particular spatio-temporal region and which have various participants (the bee, the collector, the malaise trap) and which can have other processes as proper parts.
My bottom-line is this: samples collected in the field, sub-samples, collection specimens may all in actuality contain innumerable individual organisms or parts of them. While sometimes the assembly isn't of interest, it is important in others (or may become important - we started analyzing pollen on bumblebees 150 years after these were collected). One way to represent this is to acknowledge that generally we deal with physical entities of some sort (specimens, samples, material samples - whatever distinctions need to be made and whether that's in the field or in a collection) a part of which can be identified as (part or whole) of a particular organism. This is what @dshorthouse also alluded to earlier.
Regarding the relation between `MaterialSample` and `Organism` and the great thought experiment @deepreef put forward I think that sentences like "This is a dead parrot." indicate that organisms continue to exist after they're dead :-)
I would argue that some physical entities (`Organisms`), at a given point in time, are alive (or can have living parts - possibly of more than one organism). Other physical entities are clearly not alive. In each of these cases, I can capture that quality, if need be. Biological organisms (similar cases could be made for the collection of non-living material, e.g. fossils or bird nests) are collected and their physical substance, through a succession of processes after initial collection, is transformed into something referred to as `Specimen` or `MaterialSample` (possibly many, possibly in subsequent stages, physical entities nonetheless). At some point it may become meaningless to consider that entity an `Organism` anymore. There may be numerous cases where the decision if alive or not is difficult, but I'm not sure I have a use case at hand where that distinction must be made in every case in order to achieve an informative representation. If it must be made in DwC then this could, from my perspective, point to the need to revise these concepts and/or the design patterns in which they are jointly used.
Well... isn't that the line between HumanObservation and MachineObservation? [...] Perhaps that distinction does not need to be maintained?
It probably doesn't - all observations (that we are talking about) are human eventually as we are interpreting whatever the "machine" observed and have no way of knowing what the "machine" itself observed.
The reason I think the Observation is so key is that it marks the point where any information related to the Occurrence was somehow logged, so that it can be (re-)assessed later. Hence, the Observation of an insect in a trap happens when the insect is taken from that trap or seen inside it. If an insect dies in a trap, is eaten and digested by another insect fallen into the trap and never observed by any entity logging its Occurrence, then there was no Observation event.
Schrödinger's cat anyone? But yes, however....
Non-human animals definitely observe Occurrences, but the question is how they can log that information. If we can communicate with them like we communicate among humans or with machines, then that model would work.
I kinda have an issue with the implied definition of "communication". In the example provided (stomach contents), the animal "logs" the observation with the evidence collected in its stomach. Writing stuff down or speaking are not the only methods of communication.
@matdillen :
The reason I think the Observation is so key is that it marks the point where any information related to the Occurrence was somehow logged, so that it can be (re-)assessed later.
I understand where you're coming from -- but I still am reluctant to treat everything as an Observation in the sense of DwC (HumanObservation, MachineObservation). When I sort through that plankton sample back at the lab, I don't want to anchor its Occurrence at a depth of several meters along a transect-line out in the ocean on the day that the plankton sample was extracted from nature to an "Observation", because I didn't observe it several meters deep out in the ocean. I can only infer that it occurred at that depth, on that transect. I want to anchor it to a gathering event that did not include any observed organisms at the time and place of interest.
Similarly for the pollen on the collected bee, I don't want to create an Occurrence representing the bee's observing the pollen at the time it was extracted from the flower; but I do want to infer the presence of that species of flower within some radius and time-frame associated with the Event where the bee was extracted from nature. Factually, I can only say that a derivative of the flower (i.e., the pollen) was present at the Event where the bee was collected. As with the case of the cougar blood collected in the stream, does that mean that the plant Organism from which the pollen was collected was simultaneously out in the field where the flower is and also on the bee tens of meters away (i.e., two separate Events at the same time)? Or would I create a separate Event (with larger coordinateUncertaintyInMeters) to represent the likely place/time where the flower was when the bee gathered the pollen? This gets right to the heart of my question about when an Organism becomes a MaterialSample. In my current thinking, the plant Organism was simultaneously both where the flower was and where the pollen was at the time the bee was collected (in the same way that the cougar, as an Organism, was on the river bank eating a fish and was also present downstream as blood when the water sample was collected). But maybe that's the wrong way to look at it?
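A minimal sketch of the second option raised above (a separate Event with a larger coordinateUncertaintyInMeters) might look like the following. The `Event` class, the `inferred_source_event` helper, and the 500 m foraging distance are hypothetical illustrations, not part of Darwin Core or of any proposal in this thread; only the term name coordinateUncertaintyInMeters comes from the standard.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Event:
    # Field names loosely mirror Darwin Core terms; the class itself is illustrative.
    event_id: str
    decimal_latitude: float
    decimal_longitude: float
    coordinate_uncertainty_in_meters: float


def inferred_source_event(collecting_event, max_travel_m, new_event_id):
    """Derive an Event for where the flower probably was.

    Assumption: the pollen source lies within max_travel_m of the point
    where the bee was collected, so the uncertainty radius simply grows
    by that distance. This is an inference, not an observation.
    """
    return replace(
        collecting_event,
        event_id=new_event_id,
        coordinate_uncertainty_in_meters=(
            collecting_event.coordinate_uncertainty_in_meters + max_travel_m
        ),
    )


bee_event = Event("evt-bee", 21.3, -157.8, 30.0)
flower_event = inferred_source_event(bee_event, max_travel_m=500.0, new_event_id="evt-flower")
```

The collecting Event is left untouched, so the bee's Occurrence and the inferred flower Occurrence can point at different Events with different uncertainties.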
@cboelling :
Let's say the parasites are 3 mites (from one or more different species).
In my example you quoted, I specifically intended the three parasites to be three different species (otherwise they could be collapsed into a single Organism instance). But it doesn't really matter.
What gets lost in this representation goes in my opinion even one step further than @campmlc pointed out above: the fact that the bee, pollen and mites formed, when first observed, a physically connected object. And the actual nature of this physical connectedness, as it was observed, leads us (from a large body of related observations) to conclude that there are certain functional relations between the bee and the pollen (the bee actively collected the pollen) and the bee and the mites (the mites are parasites of the bee).
I agree this is super important and useful information, but when recording these connections, are they represented as relationships among dwc:Organism instances, dwc:MaterialSample instances, or dwc:Occurrence instances? It seems to me that what makes the relationships interesting are with respect to the Organisms; but probably the most explicit way to represent these relationships is as among the associated Occurrences (capturing not just the relationships among the Organisms, but the context in terms of place and time of those relationships). This is another class of information that MaterialSamples can serve as evidence to support. In other words, a particular MaterialSample not only can serve as evidence of the existence of an Occurrence, and the taxonomic identity of an Organism, but also the relationship (beyond just co-occurrence in space and time) among a set of multiple Organisms. This will not always be the case, as the nature of the connectedness of the different Organisms in this example has different implications than other multi-organism MaterialSample instances (e.g., water samples, or a "lot" of specimens, which tells you little more about the associations among the organisms than co-occurrence in space and time).
In any case, I wouldn't say this kind of information is "lost" in this example; rather the discussion was primarily about parsing out instances of Occurrence, Organism and MaterialSample. Once that is sorted out, we're then able to start adding relationships among these instances to capture the interesting inferences about connectedness among the represented instances.
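As a rough illustration of relationships recorded among Occurrences and backed by a MaterialSample, the sketch below borrows the Darwin Core ResourceRelationship terms resourceID, relationshipOfResource and relatedResourceID. The `evidence_sample_id` field, the class name, the helper, and all identifiers and relationship labels are hypothetical, added only to show the "sample as evidence" idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OccurrenceRelationship:
    resource_id: str               # dwc:resourceID (subject Occurrence)
    relationship_of_resource: str  # dwc:relationshipOfResource
    related_resource_id: str       # dwc:relatedResourceID (object Occurrence)
    evidence_sample_id: str        # hypothetical link to the supporting MaterialSample


def relationships_supported_by(sample_id, relationships):
    """Return the relationships for which the given MaterialSample is the evidence."""
    return [r for r in relationships if r.evidence_sample_id == sample_id]


relationships = [
    OccurrenceRelationship("occ-mite-1", "parasite of", "occ-bee", "sample-bee-1"),
    OccurrenceRelationship("occ-bee", "collected pollen of", "occ-plant", "sample-bee-1"),
    # A water sample supports little more than co-occurrence in space and time.
    OccurrenceRelationship("occ-fish", "co-occurs with", "occ-alga", "sample-water-7"),
]

supported = relationships_supported_by("sample-bee-1", relationships)
```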
Also, depending on the actual definition of Organism (the application of which might present its own set of problems, I agree on this with @matdillen) the individual pollen grains might account for individual instances of Organism in this example.
Yes -- assuming the aggregated pollen came from more than one plant "whole organism". But part of the reason dwc:Organism is defined to accommodate multiple individuals of the same taxon is to avoid forcing the parsing of individuals (especially when the boundaries between individuals are unclear). Of course, if more than one species of plant is represented among the pollen, then it would be necessary to establish at least one instance of Organism for each species of plant.
I would also question the concept of Event as a place and time - I rather see events as processes which unfold in a particular spatio-temporal region and which have various participants (the bee, the collector, the malaise trap) and which can have other processes as proper parts.
I agree. I use the equation "Event=Place+Time" as short-hand; but in reality it also involves other properties as well, and incorporates a process (e.g., samplingProtocol, etc.).
I think that sentences like "This is a dead parrot." indicate that organisms continue to exist after they're dead :-)
I tend to agree -- the Organism doesn't cease to exist when it dies. We already must accommodate the contemporaneous existence of Organisms and their derived MaterialSamples (e.g., a living tree that is resampled multiple times while it continues to live) -- so I see no reason why this can't extend beyond death. In other words, the parrot can continue to persist as an Organism, even if its only non-disintegrated manifestation is as a preserved skin in a Museum.
Where I'm still a little fuzzy is in dis-associated components derived from the same Organism (i.e., the cougar on the river bank and its blood downstream; the flower in the field and its pollen on a bee; a basking shark swimming through the ocean and its DNA picked up in a water sample; the dinosaur in the forest, and its fossilized remains collected millions of years later, etc.) In other words: We know that a single Organism can participate in multiple Events (i.e., multiple Occurrences) at different times, but can a single Organism participate in multiple Occurrences at different places at the same time? I think the answer will not come from philosophical thought experiments, but rather from practical need.
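One practical way to see where this question bites is to look for the pattern in data. The sketch below uses hypothetical `Event` and `Occurrence` classes and made-up identifiers; it assumes that "same time" means the same calendar date and "different place" means a different locality string, which is far cruder than real data would need.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Event:
    event_id: str
    event_date: date
    locality: str


@dataclass(frozen=True)
class Occurrence:
    occurrence_id: str
    organism_id: str
    event: Event


def simultaneous_occurrences(occurrences):
    """Return ID pairs where one Organism occurs on the same date at different localities."""
    pairs = []
    for a, b in combinations(occurrences, 2):
        if (
            a.organism_id == b.organism_id
            and a.event.event_date == b.event.event_date
            and a.event.locality != b.event.locality
        ):
            pairs.append((a.occurrence_id, b.occurrence_id))
    return pairs


river_bank = Event("evt-1", date(2021, 9, 1), "river bank")
downstream = Event("evt-2", date(2021, 9, 1), "downstream water sample")
occurrences = [
    Occurrence("occ-1", "cougar-1", river_bank),
    Occurrence("occ-2", "cougar-1", downstream),
    Occurrence("occ-3", "fish-1", river_bank),
]

conflicts = simultaneous_occurrences(occurrences)
```

Whether such a pair is an error or a legitimate statement about the cougar and its blood is exactly the modeling decision under discussion; the code only surfaces the cases.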
On a final note, I realize that many people will find this discussion excessive in length/volume, but I am definitely benefitting from it, and VERY MUCH appreciate that I am not the only one who has wrestled with these questions!
@Jegelewicz
I kinda have an issue with the implied definition of "communication". In the example provided (stomach contents), the animal "logs" the observation with the evidence collected in its stomach. Writing stuff down or speaking are not the only methods of communication.
My point is not to have a strict definition of communication in general. Non-human animals do communicate and some can definitely communicate occurrences of other organisms to each other. The problem is that Darwin Core is a standard designed by and to be used by humans. The data in it will inevitably be human interpretations of the biological world. Hence, the occurrences that can be inferred by observing the stomach contents of another animal are human interpretations of these samples. A human observes past occurrences using parts of the studied animal as a proxy. You could put an observation event in between, which is the encounter of the animal that led to samples ending up in its stomach, but that information too will be a human interpretation and often part of the observation where a human looked at the stomach content.
@deepreef
I understand where you're coming from -- but I still am reluctant to treat everything as an Observation in the sense of DwC (HumanObservation, MachineObservation). When I sort through that plankton sample back at the lab, I don't want to anchor its Occurrence at a depth of several meters along a transect-line out in the ocean on the day that the plankton sample was extracted from nature to an "Observation", because I didn't observe it several meters deep out in the ocean. I can only infer that it occurred at that depth, on that transect. I want to anchor it to a gathering event that did not include any observed organisms at the time and place of interest.
Why is this inference not an observation? Many scientific disciplines use complex, indirect methods to observe what is happening.
The gathering event is, essentially, a method or protocol, or an instance of its implementation. These can be modeled separately, but the key unit that we are interested in from this Darwin Core perspective is the Occurrence. The differentiation with Observations that I'm thinking of is a method to address Occurrence ambiguity, in particular connecting Occurrences to evidence for them.
Similarly for the pollen on the collected bee, I don't want to create an Occurrence representing the bee's observing the pollen at the time it was extracted from the flower; but I do want to infer the presence of that species of flower within some radius and time-frame associated with the Event where the bee was extracted from nature.
Yes. We have an Occurrence of a bee and an Occurrence of pollen. Both are tied to the same Material Sample, through the Observations that are the collecting of this sample and/or the study of it.
As with the case of the cougar blood collected in the stream, does that mean that the plant Organism from which the pollen was collected was simultaneously out in the field where the flower is and also on the bee tens of meters away (i.e., two separate Events at the same time)? Or would I create a separate Event (with larger coordinateUncertaintyInMeters) to represent the likely place/time where the flower was when the bee gathered the pollen? This gets right to the heart of my question about when an Organism becomes a MaterialSample. In my current thinking, the plant Organism was simultaneously both where the flower was and where the pollen was at the time the bee was collected (in the same way that the cougar, as an Organism, was on the river bank eating a fish and was also present downstream as blood when the water sample was collected). But maybe that's the wrong way to look at it?
It depends on how you define a single Organism (pollen vs flower) and how well you can disambiguate Occurring Organisms with the data you possess. Regardless of how we decide to model it, it will always be tricky to assess whether different Observations were made of the same, single Organism.
@matdillen :
Why is this inference not an observation?
Geez... now you've forced me to actually think about what I think about this! :) OK, short answer, I guess, is "Because that's how I've always defined the term 'observation' in my own mind." I don't think physicists (as humans) have ever observed subatomic particles; they just infer their existence from lots of data. Most of those data come from what I guess many people would classify as MachineObservation -- which of course is still "observation", so I still don't have a good counterpoint to your main point here. More showers/traffic jams/lying awake at night needed for me here, I think.
It depends on how you define a single Organism (pollen vs flower) and how well you can disambiguate Occurring Organisms with the data you possess.
Yes! Exactly! This gets right to the heart of my struggles with the boundary between Organism and MaterialSample; and the "scope" of Organism -- both in terms of lifespan, and in terms of whether the pollen/blood is within scope of the Organism instance of the flower/cougar, or represents a distinct Organism (???), or represents some sort of derivative of the Organism (in a MaterialSample sense???). These may be edge cases in the universe of biodiversity informatics, but they do matter (and probably will matter increasingly going forward).
Regardless of how we decide to model it, it will always be tricky to assess whether different Observations were made of the same, single Organism.
Again, 100% agreement. Maybe there is no solution, but I still feel like we can improve our collective understanding of this stuff. Perhaps we can at least come to some consensus on the limits of where consensus can be achieved.
It depends on how you define a single Organism (pollen vs flower) and how well you can disambiguate Occurring Organisms with the data you possess.
Yes! Exactly! This gets right to the heart of my struggles with the boundary between Organism and MaterialSample; and the "scope" of Organism -- both in terms of lifespan, and in terms of whether the pollen/blood is within scope of the Organism instance of the flower/cougar, or represents a distinct Organism (???), or represents some sort of derivative of the Organism (in a MaterialSample sense???). These may be edge cases in the universe of biodiversity informatics, but they do matter (and probably will matter increasingly going forward).
I agree that we need a definition of organism. https://en.wikipedia.org/wiki/Organism might be a place to start
In biology, an organism (from Greek: ὀργανισμός, organismos) is an entity capable of carrying on life functions.
I would say that for the cougar example, the "things" that are cataloged (blood in the water, image of cougar) are evidence of an occurrence of an organism. Maybe they are the same organism, maybe they are not. As for the pollen, under the definition above, I would say it is a material sample or part of an organism. Unless the pollen can obtain nutrients, create waste products and reproduce on its own, it doesn't fit the definition above. Other definitions may be better, but we have to start somewhere.
Regardless of how we decide to model it, it will always be tricky to assess whether different Observations were made of the same, single Organism.
Again, 100% agreement. Maybe there is no solution, but I still feel like we can improve our collective understanding of this stuff. Perhaps we can at least come to some consensus on the limits of where consensus can be achieved.
Not always! Periodic sampling of blood from wolves in a breeding program or from zoo animals makes it much more certain that you are sampling a single, individual organism (human error notwithstanding).
I agree that we need a definition of organism. https://en.wikipedia.org/wiki/Organism might be a place to start
I would prefer to start here: "A particular organism or defined group of organisms considered to be taxonomically homogeneous."
I would say that for the cougar example, the "things" that are cataloged (blood in the water, image of cougar) are evidence of an occurrence of an organism. Maybe they are the same organism, maybe they are not. As for the pollen, under the definition above, I would say it is a material sample or part of an organism.
I agree, but are there any cases where we want to track parts of Organisms that are not cataloged/curated (e.g., in nature)? I guess more generally, do MaterialSample instances participate in Occurrences only via a representation of an Organism instance? For example, we mint a new organismID for each taxonomic entity identified within a water sample, and then have that Organism instance participate in the Occurrence associated with the water sample collecting Event (even if the presence of the Organism at the Event was only some DNA material in the water)? That, to me, seems like the best balance of "ideal data model" and "practical implementation", but I may not be the best person to judge that balance.
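The pattern described above (one new organismID per taxon detected in a water sample, each tied to the collecting Event) could be sketched roughly as follows. The function name, the choice of UUID-based identifiers, and the example values are assumptions made for illustration; the dictionary keys are existing Darwin Core term names.

```python
import uuid


def occurrences_from_sample(material_sample_id, event_id, detected_taxa):
    """Build one Occurrence record per distinct taxon detected in a sample.

    Assumption: each taxon gets a freshly minted organismID, because the
    sample alone cannot tell us whether the DNA came from one individual
    or many, or from an Organism already known elsewhere.
    """
    records = []
    for taxon in sorted(set(detected_taxa)):
        records.append(
            {
                "occurrenceID": f"urn:uuid:{uuid.uuid4()}",
                "organismID": f"urn:uuid:{uuid.uuid4()}",
                "eventID": event_id,
                "materialSampleID": material_sample_id,
                "scientificName": taxon,
            }
        )
    return records


records = occurrences_from_sample(
    material_sample_id="sample-water-7",
    event_id="evt-water-7",
    detected_taxa=["Puma concolor", "Cetorhinus maximus", "Puma concolor"],
)
```

Duplicate detections of the same taxon collapse into one Organism instance here, which matches the idea that dwc:Organism may cover several individuals of one taxon.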
do MaterialSample instances participate in Occurrences only via a representation of an Organism instance?
I was just going to add that ALL of the "things" we have in collections are MaterialSample(s) of Organisms - we NEVER have the whole thing because Organisms have a life over time and capturing the entirety of that is not possible for mere humans.
Change term
From https://dwc.tdwg.org/terms/#materialsample
From https://dwc.tdwg.org/terms/#livingspecimen
Given the above, we propose that MaterialSample should be more specific to something less than what might be considered a "voucher" in order to delineate it from PreservedSpecimen.
Proposed new attributes of the term:
Note: all of the above is my interpretation of the Arctos Working Group conversation.