tdwg / material-sample

A Task Group of the Observations and Specimen Records (OSR) Interest Group

Other Deliverable - BasisOfRecord review #11

Closed Jegelewicz closed 2 years ago

Jegelewicz commented 2 years ago

Task Group will make a recommendation [...] as to which class in the Darwin Core standard these properties belong which may also include recommendations for terms being revised, added, disambiguated, or deprecated. Depends upon definitions provided [in primary deliverable]. [...] Recommendations will be provided for a revised formal definition as it pertains to materialSample but will not consider other data types.

Current Darwin Core Placement/Definition

http://rs.tdwg.org/dwc/terms/basisOfRecord

this term is a property of Record-level

Definition

The specific nature of the data record.

Examples

PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, Event, HumanObservation, MachineObservation, Taxon, Occurrence, MaterialCitation

Comments

Recommended best practice is to use the standard label of one of the Darwin Core classes.

See also

umbrella issue related to dwc:basisOfRecord and an Evidence class: https://github.com/tdwg/dwc/issues/302

Jegelewicz commented 2 years ago

From https://github.com/tdwg/material-sample/issues/3#issuecomment-904931698

challenge ourselves to include the edges, or relationships, in how we come to define PreservedSpecimen and other ilk we've traditionally dumped in basisOfRecord

Why do we define terms that are meant to be controlled vocabulary for this particular term? Is that done anywhere else in Darwin Core? How did this one term get the privilege of using other terms as controlled vocabulary?

Maybe this term is so difficult because it is now special and "we" have total control over it. Yet, it is only "recommended", so when I decide to use "Object", "Specimen" or "Photo", what happens? Deviation from the Darwin Core class terms will probably mean nobody knows what to do with my records. I guess I feel that we are essentially saying that basisOfRecord requires a Darwin Core Class and I am not certain that is something we want, or is it?

tucotuco commented 2 years ago

Why do we define terms that are meant to be controlled vocabulary for this particular term? Is that done anywhere else in Darwin Core? How did this one term get the privilege of using other terms as controlled vocabulary?

There are actually four Darwin Core terms that recommend a controlled vocabulary of terms also generated and managed by Darwin Core. The basisOfRecord term happened to be the first of them, already in place in the first version of the standard in 2009. The other three terms with controlled vocabularies minted and managed by Darwin Core are establishmentMeans, degreeOfEstablishment, and pathway. There are many other terms in Darwin Core that recommend using a controlled vocabulary, and some of them recommend specific controlled vocabularies. There is also one term adopted by Darwin Core from Dublin Core, dc:type, which requires adherence to a specific vocabulary, the DCMI Type Vocabulary, among which PhysicalObject, Event, StillImage, MovingImage, Sound, Dataset, Collection, and Text are of interest to us in biodiversity.

Now for some history. I had hoped never to have to write "The Sordid History of Darwin Core", but I feel compelled to at least provide a draft of the "Sordid History of basisOfRecord" chapter. Warning, I am going to be thorough.

Darwin Core evolved from an abbreviated schema of terms for the Species Analyst network (2001) meant to be used for sharing information about museum and herbarium specimens. By 2003, there was demand to be able to share observation records as well. It was deemed to be of extreme importance to be able to distinguish specimens from observations. Thus BasisOfRecord was born. It was defined as "An abbreviation indicating whether the record represents an observation (O), living organism (L), specimen (S), germplasm/seed (G), etc." Just four days later it was redefined (this was before being a standard, things were easier to change then), repenting the recommendation for abbreviations that would not be universally interpretable, and the definition became, "A description indicating whether the record represents an observation, tissue sample, living organism, voucher specimen, germplasm/seed, genetic information, etc." Over the next four years a trend toward a vocabulary began to form and in 2007 a new version of the term was minted with the definition, "A descriptive term indicating whether the record represents an object or observation. Examples: PreservedSpecimen- A physical object representing one or more organisms, part of organism, or artifact of an organism. synonyms: voucher, collection, lot. FossilSpecimen- A physical object representing one or more fossil organisms, part of fossil organism, or artifact of a fossil organism. LivingSpecimen- An organism removed from its natural occurrence and now living in captivity or cultivation. HumanObservation- A report by a known observer that an organism was present at the place and time. MachineObservation- A report by a monitoring device that an organism was present at the place and time. StillImage- An [sic] photograph, drawing, painting. MovingImage- A sequence of still images taken at regular intervals and intended to be played back as a moving image; may include sound. SoundRecording- An audio recording. 
OtherSpecimen- Any type of specimen not covered by any of the categories above." Careful sleuthing would indicate that someone had already been looking at Dublin Core, though none of the Dublin Core terms had been adopted at that time.

Darwin Core as a standard under TDWG was modeled on Dublin Core and the version of basisOfRecord that came out of the ratification process for the first version had a biodiversity-based type vocabulary and was a subtype of the more generic Dublin Core dcterms:type term. The first-ever official version of basisOfRecord recommended the use of the Darwin Core Type controlled vocabulary, "The specific nature of the data record - a subtype of the dcterms:type. Recommended best practice is to use a controlled vocabulary such as the Darwin Core Type Vocabulary (http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm)." The type vocabulary consisted of terms (classes) in a namespace dwctype:, separate from dwc:, again following the Dublin Core example, with the following labels: FossilSpecimen, HumanObservation, LivingSpecimen, Location, MachineObservation, NomenclaturalChecklist, Occurrence, PreservedSpecimen, and Taxon. Being a subtype of dcterms:type, the Dublin Core Type Vocabulary was also supposed to be valid to use for basisOfRecord.

Later in 2009 there was a decision to rescind the recommendation to populate dcterms:type with Darwin Core terms, "The recommended controlled vocabulary for dcterms:type was changed from a vocabulary of Darwin Core Classes (Occurrence, Taxon, Location, Event) to the DCMI type vocabulary (PhysicalObject, Event, StillImage, MovingImage, Sound, Text, Dataset) to be consistent with the standard use of that term." There was an accompanying decision to take the DCMI terms out of the recommended Type Vocabulary for Darwin Core, "The recommended controlled vocabulary for basisOfRecord remains the Darwin Core Type Vocabulary, but the Dublin Core classes StillImage, MovingImage, and Sound were removed from that list as these are to be used as vocabulary for the dcterms:type term."

In 2011 there was further introspection as we began to think about using Darwin Core with RDF for Linked Open Data. There had been an attempt to add the Dublin Core Type Vocabulary terms StillImage, MovingImage, and Sound back into the recommended list for the Darwin Core Type Vocabulary, but this was rejected with this decision, "The Dublin Core type vocabulary values StillImage, MovingImage, and Sound have not been added to the list of valid values for Darwin Core type vocabulary pending further insights from the RDF Interest Group about how best to manage basisOfRecord when a record can be 'about' more than one subject." At the same time, and for the same reasons, a decision was made to remove the subclass designations from terms in the Darwin Core Type vocabulary, "The subclasses for Darwin Core type vocabulary have been removed. These were seen as too constraining when considering biodiversity information in the context of linked data." This affected PreservedSpecimen, FossilSpecimen, LivingSpecimen, HumanObservation, MachineObservation, and NomenclaturalChecklist, which were all subclasses of dwctype:Occurrence.

In 2013 something curious happened. The MaterialSample term was added to the Darwin Core Type Vocabulary (along with a MaterialSample class in the dwc: namespace). The curious part was that, despite the 2011 decisions to the contrary, dwctype:MaterialSample was defined as a subclassOf OBI:specimen! A month later Joel Sachs questioned this detail in a presentation at TDWG 2013, concluding, "Assertions that tie Core terms to upper ontologies should be asserted in a separate document. That way, those doing integration that depends on OBI axioms can ingest the appropriate descriptions. Those that don’t need the OBI axioms don’t have to worry about incorrect inference."

The Darwin Core Type Vocabulary survived until late 2014, when the utility of having separate (some redundant) terms in a separate namespace in Darwin Core was questioned. Community discussion resulted in the decision to remove the Darwin Core Type Vocabulary in favor of the Darwin Core classes, and to add the classes that were missing in the dwc: namespace (namely PreservedSpecimen, LivingSpecimen, FossilSpecimen, MachineObservation, and HumanObservation). With this change, MaterialSample lost its subclassOf property (though it retains OBI:specimen as a superclass in the Biological Collections Ontology).

Things have been stable (but not without controversy) with respect to terms related to basisOfRecord until July 2021 when MaterialCitation was added as a new class and recommended term in the vocabulary of basisOfRecord.

If you are a glutton for punishment there is a lot of historical discussion going back to 2009 on basisOfRecord that can be followed from this post in the tdwg-content list.

Maybe this term is so difficult because it is now special and "we" have total control over it. Yet, it is only "recommended", so when I decide to use "Object", "Specimen" or "Photo", what happens? Deviation from the Darwin Core class terms will probably mean nobody knows what to do with my records.

Humans are unlikely to find your records, and machines certainly won't without a lot more help.

I guess I feel that we are essentially saying that basisOfRecord requires a Darwin Core Class and I am not certain that is something we want, or is it?

Controlled vocabularies that are used are more useful for finding things than not having them, or not having them followed. That doesn't seem like the crux of the problem. Some people want more options, and that is an option - just look at MaterialCitation - it was identified as necessary, justified, and put into practice. More of an issue is that in Simple Darwin Core all you get is a row for an Occurrence, an Event, or a Taxon, but that "record" can be "about" lots of things at the same time, and the one I'm interested in publishing might not be the one that interests someone else searching biodiversity databases.

dr-shorthair commented 2 years ago

There is also one term adopted by Darwin Core from Dublin Core, dc:type, which requires adherence to a specific vocabulary, the DCMI Type Vocabulary, among which PhysicalObject, Event, StillImage, MovingImage, Sound, Dataset, Collection, and Text are of interest to us in biodiversity.

Not quite correct. "Recommended practice is to use a controlled vocabulary such as the DCMI Type Vocabulary". You can choose to use a different type vocabulary. In DCAT2 we identified four other general purpose type vocabularies: ISO-19115-1 scope codes; Datacite resource types; PARSE.Insight content-types used by re3data.org; and MARC intellectual resource types. You could also define a domain-specific type vocabulary ... which I think is what you guys have in DWC.

tucotuco commented 2 years ago

I stand corrected. Thanks, Simon.



albenson-usgs commented 2 years ago

@Jegelewicz wrote

Maybe this term is so difficult because it is now special and "we" have total control over it. Yet, it is only "recommended", so when I

I would argue this is not true. At least if you are sharing via the IPT then basisOfRecord is definitely a required term and the only one with a required controlled vocabulary.

Jegelewicz commented 2 years ago

image

Is the IPT setting this requirement? Or does "recommended" really mean "thou shalt"?

tucotuco commented 2 years ago

The IPT can be configured to use various Core and Extension definitions. The Cores are for Occurrence, Event, and Taxon. The Event and Taxon Cores do not have a basisOfRecord term in them. The Occurrence Core does have basisOfRecord and it is required to be present in the published dataset. The basisOfRecord term in the Occurrence Core currently has a "thesaurus" element that points to https://github.com/gbif/rs.gbif.org/blob/master/vocabulary/dwc/basis_of_record.xml, which defines the recommended terms and gives translations for their labels in various languages. Because of this thesaurus element, the IPT gives data publishers the option to choose a value from that list in order to map an entire dataset to a constant (one of those values) in lieu of mapping to a field in the dataset. The IPT does not constrain the values in any field mapped to basisOfRecord, which is why there are more than 620 distinct values for that field from the data shared through GBIF. GBIF does their best to interpret those verbatim values into the controlled vocabulary values so that people can still use them effectively for searching in GBIF.
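
To make the idea of interpreting verbatim values concrete, here is a minimal Python sketch. It is not GBIF's actual interpretation logic, which is not described in this thread. The vocabulary is the list of examples given for basisOfRecord at the top of this issue, and the matching rule (ignore case, spaces, and punctuation) and the function names are assumptions made purely for illustration.

```python
import re

# Recommended values, taken from the examples listed for dwc:basisOfRecord above.
CONTROLLED = (
    "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen", "MaterialSample",
    "Event", "HumanObservation", "MachineObservation", "Taxon", "Occurrence",
    "MaterialCitation",
)


def _key(value: str) -> str:
    """Reduce a value to lowercase letters only (assumed matching rule)."""
    return re.sub(r"[^a-z]", "", value.lower())


# Lookup table from the reduced form to the controlled label.
_LOOKUP = {_key(term): term for term in CONTROLLED}


def interpret_basis_of_record(verbatim: str):
    """Return the controlled label for a verbatim value, or None if unrecognized."""
    return _LOOKUP.get(_key(verbatim))
```

Under these assumptions, "Human Observation" and "preserved_specimen" would be interpreted as HumanObservation and PreservedSpecimen, while a value such as "Photo" would stay uninterpreted and need some other handling.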

albenson-usgs commented 2 years ago

I wanted to confirm I was correct before responding. You cannot publish a dataset via the IPT if it does not follow the controlled vocabulary and it must follow it explicitly (e.g. "Human Observation" will be rejected). The 620 distinct values in GBIF must be from legacy data or other methods for data sharing that are not the IPT.

image: CantPublish

tucotuco commented 2 years ago

Good catch, @albenson-usgs .

Jegelewicz commented 2 years ago

Just putting this here because I was thinking about it last night.

It seems to me that "BasisOfRecord" should be more about the evidence at hand and I think there are three types of evidence that we are dealing with.

  1. Primary (or direct) evidence - material that is or was all or part of the object identified as taxon
  2. Secondary evidence - material that indicates the presence of the object identified as taxon (footprint, nest, scat, photograph, sound recording, digital image)
  3. Tertiary evidence - a human records information about an object identified as taxon, but no material support or trace exists (field notes, publications and preparation catalogs when no material is cataloged)

Primary evidence would always be a MaterialSample (which might be fossil, preserved, living or whatever else we end up needing to make it easy for people to put stuff in bins)

Tertiary evidence would always be an Observation (which might be human, but as AI grows, maybe we will have machines reporting what they "saw" without providing evidence of some sort?)

Secondary evidence - this one is the hard grey area. Any MaterialSample that is Secondary evidence would really be a trace of some sort and any Observation would be an image or sound recording (also a trace?).

So maybe instead of the above, we have choices for BasisOfRecord as follows:

MaterialSample - material that is or was all or part of the object identified as taxon (skin, skeleton, tree, seed)

Trace - material that indicates the presence of the object identified as taxon (footprint, nest, scat, photograph, sound recording, digital image)

Observation - recorded information about an object identified as taxon when no MaterialSample or Trace exists (field notes, publications and preparation catalogs)

Probably replace "object" with "organism" for the purposes of TDWG - I am always thinking about cultural collections too.
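
The three proposed choices can be sketched as a small decision rule in Python. This is only an illustration of the proposal above. The function name, the two boolean inputs, and the precedence of MaterialSample over Trace when both exist are assumptions, not part of Darwin Core.

```python
from enum import Enum


class Evidence(Enum):
    """The three proposed basisOfRecord choices."""
    MATERIAL_SAMPLE = "MaterialSample"
    TRACE = "Trace"
    OBSERVATION = "Observation"


def classify_evidence(has_organism_material: bool, has_trace: bool) -> Evidence:
    """Pick a category from what evidence exists for the record.

    has_organism_material: material that is or was all or part of the organism.
    has_trace: material that only indicates presence (footprint, photo, sound).
    Assumption: organism material takes precedence when both are present.
    """
    if has_organism_material:
        return Evidence.MATERIAL_SAMPLE
    if has_trace:
        return Evidence.TRACE
    # Neither exists: only recorded information such as field notes.
    return Evidence.OBSERVATION
```

For example, a record backed only by field notes would come out as Observation, and one backed by a camera trap photo as Trace.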

baskaufs commented 2 years ago

@Jegelewicz I think this is a very useful way to think about categories of evidence.

I think that some of the difficulties that you are encountering at least partly stem from historical precedents on how we've categorized evidence in Darwin Core. At its core, basisOfRecord is a type designation. We have somewhat artificially separated type designations between dwc:basisOfRecord and dcterms:type, I think mostly because there is an expectation that values of dcterms:type will come from the Dublin Core type vocabulary, which doesn't include the specific kinds of types that we care about like preserved specimens.

So what we've ended up with is a somewhat odd situation where certain types of evidence like live organism images taken by people or camera trap photos, which can be typed as dcmitype:StillImage as a value for dcterms:type, then have to have an additional designation as dwc:HumanObservation or dwc:MachineObservation as a value for dwc:basisOfRecord. But then for a type of evidence like dwc:PreservedSpecimen, we can use that as a value for dwc:basisOfRecord, but we have no value at all for dcterms:type because we haven't traditionally populated that field for specimens.

To me, it doesn't make sense that we have two terms for type. It would make a lot more sense to me to just have one term for type and populate it with the most specific class that makes sense. If we want to establish a type hierarchy (i.e. ontology with subclassing), then one could infer broader categories from more narrow ones. That is, if we decided that still images are a kind of observation, then state that the evidence is a still image and anyone could reason that it's also an observation. That seems much cleaner than having some kind of artificial distinction where some kind of broad typing happens with basisOfRecord and narrower typing may or may not be done with a separate term. This is more or less what is specified in the DwC RDF guide, where it is suggested that rdf:type be used in preference to dwc:basisOfRecord or dcterms:type.
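
As a rough illustration of inferring broader categories from a single most-specific type, here is a minimal Python sketch. The hierarchy in it is entirely hypothetical: neither Darwin Core nor this group has decided that StillImage is a kind of Observation, and the class labels "Observation" and "Evidence" are assumptions used only to show the reasoning step.

```python
# Hypothetical "narrower -> broader" relations; nothing here is ratified.
BROADER = {
    "StillImage": "Observation",
    "HumanObservation": "Observation",
    "MachineObservation": "Observation",
    "Observation": "Evidence",
}


def inferred_types(most_specific: str) -> list:
    """Return the stated type followed by every broader type it implies."""
    types = [most_specific]
    seen = {most_specific}
    current = most_specific
    # Walk up the hierarchy, stopping at the top or if a cycle is detected.
    while current in BROADER and BROADER[current] not in seen:
        current = BROADER[current]
        types.append(current)
        seen.add(current)
    return types
```

With this toy hierarchy, stating only StillImage would let anyone derive Observation and Evidence as well, while a type with no declared parent, such as PreservedSpecimen here, yields just itself.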

Unfortunately, basisOfRecord is so deeply embedded in Darwin Core systems (i.e. a required term for everything) that it seems like it would be difficult to "reform" it. But I would just love to ditch basisOfRecord and dcterms:type and just use the single term rdf:type for everything. In that case, the kind of categorization that @Jegelewicz is doing would be the way that we could ontologically relate the various kinds of classes and how they are related to each other.

tucotuco commented 2 years ago

Hopefully all of this is good fodder for a basisOfRecord Task Group, if we don't end up solving it here after all.

Since I am all thumbs I give ten thumbs up for deprecating dc:type and dwc:basisOfRecord in favor of rdf:type.

But that is only part of the problem. The other, important part is that a "record" can be about a lot of things all at once, which we can't solve in Simple Darwin Core without either a term that takes a list (a hack, and in this case an "inconvenience" term) or relegating the term to an extension that can have a one-to-many relationship. ResourceRelationship could accomplish that, but I am not sure of the wisdom of that. I mean, if we are going to do something drastic, why not do something drastic that works better than what we have? This could be by allowing more complex relationships between "tables" of data to be shared in structures rather than the limited star schema we are currently working under. Each of those "tables" could have their own types and we could start to get as specific (or not) as we like with controlled vocabularies for those types.

dr-shorthair commented 2 years ago

I'm a little nervous about rolling all classification into rdf:type. It is a very powerful predicate with some special behaviours and entailments, particularly around sub-classing and participation in global domain/range constraints, which might be reaching beyond the expectations of the people who designed the classification vocabularies. I generally find that rdf:type and dcterms:type can work nicely alongside each other - the former for formal typing, and the latter for more informal classification treated more like annotation.

baskaufs commented 2 years ago

@dr-shorthair I'm unclear about the dangers of using rdf:type here. If you adopt the RDFS view of things, then there are certainly triples that you can assert that create entailments involving rdf:type. But I can't think of any entailments that would be generated directly by making an rdf:type assertion. It's a basic RDF property for asserting the class of thing we are talking about and is nearly universally understood in the RDF/Linked Data world. Isn't that exactly what we are trying to accomplish here (i.e. communicate the type of a resource)? There are no prohibitions against a particular resource having more than one type, so asserting a type doesn't preclude someone else from typing it as something else.

The other piece of this is that TDWG now has a formal policy, expressed in section 4.4.2.2 of the Standards Documentation Specification that we keep entailment-generating statements out of the basic "bag of terms" layer of vocabularies and assert them as separate "layers" on top of the basic layer when their use can be justified. So the burden of determining the implications of entailment-generating statements like range, domain, subclass, etc. falls on the people who want to add those layers on top of the basic layer. That prohibition would not apply to rdf:type declarations, because such declarations do not in themselves result in any entailments that I am aware of.

Jegelewicz commented 2 years ago

Can someone please define RDF? Also, where do I find the mythical "RDF Interest Group"?

baskaufs commented 2 years ago

@Jegelewicz Sorry about that. RDF is Resource Description Framework and it's the "language" that can be used to express metadata in a machine-readable way. It's probably the most common way that linked data is communicated automatically. For a somewhat quirky intro to RDF, you can check out this video from a TDWG meeting years ago: https://www.youtube.com/watch?v=XAGifYBiXMY

As far as I know, there isn't an RDF Interest Group. There once was an RDF task group, which created the Darwin Core RDF Guide, but it hasn't been active for several years, and hopefully has therefore been disbanded since it doesn't have a task any more.

dr-shorthair commented 2 years ago

The special thing about rdf:type is that it entails that a resource is also a member of all the parents of the object class. So my concern is that a value might be assigned to a rdf:type which has not been conceived as a class in a logical subsumption or broader/narrower hierarchy, or is taken from a vocabulary structured inconsistently internally or with respect to another set of classifiers that is already in use.

My experience is that dcterms:type provides a safe-haven for less formal classifiers, which makes them more like annotations (there are no formal entailments associated with DCMI), but where the predicate is still taken from a 'standard' RDF vocabulary.

baskaufs commented 2 years ago

@dr-shorthair I see your point. However, as I understand it, the task of this group is essentially to lay out a class hierarchy for Darwin Core classes (perhaps also including the Dublin Core classes) and to assert those classes as the recommended values for whatever term we end up using for typing (currently dwc:basisOfRecord, potentially rdf:type). So presumably if we do our job well, the situation you are worried about shouldn't usually happen for Darwin Core records.

I find it instructive to think about what has happened in Wikidata. Although the Wikibase model has an RDF serialization, it very intentionally avoids going into the RDFS realm. rdf:type doesn't really exist there as a property for describing kinds of items. But almost immediately the property P31 ("instance of") was created and it's now recommended to be used as a property of all items. And now people are building class hierarchies using P279 ("subclass of") to lay out the relationships between the informal class items that have been created. So despite not using the RDFS model directly, people have essentially re-created an RDFS analog because people have a basic need to understand what something is and how its type is related to other classes. The key difference is that the Wikidata system of typing and subclassing does not support automatically computing entailed triples. Rather, people discover the hierarchical relationships using SPARQL queries and "police" the correct use of terms using ShEx rather than trying to establish ranges and domains for properties.

My point here is that we are dealing with a special case of controlled vocabulary. We are not just trying to lead people to choose the right term to categorize the subject resource (as one might do with a SKOS-based controlled vocabulary), but we are also trying to meet this fundamental need that people have to know what kind of thing we are describing. The thing about using rdf:type is that it is the most straightforward and widely recognized linked data term for indicating what kind of thing a resource is. What you say is true: it can be part of generating entailments when combined with subclass, range, or domain declarations. But as far as I can tell, carrying out that kind of reasoning is extremely rare in our community, so it seems to me that the risk of the problems you raise is low. In contrast, the value of using an extremely well-known property is high -- it's so well known that it's the only property that has its own special abbreviation in RDF Turtle ("a"). DCMI itself recognized this fact when it recommended:

It is recommended that RDF applications implementing this specification primarily use and understand rdf:type in place of dcterms:type when expressing Dublin Core™ metadata in RDF, as most RDF processors come with built-in knowledge of rdf:type.

That recommendation was carried through in Section 2.3.1.4 of the Darwin Core RDF Guide, which specifies that rdf:type should be used for IRI-valued typing, and explains why there isn't any dwciri: analog of dwc:basisOfRecord.

One problem with my suggestion of replacing dwc:basisOfRecord with rdf:type is that the value of rdf:type should be an IRI and not a string. That would differ from the current practice of using a controlled value string for basisOfRecord. However, I don't think it would necessarily be a problem to require an IRI as long as we tell people what it should be. That's a social problem, not a technological one and I've taken the position that there shouldn't be any problem with people using properties that require an IRI value in spreadsheets.

Another advantage of going with rdf:type over dwc:basisOfRecord is that it would make it easier to formally define a subclass hierarchy using rdfs:subClassOf. That can't readily be done using values of dwc:basisOfRecord, given that they are literals and therefore can't be used as subjects in triples. That is, if this group is successful at more clearly defining the classes to be used for typing and their hierarchical relationships, it would be relatively straightforward to express that as a machine-readable ontology that could be informally or formally laid on top of the "bag of terms" layer of DwC.

Jegelewicz commented 2 years ago

I am not 100% sure I get all of this, but there is one thing that appears useful to me and that is simplification. If we can rid ourselves of BasisOfRecord in favor of something that is more globally recognized and useful, then I think it makes a lot of sense. We don't need to invent a wheel that is special for biological collections and their data. If we can use the wheel that is already out there, then we can add special hubcaps if we really need them - right?

Thanks for the video @baskaufs but also, this is a good place to learn about RDF - https://en.wikipedia.org/wiki/RDF_Schema

Could someone give me a practical example of how this would work as opposed to what is happening now?

dr-shorthair commented 2 years ago

it's so well known that it's the only property that has its own special abbreviation in RDF Turtle ("a"). DCMI itself recognized this fact when it recommended

Yes, indeed. It is so well known also because it has the strongest effects. Your story about Wikidata is illuminating - I am frustrated with WD (and OBO) because of the ways that they invented new predicates for things that already exist in RDFS, etc.

I hadn't seen that note from DCMI. It certainly makes sense for 'strong' uses of typing. But I don't think it invalidates my proposed use of dcterms:type, precisely because there are no entailments associated with DC properties.

However, if you think that DWC classes will be well-enough controlled to be able to move all usage to rdf:type then certainly that makes sense in that context.

baskaufs commented 2 years ago

@Jegelewicz How about this:

image

It's an example spreadsheet from the IPT User's Manual. The top part is the existing example, the bottom part is how I would change it. Some important points:

  1. When the class IRI is put in the table, the full IRI must be used, vs. abbreviating it like dwc:PreservedSpecimen. There are a couple of reasons for that. First of all, people don't generally understand what dwc: means (i.e. that it's an abbreviation for http://rs.tdwg.org/dwc/terms/). Second, the namespace abbreviations are not always consistent, particularly in Dublin Core where you see dc:, dcterms:, and even dcterm: used inconsistently for the same namespace. The benefit of not having to disambiguate values would be lost if the IRIs were abbreviated.
  2. In the example, the headers only have the local names and don't give any indication of the namespace. This basically assumes that all of the column header properties are in the dwc: namespace, which is not a safe assumption if people want to use dwciri: terms. I think @tucotuco could speak to this problem. I'm not sure it's a problem for a Darwin Core Archive if it has a meta.xml file that maps the columns to the actual full property IRIs, but it may be a problem for simple spreadsheets. In this case, it would be better to have a header like rdf_type than just type given that rdf:type, dc:type, and dcterms:type all have the same local name.
  3. This does not deal with the problem that @tucotuco brought up about a single row being "about" more than one type of thing. So if you have a specimen and an image of the specimen being described in the same row, you'd have two types to specify. But that isn't the problem we are trying to fix here.
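
A minimal sketch of point 1, assuming a provider keeps abbreviated values internally and expands them to full IRIs before export. The prefix map and the `expand_curie` helper are hypothetical and are not part of the IPT or of Darwin Core; only the dwc: expansion is taken from the comment above.

```python
# Hypothetical prefix map used only for this illustration.
PREFIXES = {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "dcterms": "http://purl.org/dc/terms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand_curie(value: str) -> str:
    """Expand an abbreviated value such as 'dwc:PreservedSpecimen' to a full IRI.

    Values that are already full IRIs, or that use an unknown prefix,
    are returned unchanged.
    """
    if value.startswith(("http://", "https://")):
        return value
    prefix, sep, local_name = value.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local_name
    return value


# One spreadsheet row, with the header written as rdf_type (see point 2).
row = {"rdf_type": "dwc:PreservedSpecimen", "catalogNumber": "12345"}
row["rdf_type"] = expand_curie(row["rdf_type"])
print(row["rdf_type"])
```

Leaving unknown prefixes untouched is a deliberate choice in this sketch: guessing at an expansion for something like dcterm: would reintroduce the ambiguity the full IRI is meant to remove.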
Jegelewicz commented 2 years ago

@baskaufs thanks! That makes sense and I think it should be easy to get people on board with that as long as we provide clear instructions for use.

ghwhitbread commented 2 years ago

Basis of record: An alternative point of view ?

Darwin Core (and the aggregators that need it) exists because there are information systems out here that are built for the purposes of managing natural history collections (specimens, images, observations, samples) and able to provide meta-data extracts that look like occurrence records. In making these data generically available, Darwin Core has played a critical role in the rise of biodiversity informatics and also for the status of Collections.

For the most part, these Darwin Core records are answers to a question. They are not persistent objects. The persistent objects (things like taxon_concept, name, agent, locality, event, observation, annotation (determination), treatment, work, collection item (specimen, sample, image), site, enclosure) in our information systems, and their relationships and behaviours, have been carefully modelled to meet the business requirements intrinsic to the management of Collections. They are already implemented and evolving, come with their own entailments, and critically, do not exist for the primary purpose of delivering Darwin Core.

In normative Darwin Core the collection objects that are the subject of this discussion - the evidence for occurrence - provide the real world relationship that makes a Darwin Core record useful at all. It is not the role of Darwin Core to model them. These Collection objects live in the realm of collection management systems, and for an occurrence meta-data interchange standard we simply need a way to convey the nature of the underlying objects - the basis of each record - and a means for attribution to the sources.

To expand on John’s concise term history: the “basis of record group” was previously used in HISPID (1989 - TDWG standard (1996)) and in ABCD as “record source”, taken from ABIS (1976)*, as a simple, and unambiguous, statement of origin as an aid to determine the falsifiability of reified assertions and assumptions derived from meta-data extracts.

I would advise a simple, though extensible, vocabulary. And as @afuchs1 suggests - a basis of record category for bundling object identifiers. If modelling collection objects is your thing, please keep them separate from Darwin Core.

Very early in this discussion, @tucotuco stated that Darwin Core does not have sub-classes. I believe that is partly because there are no real classes either. Darwin Core “classes” are in fact categories that can be used to group (rearrange) terms into records, without entailment, for the purpose of meta-data transfer, most often from Collections to Aggregators, beyond the constraints of types at the record’s origin.

This is why Darwin Core is successful.

*ABIS: Australian Biotaxonomic/Biogeographic Information System (Australian Biological Resources Study - ABRS)

Jegelewicz commented 2 years ago

a basis of record category for bundling object identifiers. If modelling collection objects is your thing, please keep them separate from Darwin Core.

I am not sure that I understand this statement. I feel that "modelling collection objects" and Darwin Core are inextricably connected, but maybe I am misunderstanding something?

Darwin Core “classes” are in fact categories that can be used to group (rearrange) terms into records, without entailment, for the purpose of meta-data transfer, most often from Collections to Aggregators, beyond the constraints of types at the record’s origin.

I think that it is very difficult to make anyone believe that Darwin Core is "classless" when the "classes" are identified with the definitions. I say this because I have to explain this to others and I am not 100% sure that I can confidently do so. But, this statement also makes me think that a lot of collections are doing Darwin Core wrong AND that the Darwin Core archive facilitates this? If ALL the aggregators are interested in are occurrences, maybe we shouldn't include anything about the collection object in the Darwin Core archive sent to them other than the fact that BasisOfRecord = material evidence? Is that what this statement and the one above are getting at?

a simple, and unambiguous, statement of origin as an aid to determine the falsifiability of reified assertions and assumptions derived from meta-data extracts.

To me, there are a lot of jargon terms in this statement. If we are going to make this useful to everyone, we need to explain in the most plain terms possible what we are trying to get at. We are asking collections staff to be experts in biology, collections management and now computing language - and for some it will be OK, but for a lot, it will be one more thing they don't have time to figure out and that will lead to poor data quality.

baskaufs commented 2 years ago

A response to @ghwhitbread's comment:

It may be true that Darwin Core originated to serve information systems for managing natural history collections. But at this point, use of Darwin Core goes way beyond that. It's used broadly with remote sensing data, camera trap data, and data from citizen science projects like iNaturalist and eBird, and these kinds of projects may have no connection to natural history collections (unless Greg intends these kinds of "collections" to also be considered "natural history collections"). Assuming that there is always going to be a "basis of a record" that is a physical object sitting in a museum simply does not hold for many of these new kinds of records described by Darwin Core. And those records may have evidence whose type doesn't fall into a current Darwin Core class that's historically been used as a value for dwc:basisOfRecord. More often in the examples I've given, it will be a media item like a still image or sound recording.

I also disagree with the statement that Darwin Core has no real classes. There are a number of classes formally defined as such among the Darwin Core terms: dwc:Event, dwc:MaterialSample, dwc:Occurrence, dwc:PreservedSpecimen, etc., all of which formally have the type rdfs:Class. What Darwin Core does NOT have are formal domain declarations associating properties with those classes. So in that sense, the classes are used as Greg said: to organize the terms. But that does not prevent them from being used as we might use other terms that are classes, i.e. as values for rdf:type or to construct formal class hierarchies with subclass declarations. Those hierarchies may never formally become part of Darwin Core, but that doesn't mean they wouldn't be useful for clarifying relationships or even machine reasoning.
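
As a rough illustration of the paragraph above, here is a sketch that stores a few triples as plain Python tuples rather than in an RDF library. The subclass statement and the example.org specimen IRI are hypothetical and are not part of Darwin Core; they only show how a class typed as rdfs:Class could serve as an rdf:type value and sit in a hierarchy.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
RDFS_SUBCLASSOF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
DWC = "http://rs.tdwg.org/dwc/terms/"

triples = {
    # Class declarations of the kind described above.
    (DWC + "PreservedSpecimen", RDF_TYPE, RDFS_CLASS),
    (DWC + "MaterialSample", RDF_TYPE, RDFS_CLASS),
    # Hypothetical subclass declaration, NOT asserted by Darwin Core.
    (DWC + "PreservedSpecimen", RDFS_SUBCLASSOF, DWC + "MaterialSample"),
    # Hypothetical resource typed with a Darwin Core class.
    ("http://example.org/specimen/1", RDF_TYPE, DWC + "PreservedSpecimen"),
}


def types_of(resource, graph):
    """Return asserted types plus those reached by following subClassOf."""
    found = set()
    pending = [o for s, p, o in graph if s == resource and p == RDF_TYPE]
    while pending:
        cls = pending.pop()
        if cls in found:
            continue
        found.add(cls)
        # Walk one step up the (hypothetical) class hierarchy.
        pending.extend(o for s, p, o in graph if s == cls and p == RDFS_SUBCLASSOF)
    return found


print(sorted(types_of("http://example.org/specimen/1", triples)))
```

Nothing in the sketch relies on domain declarations, which matches the point that their absence does not prevent the classes from being used for typing.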

albenson-usgs commented 2 years ago

+1 to Steve. I just want to add that there are data where there is no evidence in the way it's being discussed here, such as coral reef monitoring using the point line intercept method, where the only evidence is that someone wrote it down on a piece of paper - not even a still image or a sound recording. The same goes for small mammal trapping, vegetation surveys, fish trawls, ...

Jegelewicz commented 2 years ago

only evidence is that someone wrote it down on a piece of paper

See https://github.com/tdwg/material-sample/issues/11#issuecomment-912758014

dr-shorthair commented 2 years ago

@baskaufs I checked with @tombaker about this from the 2008 DC-in-RDF guidance:

It is recommended that RDF applications implementing this specification primarily use and understand rdf:type in place of dcterms:type when expressing Dublin Core™ metadata in RDF

His response:

Gosh, 2008 was a pretty long time ago...

I think this was well-intentioned, to promote the use of rdf:type when one needs to assert that something is an instance of a class. However - with wisdom of hindsight - we went a step too far in the [DCMI] Usage Board by giving dcterms:type a range of rdfs:Class. "Too far", because there are perfectly legitimate assertions of typeness that one might want to express in RDF, but not with an RDF or OWL class. The best example, from Antoine [Isaac] 's experience in Europeana, is the use of dcterms:type with a SKOS concept.

In the latest published version of DCMI Metadata Terms we corrected this overly tight definition by dropping the range of dcterms:type (this was also discussed in DXWG, see https://github.com/w3c/dxwg/issues/1362).

I added

[the DWC folk] are reading this as essentially deprecating or retiring dcterms:type in an RDF context. Is that your understanding?

To which his response was

No - on the contrary, I think it is good that there is a type predicate that is not limited to a range of OWL or RDF class. For expressing an OWL or RDF class, use rdf:type, but dcterms:type can be useful for any other sort of type assertion.

... which is all pretty much what I'd been arguing.

(Tom is probably the longest continuous member of the DCMI leadership.) (Tom gave me permission to quote from our email exchange.)

baskaufs commented 2 years ago

Note: ALL CAPS used here as in RFC 2119 based on normative sections of the DwC RDF Guide.

@dr-shorthair I would certainly trust what @tombaker says about DCMI. Thanks for sharing the interesting W3C discussion, which I had not seen.

I think what this boils down to is a decision about a tradeoff between the advantage of using the most basic and widely used property for type (rdf:type) and the risk that doing so creates problems for machine reasoning, which would argue for a less constrained term like dcterms:type.

To some extent, this is a moot point. The current requirement for Darwin Core is that in RDF rdf:type SHOULD be used to indicate the type of a resource when an IRI value is provided. Rightly or wrongly, since 2015 this has been enshrined as normative content in a prescriptive document that's an official part of Darwin Core and would require going through an official change process to make it be otherwise. See Table 3.1 as well as Section 2.3.1 and its normative subsections. So using something other than rdf:type for designating type isn't really under consideration here and would have to be raised as a term change request.

What we are actually discussing is whether to essentially get rid of dwc:basisOfRecord as superfluous. If rdf:type remains as the RECOMMENDED type term as specified in the RDF guide, it does not make sense to me that we cause disruptions to existing implementations by substituting one OPTIONAL type term for another (See Section 2.3.1.4 of the RDF Guide).

Did we get this right in the RDF Guide? I don't know -- we made the best decisions we could based on how we understood the landscape in 2015 and on the expertise we had to draw from in the Task Group and through the expert review. Certainly that landscape has changed in the last 6 years. It seems to me that there is certainly much more interest in Linked Data (linking stuff in a simple way) and less interest in Semantic Web (machine-reasoned entailments) and it's been suggested that the RDF Guide be revised in light of current art. But a key aspect of changing TDWG standards is the "stability" requirement (VMS section 3.1). That is, the desire for a better outcome ("efficacy") has to be balanced against the possibility of breaking stuff ("stability"). I think that is the key question that has to be weighed here with respect to whether we should just get rid of dwc:basisOfRecord in favor of rdf:type.

deepreef commented 2 years ago

I'm WAY behind on many things, this thread included. I know I'm late to the party, but a few comments:

But I would just love to ditch basisOfRecord and dcterms:type and just use the single term rdf:type for everything.

I'll raise @tucotuco's ten thumbs up to twenty! I'll also throw in a few heart emojis as well.

But that praise aside, this is the comment (from @tucotuco) that made my heart go pitter-patter:

But that is only part of the problem. The other, important part is that a "record" can be about a lot of things all at once, which we can't solve in Simple Darwin Core without either a term that takes a list (a hack, and in this case an "inconvenience" term) or relegating the term to an extension that can have a one-to-many relationship. ResourceRelationship could accomplish that, but I am not sure of the wisdom of that. I mean, if we are going to do something drastic, why not do something drastic that works better than what we have? This could be by allowing more complex relationships between "tables" of data to be shared in structures rather than the limited star schema we are currently working under. Each of those "tables" could have their own types and we could start to get as specific (or not) as we like with controlled vocabularies for those types.

Yes! Yes! Yes! But... I wouldn't write off the role of ResourceRelationship just yet. I see it less as an issue within DwC itself, and more as an opportunity to replace the star schema approach in IPT with a system that puts ResourceRelationship at the center, and bundles instances of Class-specific term values for the relevant metadata. In other words, ResourceRelationship becomes the only "core", and all the other DwC classes and the data associated with terms organized within them become the "extensions" around that core. If we want to get really bold, we can deprecate the recursive "foreign key" terms (parentEventID, [xxx]NameUsageID, etc.) -- basically anything with the 'ID' suffix that points to another instance in the same class -- and represent these associations within instances of the core ResourceRelationship. When DwC establishes classes for things like "Agent" and "Reference", then other xxxID terms (recordedByID, identifiedByID, nameAccordingToID, namePublishedInID, etc.) could likewise be represented as instances of ResourceRelationship. This approach would not so much be an effort to "ontologize" DwC or turn it into a bunch of "tables"; but rather, it would be in how the DwC terms are structured in an exchange mechanism (like IPT) -- essentially replacing the star schema approach (which served us well in its time, but we seem to be on the verge of outgrowing it as a community).

The way to visualize what I'm trying to suggest here is not so much as a star schema, but as a hub and spokes. The hub is a set of ResourceRelationship instances. Each spoke connects instances in the hub to instances in the various DwC "extension" classes, each with a set of instances linked to either resourceID or relatedResourceID in the hub. (i.e., each row in the hub would connect to two rows in one or two of the extensions, one via resourceID and one via relatedResourceID). This would essentially eliminate the existing problem of 'a single row being "about" more than one type of thing'. Each row (instance) would be bundled within its relevant class, and therefore would be self-evidently typed (not sure exactly how this translates in the RDF context, but I would assume each spoke would be represented by a rdf:type value). If you want to think about it in terms of relational databases, the spokes (classes) are the tables, and ResourceRelationship is the universal "join" table, accommodating many-to-many, one-to-many, and even one-to-one relationships within and between records in the various tables.
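
The hub-and-spokes idea can be sketched with plain dictionaries. This is only an illustration under stated assumptions: the identifiers, the relationship labels, the choice of spokes, and the `class_of` and `related_to` helpers are all made up, and nothing here reflects how IPT or any aggregator actually stores data.

```python
# Spokes: one "table" per class, keyed by identifier (identifiers are invented).
spokes = {
    "Occurrence": {"occ:1": {"recordedBy": "A. Collector"}},
    "MaterialSample": {"ms:1": {"preparations": "skin"}},
    "Multimedia": {"img:1": {"format": "image/jpeg"}},
}

# Hub: ResourceRelationship rows acting as the universal "join" table.
hub = [
    {"resourceID": "ms:1", "relationshipOfResource": "evidence for",
     "relatedResourceID": "occ:1"},
    {"resourceID": "img:1", "relationshipOfResource": "image of",
     "relatedResourceID": "ms:1"},
]


def class_of(identifier):
    """Each row is typed by the spoke it lives in."""
    for class_name, rows in spokes.items():
        if identifier in rows:
            return class_name
    return None


def related_to(identifier):
    """List (relationship, other identifier, other class) for hub rows touching identifier."""
    results = []
    for row in hub:
        if row["resourceID"] == identifier:
            other = row["relatedResourceID"]
            results.append((row["relationshipOfResource"], other, class_of(other)))
        elif row["relatedResourceID"] == identifier:
            other = row["resourceID"]
            results.append(("inverse of " + row["relationshipOfResource"],
                            other, class_of(other)))
    return results


for relation in related_to("ms:1"):
    print(relation)
```

Because every row sits in exactly one spoke, no single row has to be "about" more than one type of thing; the specimen, its image, and the occurrence are three rows joined through the hub.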

A radical departure of this sort couldn't happen overnight. It would need to live in parallel with the existing flattened record star schema approach to allow content providers to transition over, and folks like GBIF/iDigBio would need to "flatten" the new system to the old method to aggregate content. But eventually (as measured by usage statistics), the star schema approach would attenuate.

RogerBurkhalter commented 2 years ago

I could get on board with this, and yes, it would best transition as a gradual process. That would also give time for aggregators to transition interfaces to reflect changes. Current GBIF search interfaces prominently feature BasisOfRecord as a sorting field (although not indexed), this would need modifying. I am intrigued by the "only evidence is that someone wrote it down on a piece of paper" comment earlier in the thread. Tens to hundreds of thousands of published observation records exist in paleontologic and geologic measured section records naming observations at specific locations, many not supported by collections. These observations range from general "bivalves and corals" to named taxa occurrences based on the expertise of the observer. Being published, they are certainly "written on paper".

deepreef commented 2 years ago

I am intrigued by the "only evidence is that someone wrote it down on a piece of paper" comment earlier in the thread. Tens to hundreds of thousands of published observation records exist in paleontologic and geologic measured section records naming observations at specific locations, many not supported by collections. These observations range from general "bivalves and corals" to named taxa occurrences based on the expertise of the observer. Being published, they are certainly "written on paper".

We deal with these all the time. Typical scenarios:

There are other cases as well. Basically, we treat HumanObservation very broadly, allowing for Events with broad spatial and temporal scopes, but anchored to at least some source of the information, even if the information is very non-specific.

baskaufs commented 2 years ago

@RogerBurkhalter @albenson-usgs With respect to observations "written on paper" that were published, that seems fairly similar to the intention of the class http://rs.tdwg.org/dwc/terms/MaterialCitation. The normative definition specifically mentions "specimens", but the non-normative Examples includes "An occurrence mentioned in a field note book."

This seems like the best type value option of an existing DwC class for the type of an observation that is published but not documented by a media item or physical specimen. Going outside of DwC, dcmitype:Text would be "well known" and not limited to published material. foaf:Document is also very widely used, but is interpreted so broadly that it includes other types of resources like images that we have narrower options for.

albenson-usgs commented 2 years ago

@baskaufs I did watch that discussion of MaterialCitation and I personally feel some qualms about using it for the observations we're talking about. I feel there is actually a distinction to be made between records submitted by Plazi and the ones I'm talking about and that grouping them together in this way may cause confusion. Similarly applying dcmitype:Text doesn't seem quite right to me either as most data sheets are not retained. They are entered into a database or spreadsheet and usually not kept. Seems to me that dcmitype:Dataset would actually be more appropriate but I'm not sure what users are going to do with that information or how it's useful.

I would be curious to know how type is applied to climate data observations. For instance the BATS observations of temperature, salinity, CO2, etc- what type would be applied to these? As far as I know there is no image, specimen, sound recording, etc. It's recorded by a CTD and loaded directly to a computer.

Jegelewicz commented 2 years ago

@albenson-usgs could you elaborate on your qualms? Just looking at the BATS stuff makes me think that these are automatically recorded by the instruments and as such are "published" at that time. I think we need to think about "digital" writing in the same way we think about pen and paper and the HOW it was recorded (written) would be an attribute of the observation.

baskaufs commented 2 years ago

@albenson-usgs @Jegelewicz I think what you are talking about is extremely important and I've struggled in my mind to imagine how this kind of documentation should be modeled.

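I think the key thing here is to keep in mind the use cases for why we want to know dwc:basisOfRecord/rdf:type. In my mind there are two big ones:

  1. as a categorization mechanism for searching/sorting/filtering
  2. to understand the kind of evidence that is available to document an occurrence or determination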

In the first use case, users might only be interested in occurrences based on (for example) specimens and want to filter on that. In the second use case, users might want to understand what it would take to be able to examine the underlying evidence (must I go to a museum to look at it? can I download an image? is there a scan or pdf of the notes?). (This line of thinking is similar, but not exactly the same as the breakdown from @Jegelewicz here.)

It seems like the latter use case is what could help drive the decision making about how to organize the classes used for typing. If the evidence were some kind of physical thing, then we'd need to acquire or visit it to assess its validity as evidence, and it might suggest that we should check if it's been digitized in some way. If the evidence were some kind of electronic surrogate (sound recording, video, still image), then we'd just download it if it were available. If there were no known evidence, we'd like to know that so that we don't waste time looking for something we can't get.

So where do the problematic kinds of items brought up by @albenson-usgs fall within these three categories? Data directly recorded from an instrument would be in the second category if it could be downloaded in some raw original file format. If data sheets weren't retained, then I suppose it falls in the third category (we have the data but no "evidence" to support it). Something acquired from a publication would probably be in the second category if it had been digitized -- there's the possibility that it might only be a physical book or paper item, but if that were the case, there probably wouldn't be an electronic record of its data anyway.

Whether or not a dcmitype:Dataset would be a useful value depends, I guess, on whether you are considering it to be something different from the metadata about the occurrence itself. If no data sheets or original spreadsheets were retained, it doesn't seem good to assign a dcmitype:Dataset to describe the evidence. I'd rather flag it as lacking evidence. If there were some spreadsheet that were preserved for download, I suppose dcmitype:Dataset might make sense, but some class for "spreadsheet" would probably be more sensible.

I'm suspicious that @deepreef is going to say something about documenting the state of what's in somebody's head as evidence or something like that. I think something like that came up in a previous round of discussion on this topic. But I'm advocating the use-case driven approach, so even if the evidence was what was in somebody's brain, it's not visitable (category 1) or downloadable (category 2), so it goes into category 3 (no obtainable evidence).

If a class hierarchy were developed, I'd have these three categories at the top. The narrower categories could be asserted directly as values for rdf:type/basisOfRecord for searching/sorting/filtering but one could traverse upwards in the hierarchy to answer the ultimate question of whether it's something I can download, or need to visit or borrow, or just forget about.
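
A small sketch of that upward traversal, assuming a hierarchy with three top categories. Every class name, the parent mapping, and the `top_category` and `how_to_access` helpers are hypothetical; no such hierarchy exists in Darwin Core.

```python
# Hypothetical hierarchy: narrower class -> broader class.
PARENT = {
    "PreservedSpecimen": "PhysicalEvidence",
    "FossilSpecimen": "PhysicalEvidence",
    "StillImage": "DigitalEvidence",
    "Sound": "DigitalEvidence",
    "UnretainedDataSheet": "NoObtainableEvidence",
}

# What a user would have to do for each of the three top categories.
ACCESS = {
    "PhysicalEvidence": "visit or borrow",
    "DigitalEvidence": "download",
    "NoObtainableEvidence": "not obtainable",
}


def top_category(cls):
    """Walk up the hierarchy until a class with no parent is reached."""
    while cls in PARENT:
        cls = PARENT[cls]
    return cls


def how_to_access(cls):
    """Answer the 'can I get at the evidence?' question for a narrower class."""
    return ACCESS.get(top_category(cls), "unknown")


print(how_to_access("StillImage"))
print(how_to_access("PreservedSpecimen"))
```

The narrower names stay available for searching, sorting, and filtering, while the top category answers the access question.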

Not sure what would make sense as a class value if there weren't evidence. rdf:nil? I'm sure there would probably be something wrong with that. Intuitively it might make sense to say that the basis of a record was nothing, but not to say that rdf:type is "nothing". If something didn't have a type, you'd omit the property, but that is ambiguous as to whether the type is unknown or perhaps that there isn't any defined type that fits. This is making my brain hurt...

albenson-usgs commented 2 years ago

I did attempt to provide some of them in my comment above but I'll try to expand. To me it seems strange to put these observations in the same bucket as these ones. They seem quite different to me. If you read the justification for MaterialCitation here specifically "material citations are extracted from publications and submitted as part of data sets to GBIF and reused in studies" that is not where the observations I'm talking about are coming from and to me it seems useful to be able to distinguish between them. I don't think dcmitype:Text is appropriate for the data I'm talking about but I do think it's appropriate for the Plazi records that gave rise to the MaterialCitation proposal. I think we need to be clear about what we mean by "published". If a CTD cast is logged on a specific computer and not shared, is it really published? I don't think I agree that when observations are automatically recorded by instruments that we can consider them published. They still need to be shared before that's the case. I bring the BATS example up because I think we should be looking at how other communities are handling these types of observations and make sure we're in line with what they are doing or else have a logical reason for why we are not (I doubt that the BATS observations use dcmitype:Text for instance and I don't know what they are using, they might not be using rdf:type at all). Finally I'm concerned that these types of observations might be considered as kind of an afterthought to how the other observations are being talked about and thought about in this thread. I want them to be on a similar footing to all the other observations.

Steve's message came in as I was typing this up.

One use case for basisOfRecord that I do know of is that GBIF is using it to determine whether or not institutionID, institutionCode, collectionCode, and collectionID should be matched to GRSciColl. If the data have basisOfRecord = HumanObservation or MachineObservation then they are not flagged for not finding a match in GRSciColl.

Another use case for basisOfRecord: (rightly or perhaps wrongly) the biologging task team has recommended using basisOfRecord = HumanObservation for when the animal is in hand and the device is being attached to the animal, and MachineObservation for when the device is recording the location of the animal.

Based on those two I'm concerned that placing too much emphasis on the second bullet of Steve's (to understand the kind of evidence that is available to document an occurrence or determination) means we will lose the ability for the first bullet (as a categorization mechanism for searching/sorting/filtering).

All of my interest in this is based on the fact that currently basisOfRecord is a required term. If it is not required then I think my thoughts about all of this may have different results. But I'd like to understand why it was made required in the IPT with a controlled vocabulary and if that provided the kinds of results people were expecting when that decision was made.

deepreef commented 2 years ago

I also have a few qualms about how MaterialCitation is intended & defined. As I alluded to above, these qualms relate to the meaning of "Material" and "Citation". @baskaufs already noted the inconsistency between the definition and the non-normative documents, and the two words that are problematic in the normative definition are "specimen" (~'Material') and "publication" (~'Citation'). Both are contradicted by the non-normative example "An occurrence mentioned in a field note book."

There was some discussion of this here and here.

Specimen: In our other discussions on MaterialSample, we have excluded non-vouchered observations from scope (which I think is good). But I'm not so sure this exclusion is anchored to the "Material" part, or the "Sample" part. I'm inclined to go with the latter, which suggests that "Material" in MaterialCitation doesn't necessarily imply "physical matter extracted from nature"; but rather simply "physical matter". Observed organisms are certainly physical matter, and therefore "Material", and so I think it's fair to say that documenting unvouchered observed organisms in nature could fall within scope.

Publication: How is this defined (i.e., what is in scope to be "cited")? A field notebook is a single-copy document, and probably would fit only the very broadest definition of that term. @myrmoteras indicated he preferred a more restricted scope, limiting MaterialCitation instances to those documented in traditional/peer-review-type published works. I disagree, in part because that's an entirely arbitrary boundary for what is "published" and not (I see endless debates about many edge cases), but mostly because in general for DwC I think it's best to extend the allowable scope very broadly to accommodate a wide array of use-cases. As biologists we care about information included in unpublished single-copy documents (like field notebooks); so how else would we class this information for exchange?

Obviously, the actual term (MaterialCitation) doesn't dictate its definition or scope; but it's a good place to start when discussing what the definition and scope should be.

deepreef commented 2 years ago

Woah -- more posts on this coming in faster than I can chime in....

I feel there is actually a distinction to be made between records submitted by Plazi and the ones I'm talking about and that grouping them together in this way may cause confusion.

I agree there is a distinction, but I think that distinction is best captured through the value of some property term that would be organized in the MaterialCitation class (see here for additional thoughts on this). I don't think that should be the basis for forcing this sort of information into a different DwC class. Certainly Plazi will be generating gazillions of instances of MaterialCitation from publications, and that's great. But it shouldn't stop us from using the same concept to accommodate near-identical properties from unpublished sources, even if that wasn't included in the original justification. If I recall correctly the original justification for MaterialSample was to be able to deal with soil samples and jars of water and such, but has taken on a much broader role to encompass specimens, etc. (which I see as a GOOD thing).

As noted earlier in this discussion, what we seem to be talking about is what @baskaufs and I and others have been making noise about for a few years, which is the notion of "Evidence". Part of the problem we're trying to untangle here is the conflation of an Occurrence (presence of an Organism at an Event) and Evidence to support the truth of that Occurrence (specimens, images, videos, sound recordings, human/machine observations, documented assertions, etc.). We have mechanisms in DwC to accommodate specimens (MaterialSample), multimedia (AudubonCore), and human/machine observations (sort of); but the one subclass of "Evidence" that had been missing is "documented assertions". One approach is to treat these as Observations (as I have been doing; see above). But it seems to me that MaterialCitation provides the perfect class to round-out the various forms of Evidence that support the truth of Occurrence instances (and also support the truth of Identification instances; but that's another topic of discussion).

I'm suspicious that @deepreef is going to say something about documenting the state of what's in somebody's head as evidence or something like that.

Nope -- wasn't going there (not entirely sure where "there" is, actually). I prefer a term along the lines of "documented assertions". Sure Observations are in someone's head, but absent some fancy CT scan of the brain while the observation was taking place, I would say that Observations don't exist until they are documented in some way. It's just that I'm liberal about the scope of what form that documentation might take (without having to parse "published" from "unpublished" sources).

dr-shorthair commented 2 years ago

Not sure what would make sense as a class value if there weren't evidence. rdf:nil?

RDF Open World Assumption does not require a value. OTOH data validation using shapes might do. An explicit 'nil' is a good thing, if that is what you need to say.
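
A minimal sketch of that difference, using plain Python tuples in place of a real RDF library. The `ex:` names and the `ex:hasEvidence` predicate are hypothetical, and the second function only imitates what a minimum-cardinality constraint in a shapes language such as SHACL would do:

```python
# Triples are plain (subject, predicate, object) tuples; all ex: names are hypothetical.
EVIDENCE = "ex:hasEvidence"

graph = {
    ("ex:occ1", EVIDENCE, "ex:specimen42"),
    ("ex:occ2", "rdf:type", "dwc:Occurrence"),  # no evidence statement at all
}

def evidence_open_world(graph, subject):
    """Return the stated evidence values, or None meaning 'unknown' (not 'none exists')."""
    values = {o for s, p, o in graph if s == subject and p == EVIDENCE}
    return values or None

def violates_min_count(graph, subject, predicate=EVIDENCE, min_count=1):
    """Closed-world style check: flag a subject with fewer than min_count values."""
    count = sum(1 for s, p, o in graph if s == subject and p == predicate)
    return count < min_count
```

Under the open-world reading, `ex:occ2` simply has unknown evidence; only the shape-style check treats the missing value as a problem.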

cboelling commented 2 years ago

An explicit 'nil' is a good thing, if that is what you need to say.

This is a good thing only if it is meant to express that the type of evidence is unknown (which is probably what the quoted statement was meant to convey?). Asserting that there is no evidence for a record makes no sense in a scientific context - unless this is meant to convey that the occurrence record is invalid because there is no evidence for it.

baskaufs commented 2 years ago

@cboelling The quoted sentence makes better sense in the context of my earlier comment. What I was talking about was "category 3" where the evidence is not available because it wasn't preserved or isn't accessible (e.g. it sits on somebody's random hard drive). So what would need to be conveyed is that evidence is not available for examination, either by viewing a physical object or acquiring a digital one. I was not trying to imply that there never was any evidence, just that it's not available.

@dr-shorthair noted the Open World assumption of RDF. That means that if we don't supply a value, one cannot assume that there isn't one, or to put it in this context, one cannot assume that the evidence is not available. My concern was whether use of rdf:nil to provide a value would be an abuse of the intention of rdf:nil. It's in the non-normative section of the RDFS specification about collections. So that implies to me that its definition is a bit loose. The definition given there is:

The resource rdf:nil is an instance of rdf:List that can be used to represent an empty list or other list-like structure.

That it is an instance of rdf:List is what gives me pause. However, as I recall, I've seen it used somewhere else to indicate that there wasn't a value in a non-List situation. What exactly would it "mean" if I made the statement:

ex:my_occurrence rdf:type rdf:nil.

That's basically what would happen if rdf:nil were used as a value. It would also be problematic because rdf:type would imply that it's a class, but the spec says it's an instance, which would push it into OWL Full as both a class and an instance.
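
To make the dual role concrete, here is a rough sketch in plain Python (no RDF library; triples are just tuples, and `ex:my_occurrence` is the example name from the statement above). It only detects that a term appears on both sides of rdf:type; it does not do any actual OWL reasoning:

```python
RDF_TYPE = "rdf:type"

triples = [
    ("rdf:nil", RDF_TYPE, "rdf:List"),          # per the quoted definition: nil is a list instance
    ("ex:my_occurrence", RDF_TYPE, "rdf:nil"),  # the statement in question
]

def punned_terms(triples):
    """Return terms used both as a subject and as an object of rdf:type."""
    instances = {s for s, p, o in triples if p == RDF_TYPE}
    classes = {o for s, p, o in triples if p == RDF_TYPE}
    return instances & classes
```

With these two triples the function returns `{"rdf:nil"}`. The check is deliberately naive: it would also flag any class that is itself declared with rdf:type, so treat it as an illustration of the conflict rather than a validator.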

Basically, I think there probably is a better value or method to use to indicate that evidence is unavailable. I guess we could just mint an IRI. But this is why my head hurts when I try to think this out.

dr-shorthair commented 2 years ago

Right. rdf:nil is the stopping point for lists (which are built as a series of links in RDF). I'm not sure it would be wise to use it for much else.

Like I said an explicit 'nil' might be useful. rdf:nil might not be the best choice though.

dr-shorthair commented 2 years ago

FWIW - OGC defined a suite of different nil values or nil-reasons here: http://www.opengis.net/def/nil/OGC/0/ They were originally conceived for GML (i.e. in the XML era), and have been partially ported to a linked data environment. The above detection range ones were added to support instrumental observations and measurements.

I'm by no means suggesting that the OGC values are fit for purpose here, but just holding them up as an example of how this issue has been dealt with elsewhere.
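
For illustration only, a set of minted nil-reason IRIs for evidence might look something like the sketch below. The namespace, the `evidenceStatus` predicate, and the three reasons are all hypothetical; they are not OGC or Darwin Core terms, and they simply mirror the cases raised earlier in this thread (unknown, not preserved, not accessible):

```python
from enum import Enum

BASE = "https://example.org/evidence-status/"  # hypothetical namespace, not a real vocabulary

class EvidenceStatus(Enum):
    UNKNOWN = BASE + "unknown"               # the type of evidence has not been stated
    NOT_PRESERVED = BASE + "notPreserved"    # evidence existed but was not kept
    NOT_ACCESSIBLE = BASE + "notAccessible"  # evidence exists but cannot be examined

def evidence_statement(occurrence_iri, status):
    """Build one (subject, predicate, object) triple stating why evidence is missing."""
    return (occurrence_iri, BASE + "evidenceStatus", status.value)
```

An explicit statement such as `evidence_statement("ex:occ1", EvidenceStatus.NOT_PRESERVED)` says something definite, which leaving the value out cannot do under the open-world assumption.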

albenson-usgs commented 2 years ago

Based on the call this morning (anyone feel free to correct me) it seems to me that we need to determine:

  1. If basisOfRecord is only supposed to be an indication of the evidence that is available to document an occurrence or determination (and NOT as a categorization mechanism for searching, sorting as it currently is for e.g. PreservedSpecimen and FossilSpecimen)?
  2. If yes to the above, should it be replaced with rdf:type?
  3. If yes, does the dcmitype vocabulary work for our needs to document the evidence for an occurrence or determination?
  4. Separately (but just as important to me) should basisOfRecord / rdf:type be a required term in the IPT with a controlled vocabulary?

Apologies if this is off base. Just trying to determine where the decision points are to resolve this.
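
As a rough picture of what point 4 could mean in practice, here is a sketch of a required-term check against a controlled vocabulary. The vocabulary is just the list of example values in the current basisOfRecord entry; the function name and the `required` flag are assumptions for illustration and say nothing about how the IPT actually validates records:

```python
# Example values listed in the current Darwin Core entry for basisOfRecord.
RECOMMENDED = {
    "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen", "MaterialSample",
    "Event", "HumanObservation", "MachineObservation", "Taxon", "Occurrence",
    "MaterialCitation",
}

def check_basis_of_record(record, required=True):
    """Return a list of problems for one record (a dict of term name -> value)."""
    value = (record.get("basisOfRecord") or "").strip()
    if not value:
        return ["basisOfRecord is missing"] if required else []
    if value not in RECOMMENDED:
        return [f"basisOfRecord {value!r} is not in the controlled vocabulary"]
    return []
```

A record with "Photo" would be flagged, which is exactly the situation raised at the top of this issue: deviating from the class labels is allowed today, but nobody downstream knows what to do with it.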

tucotuco commented 2 years ago

This might be a good thing to discuss in the second meeting today if possible.

On Wed, Sep 15, 2021 at 5:44 PM Abby Benson @.***> wrote:

Based on the call this morning (anyone feel free to correct me) it seems to me that we need to determine:

  1. If basisOfRecord is only supposed to be an indication of the evidence that is available to document an occurrence or determination (and NOT as a categorization mechanism for searching, sorting as it currently is for e.g. PreservedSpecimen and FossilSpecimen)?
  2. If yes to the above, should it be replaced with rdf:type?
  3. If yes, does the dcmitype vocabulary https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/dcmitype/Collection work for our needs to document the evidence for an occurrence or determination?
  4. Separately (but just as important to me) should basisOfRecord / rdf:type be a required term in the IPT with a controlled vocabulary?

Apologies if this is off base. Just trying to determine where the decision points are to resolve this.


Jegelewicz commented 2 years ago

@tucotuco "BasisOfRecord - best evidence for an occurrence"

Jegelewicz commented 2 years ago

@m-hope "the pathway from organism at event to taxon at location in time" how the organism got identified as taxon.

Jegelewicz commented 2 years ago

Me - "What is this record about"

Jegelewicz commented 2 years ago

I believe we have determined that BasisOfRecord is an overloaded term that we are expecting to convey information about types of evidence, identification practices, suitability for use, and possibly more. Feel free to add to the list here or with use cases in the Wiki.