iSamples review - GitHub issues

Jegelewicz commented 2 years ago

Please post questions or discussion about the iSamples terms in this issue.

Jegelewicz commented 2 years ago

TDWG MS Task Group member,

If you missed our meetings on May 18, we had very productive discussions about materialSampleType. In brief, we determined that this single term is overloaded. This came about as a result of an explanation of the three terms that the iSamples group has worked out for describing the "type" of sample. These terms are as follows:

- specimenType
- materialType
- sampledFeatureType

Each of these terms comes with a controlled vocabulary which can be reviewed in the iSamplesVocabularies tab of our brainstorming document.

Due to the Digital Data Conference this week and SPNHC June 5-10, I am not able to host any working hours this month, but I challenge each of you to come up with any material sample that cannot be appropriately described using the three iSamples terms and their controlled vocabularies. If you do find something that cannot be defined using this scheme, please offer additional controlled vocabulary for the term in question OR define a new term if one is needed along with any controlled vocabulary the new term might use. Add your suggestions to the iSamples Issues tab of the brainstorming document.

Our hope is that we may incorporate the iSamples terms and vocabularies into Darwin Core as they are, but if modifications are needed we should suggest those as well. I look forward to our discussion of potential changes to the iSamples scheme and hope that the team who worked on this project will be amenable to making their work part of Darwin Core. If you have any questions about this - please post them in this Github issue.

TDWG 2022 abstracts are due July 1 and we have been asked to submit a talk for the Information session about late-stage Task Group submissions of standards additions symposium. I have a document started here and plan to discuss the abstract at our meeting on June 15. (Note - I haven't started the abstract yet - all that is in the document is information about the call for abstracts and the symposium).

Thank you to everyone for showing up and keeping this task force moving forward!

Adios,

Teresa J. Mayfield-Meyer

dr-shorthair commented 2 years ago

Nice work @Jegelewicz

Jegelewicz commented 2 years ago

Google doc convo between Jutta and myself

(Jutta) Is it really critical that terms are used exactly the same across fields and communities? E.g., I just had the issue with “thallus” in AC: it is actually a very different, non-homologous structure in algae, liverworts, fungi and lichens. However, within a given search/work context the term and its use are very specific.

[Teresa] - We handle this in Arctos with a comprehensive definition. “For X, this means Y and for Z this means a”. It also serves as a kind of warning - don’t just pick all the “thallus” or you will get things you don’t want. To get what you want, you will need to search “thallus” plus something else (e.g. taxa) as well.

(Jutta) If you have an iterative process, with somebody pointing out that you aren’t following the overall definition used for data sharing, then people not reading the definitions will over time be introduced to it by fellow data users and the use of the definition will be more widespread and harmonized (idealistically)

(Jutta) Left-most column “vocabulary suggestions”: this reminds me of the vocabularies currently defined within AC (at a very high level, e.g. stem, leaf, …). It might make sense to combine these efforts.

[Teresa] “currently defined within AC” - agree, but the list is severely lacking for a bunch of groups. At some point, it may make sense to incorporate AC terms as controlled vocabulary (or to use their term as it is) but for now it isn’t broad enough (and may never be since it is meant for things one can label from an image - not sure how “liver” would fit there….)

(Jutta) Yes, the AC vocabs are very small currently, though presumably they will grow. “Liver”: a photo or drawing of a liver would be the AC equivalent, wouldn’t it? Thanks Teresa!

Jegelewicz commented 2 years ago

My thoughts on using iSamples scheme

With regard to finding “tissue” - how do people find this now? Only via GGBN? As I see it, tissues can be PreservedSpecimen or MaterialSample and those can currently be used somewhat interchangeably. I maintain that preservation dictates what is “tissue” and what is not and the answer may depend upon what the “tissue” will be used for. Also note that ggbn:materialSampleType is a free text field, equivalent to the huge list of terms we have in our vocabulary suggestions column of the brainstorm document.

With regard to “fossil” - I think allowing for the use of pipe separated terms in specimenType could allow for fossil | organism part, but is that necessary? Isn’t a fossil by definition an organism part? I kinda dislike the fossil term, but people seem to need it (specimenType = organism part and materialType = Rock is a fossil, no?) Don’t want fossils, then only look for stuff with materialType = Organic material?
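To make the pipe-separated idea concrete, here is a minimal sketch in Python. The records and the helper names (`split_terms`, `has_term`) are made up for illustration; only the field names and the term values come from this discussion.

```python
# Hypothetical records; field names follow the three iSamples terms discussed here.
records = [
    {"id": "A", "specimenType": "fossil | organism part", "materialType": "Rock"},
    {"id": "B", "specimenType": "organism part", "materialType": "Organic material"},
    {"id": "C", "specimenType": "organism part", "materialType": "Rock"},
]


def split_terms(value):
    """Split a pipe-separated field into a set of trimmed, lower-cased terms."""
    return {part.strip().lower() for part in value.split("|") if part.strip()}


def has_term(record, field, term):
    """True if the given term is one of the pipe-separated values in the field."""
    return term.lower() in split_terms(record.get(field, ""))


# Records explicitly flagged as fossil via the pipe-separated specimenType.
explicit_fossils = [r["id"] for r in records if has_term(r, "specimenType", "fossil")]

# The "don't want fossils" search: keep only organic material.
organic_only = [r["id"] for r in records if has_term(r, "materialType", "Organic material")]
```

Record C is the interesting case: it is a fossil only under the implicit reading (organism part + Rock), so a search on the explicit "fossil" value would miss it while the materialType search would still exclude it.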

One thing that seems to be missing from the SampledFeatureType list are terms related to biological features. For instance, the “hindgut content” term is sampled from what? An organism? A whole organism is a sample of a population? A taxon?

Jegelewicz commented 2 years ago

For fun, I added a term to the iSamples sampledFeatureType column = "Biological entity" to use when the specimenType = Organism part and I think that works - although I suppose that could also just be left blank as "biological entity" is a bit repetitive of "organism part".

Otherwise - I really do think this scheme is workable, with the addition of a free text dwc:materialSampleType which allows everyone to detail exactly what it is they have. But I feel like I have only had this conversation with myself and that the work of this task group has stalled and I don't know how to get it up and running again. I am ready to propose the new definition for dwc:MaterialSample, the new term dwc:materialSampleID, and deprecation of dwc:FossilSpecimen, dwc:LivingSpecimen, and dwc:PreservedSpecimen so that we have something to present at TDWG. @tucotuco @baskaufs @stanblum any advice?

jbstatgen commented 2 years ago

@Jegelewicz in response to https://github.com/tdwg/material-sample/issues/25#issuecomment-1198630351

With regard to finding “tissue” - how do people find this now? Only via GGBN?

Do collections publish their digitized "specimens" to GBIF and their digitized tissue metadata for samples in their cryogenic or fluid or ... collections to GGBN? If so, do all "tissue" have "specimens" associated, so that all occurrences are aggregated in one place, i.e. GBIF? Thinking about it, it seems that all "tissue" should have at least a digital voucher, whose metadata describing the occurrence/event go to GBIF.

As I see it, tissues can be PreservedSpecimen or MaterialSample and those can currently be used somewhat interchangeably. I maintain that preservation dictates what is “tissue” and what is not and the answer may depend upon what the “tissue” will be used for.

It's also my impression that the intended use seems to define what is called/labeled "tissue" or not.

Strangely, I don't think about the wood blocks in a xylotheque as "tissue" despite the fact that extraction of ancient DNA and determination of isotopes might work quite well. They simply don't seem to be intended to be used in "this" way. As with other "specimens", e.g. herbarium sheets, their main use seems to be "preservation" and storage. Is the main difference between "specimen" and "tissue" that tissues are expected to be destructively sampled during their use?

PreservedSpecimen and MaterialSample are quite distinct concepts in my mind. All physical specimens (including tissue) are material and thus have a "MaterialSampleType", see the iSample approach.

Actually, all samples in all collections are "preserved", even the (carefully) dried wood block in a xylotheque and the tissue in cryogenic or fluid storage. For me the questions concerning "preserved" relate to its scope and likely the distinction between "preservation" and "conservation". Viable seeds stored in a seedbank certainly are material, but are they also preserved? Or conserved? Or both? Is a microbial strain in a culture collection that needs to be moved to new medium at intervals (since it can't be cryo-"preserved") preserved? Individuals in zoological and botanical gardens? The distinction between collections with preserved dead specimens and living collections seems to be murky.

However, of great practical and scientific interest is the type of preservation. Thus, while PreservedSpecimen might not make complete sense, PreservationType certainly is of high interest.

Are there fields/concepts/terms for conservation types, eg. in DwC?

Regarding the scope of PreservedSpecimen, I will argue that its scope extends to InformationArtifacts in addition to MaterialSamples. Digital images, video or audio recordings, as well as DNA sequences and results of biochemical analyses need to be "preserved" as much as physical specimens. That is, digital files need a backup and archiving strategy, maintenance routines (eg. checking for integrity and readability) and at intervals potentially transformation into new formats for continued accessibility.

With regard to “fossil” - I think allowing for the use of pipe separated terms in specimenType could allow for fossil | organism part, but is that necessary? Isn’t a fossil by definition an organism part? I kinda dislike the fossil term, but people seem to need it (specimenType = organism part and materialType = Rock is a fossil, no?) Don’t want fossils, then only look for stuff with materialType = Organic material?

Not a paleo-person here: aren't quite a lot of fossils actually not "rock" but organic matter? Certainly the shells in "Muschelkalk" are material of organic origin and might have still parts embedded that are organic. [at least I seem to remember that there were real shells, not only prints and such]

One thing that seems to be missing from the SampledFeatureType list are terms related to biological features. For instance, the “hindgut content” term is sampled from what? An organism? A whole organism is a sample of a population? A taxon?

Are "feces" parts of organisms? eDNA I feel is an organism part that has been shed (like feces and skin fragments, hair, hair balls, shed snake skins, ...). Gut content and microbiome might be considered as cases of vouchers for species interactions.

jbstatgen commented 2 years ago

For fun, I added a term to the iSamples sampledFeatureType column = "Biological entity" to use when the specimenType = Organism part and I think that works - although I suppose that could also just be left blank as "biological entity" is a bit repetitive of "organism part".

mmh, maybe it would help to make the iSamples scheme easier for me.

Otherwise - I really do think this scheme is workable, with the addition of a free text dwc:materialSampleType which allows everyone to detail exactly what it is they have. But I feel like I have only had this conversation with myself and that the work of this task group has stalled and I don't know how to get it up and running again. I am ready to propose the new definition for dwc:MaterialSample, the new term dwc:materialSampleID, and deprecation of dwc:FossilSpecimen, dwc:LivingSpecimen, and dwc:PreservedSpecimen so that we have something to present at TDWG. @tucotuco @baskaufs @stanblum any advice?

Now that the taxonomic classification has been differentiated and removed from the MaterialSample scope, which happened during our last meeting for me, I feel that I am only now starting to work my way into the gist of the topic. While I think that the iSamples approach is the right way to go, I'm not sure I completely understand the concepts, their delimitation and vocabularies used within the iSamples scheme. For example, there seems to be quite a lot of ecology/environment/(anthropogenic) origin in the mix that is not clearly differentiated. Though that might be just me.

What I wonder is what collection contact points who are interested and responsible for entering and maintaining collections' records in GRSciColl will think of the iSamples approach for describing the materials in their collections, ie. using the approach for giving an overview over the collections.

Jegelewicz commented 2 years ago

Do collections publish their digitized "specimens" to GBIF and their digitized tissue metadata for samples in their cryogenic or fluid or ... collections to GGBN?

I cannot answer for all collections - I do know that for collections in Arctos we have to create a separate "GGBN occurrence core" on the IPT to get the data in an appropriate format.

If so, do all "tissue" have "specimens" associated, so that all occurrences are aggregated in one place, i.e. GBIF?

Again, I cannot answer for all and we definitely publish "specimens" to GBIF that only consist of "tissue". I know that "occurrences" at GGBN involve A LOT of duplicates as any given "GBIF occurrence" might contain multiple "tissues". The whole thing is a bit of a mess if you want my personal opinion.

Jegelewicz commented 2 years ago

Is the main difference between "specimen" and "tissue" that tissues are expected to be destructively sampled during their use?

I do not have a good answer for this - which is why I said what I said - I think the meaning of "tissue" depends upon the user of the term.

Jegelewicz commented 2 years ago

Not a paleo-person here: aren't quite a lot of fossils actually not "rock" but organic matter? Certainly the shells in "Muschelkalk" are material of organic origin and might have still parts embedded that are organic. [at least I seem to remember that there were real shells, not only prints and such]

I agree - I cataloged an entire "fossil" collection of Pleistocene mammal material from caves that is in no way mineralized...

Jegelewicz commented 2 years ago

Are "feces" parts of organisms?

I would categorize feces as "traces", along with footprints. Traces are evidence, but not a "part" of the organism. Feces may contain the organism's DNA, but it is mostly other organic material and there is the potential for contamination.

Jegelewicz commented 2 years ago

What I wonder is what collection contact points who are interested and responsible for entering and maintaining collections' records in GRSciColl will think of the iSamples approach for describing the materials in their collections, ie. using the approach for giving an overview over the collections.

But does this make sense? A collection could house many combinations of these terms - I feel like trying to define an entire collection using three terms (we try to do this now with one!) seems unhelpful? But maybe I am over-thinking it.

smrgeoinfo commented 2 years ago

Fossil is tricky, as noted. First, the scope note for fossil from the decision tree diagram: Fossilized remains or trace of one or more organisms. Fossilization implies replacement of material by new phases, along with loss of most organic material. 'Fossil' might overlap with ‘piece of solid material’ or Aggregation (if a collection of fossils from a single source).

My question for, e.g., the fossil molluscs in the Muschelkalk would be: can you analyze the isotopic composition of the material to learn about the environment when the molluscs were alive? If so, they are indeed still the same as sea shells you can pick up on the beach today; if not, they are fossilized. The object is a fossil; the material is (now most likely) calcite (a mineral) (it was probably originally aragonite, a CaCO3 polymorph).

A large class of fossils is trace fossils; another big group is casts and molds. Neither of these contains any material derived from the original living creature, and in general the material will be rock, but the object is still a fossil.

smrgeoinfo commented 2 years ago

Question: how should we categorize things like mollusc shells that are not fossil? Is it specimen type: organism part, material type: Biogenic non organic material?

smrgeoinfo commented 2 years ago

Feces: specimen type: organism product (good fit); material type: (I'm no expert here...) organic material?

smrgeoinfo commented 2 years ago

I agree that categorizing a collection using the vocabulary would probably require using multiple terms on each facet (unless it is a very homogeneous collection). The vocabulary is really scoped to the individual specimen.
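As a rough illustration of what "multiple terms on each facet" could look like, the sketch below rolls specimen-level values up into a per-facet tally for a collection. The records and the `summarize_collection` helper are hypothetical and not part of any iSamples or GRSciColl tooling.

```python
from collections import Counter

# Hypothetical specimen-level records from a single collection.
specimens = [
    {"specimenType": "organism part", "materialType": "organic material"},
    {"specimenType": "organism part", "materialType": "rock"},
    {"specimenType": "fossil", "materialType": "rock"},
]


def summarize_collection(records, facets=("specimenType", "materialType")):
    """Return, per facet, a Counter of the terms used by the individual specimens."""
    return {
        facet: Counter(r[facet] for r in records if r.get(facet))
        for facet in facets
    }


summary = summarize_collection(specimens)
```

A collection-level description then becomes a list of terms (optionally with counts) per facet rather than a single value.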

smrgeoinfo commented 2 years ago

I'm not seeing the problem with 'tissue'. Wouldn't it be specimenType: organism part, materialType: organic material? I think @Jegelewicz's suggestion that we need a 'biological entity' as a sampled feature type is a good one, and it would be applicable for tissue.
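A small sketch of how such a record could be checked against the three facets. The term lists below are partial and illustrative: they contain only values named in this thread, not the full iSamples vocabularies, and "biological entity" is the proposed addition, not a confirmed term. The function name `invalid_facets` is made up.

```python
# Partial, illustrative term lists (values named in this thread only).
VOCAB = {
    "specimenType": {"organism part", "organism product", "fossil"},
    "materialType": {"organic material", "rock", "biogenic non organic material"},
    "sampledFeatureType": {"biological entity"},  # proposed above, not confirmed
}


def invalid_facets(record):
    """Return the facet names whose value is not in the illustrative term lists."""
    return sorted(
        facet
        for facet, allowed in VOCAB.items()
        if record.get(facet, "").strip().lower() not in allowed
    )


# "Tissue" expressed with the three facets, as suggested in the comment.
tissue = {
    "specimenType": "organism part",
    "materialType": "organic material",
    "sampledFeatureType": "biological entity",
}

# A record that puts the free-text word "tissue" into a controlled facet.
free_text = dict(tissue, specimenType="tissue")
```

Here `invalid_facets(tissue)` returns an empty list, while `invalid_facets(free_text)` flags `specimenType`, which is the kind of value that would belong in a free text field instead.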

baskaufs commented 2 years ago

Just back from holiday and responding to this comment:

Google doc convo between Jutta and myself

(Jutta) Is it really critical that terms are used exactly the same across fields and communities? E.g., I just had the issue with “thallus” in AC: it is actually a very different, non-homologous structure in algae, liverworts, fungi and lichens. However, within a given search/work context the term and its use are very specific.

[Teresa] - We handle this in Arctos with a comprehensive definition. “For X, this means Y and for Z this means a”. It also serves as a kind of warning - don’t just pick all the “thallus” or you will get things you don’t want. To get what you want, you will need to search “thallus” plus something else (e.g. taxa) as well.

(Jutta) If you have an iterative process, with somebody pointing out that you aren’t following the overall definition used for data sharing, then people not reading the definitions will over time be introduced to it by fellow data users and the use of the definition will be more widespread and harmonized (idealistically)

(Jutta) Left-most column “vocabulary suggestions”: this reminds me of the vocabularies currently defined within AC (at a very high level, e.g. stem, leaf, …). It might make sense to combine these efforts.

[Teresa] “currently defined within AC” - agree, but the list is severely lacking for a bunch of groups. At some point, it may make sense to incorporate AC terms as controlled vocabulary (or to use their term as it is) but for now it isn’t broad enough (and may never be since it is meant for things one can label from an image - not sure how “liver” would fit there….)

(Jutta) Yes, the AC vocabs are very small currently, though presumably they will grow. “Liver”: a photo or drawing of a liver would be the AC equivalent, wouldn’t it? Thanks Teresa!

I think part of the way out of this complication is to consider the SKOS outlook on concepts as values for controlled vocabularies. The concept is an abstract thing that we describe with a definition and make easier for people to recognize with a label. That is a somewhat different outlook from considering values for controlled vocabularies to be particular strings. In the case of "thallus", the problem arises because of the limitation of trying to use the same controlled value string to represent what is actually several different concepts.

As I understand it, the SKOS approach would be to define a separate concept for each non-homologous kind of "thallus" (i.e. context-specific definitions). For the English labels, I would use something like "thallus (algae)", "thallus (fungi)", "thallus (lichen)", etc. Note that I said labels, not controlled value strings -- we would not expect people to put these label strings into spreadsheets because there would be too many possible variants (with and without parentheses, differing spaces, etc.). The labels are for humans to see and choose, not for them to type into spreadsheets. In the pure SKOS world, we would refer to the different concepts unambiguously by their unique IRIs.

TDWG recognizes the fact that most of the non-SKOS world expects to be able to use a controlled string value and not an IRI, so for each concept in a TDWG controlled vocabulary, we designate both an IRI and a controlled value string that is unlikely to be typed incorrectly to denote the concept. The convention up to this point is to use lower camelCase for the controlled value strings (see for example the controlled values listed in http://rs.tdwg.org/dwc/doc/pw/). So if we defined a separate "thallus" concept for each group, those concepts would probably have controlled value strings like "thallusAlgae", "thallusFungi", "thallusLichen", or something like that where it would be clear exactly how they should be typed.
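To illustrate the three-way distinction between IRI, label, and controlled value string, here is a minimal Python sketch. The IRIs use a placeholder `example.org` namespace and made-up local names, and deriving the controlled value string from the label is only an assumption for the example; in practice the strings would be assigned by the vocabulary maintainers.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    iri: str               # unambiguous identifier for machines
    label: str             # human-readable label shown in user interfaces
    controlled_value: str  # string people type into spreadsheets


def to_lower_camel_case(label):
    """Turn a label such as 'thallus (algae)' into 'thallusAlgae'."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    if not words:
        return ""
    first, rest = words[0].lower(), words[1:]
    return first + "".join(w[:1].upper() + w[1:].lower() for w in rest)


BASE = "http://example.org/part/"  # placeholder namespace, not a real TDWG IRI

labels = ["thallus (algae)", "thallus (fungi)", "thallus (lichen)"]
concepts = [
    Concept(f"{BASE}p{n:04d}", label, to_lower_camel_case(label))
    for n, label in enumerate(labels, start=1)
]

# Lookup table from controlled value string to concept.
by_value = {c.controlled_value: c for c in concepts}


def resolve(value):
    """Map a typed controlled value string to its concept, or None if unknown."""
    return by_value.get(value.strip())
```

With this setup `resolve("thallusLichen")` finds the lichen concept, while `resolve("thallus (lichen)")` returns `None`, reflecting the point that labels are for display and are not accepted as data values.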

In the case of the Audubon Core subjectPart values, we tried to take the practical approach in defining parts, reusing the same concept for parts that wouldn't necessarily be evolutionarily homologous if they have the same function and location on the organism, e.g. wings on birds, bats, and insects; legs on insects and vertebrates. This is really for convenience in sorting and searching and to keep the vocabularies small and uncomplicated enough that they are usable by non-experts. If someone wanted to look at all of the pictures of "wings" in a collection, they could get both bird and insect wings if they weren't screening by taxonomic group.

If this approach turned out to be problematic, we could split the concepts and then relate them using SKOS properties like skos:closeMatch. So whether "thallus" should be a single concept applied across all of the groups mentioned above or four separate concepts would need to be a judgement made by experts on those organismal groups. In the case of the draft vocabulary, the existing terms were ones that the assembled task group members felt comfortable defining. If we felt there wasn't adequate expertise within the group, we passed over minting the terms for now.
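For anyone who wants to see what relating split concepts might look like in data, here is a sketch using plain Python tuples as subject/predicate/object triples. The `ex:` identifiers are hypothetical; the `skos:closeMatch` IRI is the real SKOS property, which SKOS defines as symmetric but not transitive.

```python
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"

# Hypothetical split "thallus" concepts related by skos:closeMatch.
triples = {
    ("ex:thallusAlgae", SKOS_CLOSE_MATCH, "ex:thallusFungi"),
    ("ex:thallusFungi", SKOS_CLOSE_MATCH, "ex:thallusLichen"),
}


def close_matches(concept, triples):
    """Concepts directly linked by closeMatch, read in both directions.

    Links are not followed transitively, matching the SKOS semantics.
    """
    found = set()
    for subject, predicate, obj in triples:
        if predicate != SKOS_CLOSE_MATCH:
            continue
        if subject == concept:
            found.add(obj)
        elif obj == concept:
            found.add(subject)
    return found
```

A search for "wings"-style convenience queries could then expand one concept to its close matches, while a stricter search would use the single concept only.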

It is correct that the ac:subjectPart controlled vocabularies are intended to be extensible. We hope that we can add terms for as many organism groups as possible once we have sufficient expertise and user testing on images in the wild. Fairly early on, the task group decided to pass over terms for internal organs, cross sections, etc. because that would be expanding the scope way beyond what we thought we could handle and because we wanted to mint only values that we thought people would actually use with images in their collections. Therefore, if there were collections that had many images of internal organs from dissections, tissue samples, microscopic cross sections, etc. and they wanted to characterize them using ac:subjectPart, we would add terms necessary to describe views of these parts. But as of the time when the task group was operating, we didn't have any examples or participating task group members working with those kinds of images.

Wearing my Technical Architecture Group (TAG) hat, I would hope that if these subjectPart values could be reused in contexts other than for characterizing views of images, that would be better than developing a separate, parallel controlled vocabulary that duplicated what's already in the Audubon Core vocabularies. But the existing values were designed to satisfy the use cases that the task group had when it was doing its work, and at that time it was only to describe views in images. So it's a matter of discussion to determine if they are extensible enough to be used in the MaterialSample context as we are discussing here.

smrgeoinfo commented 2 years ago

+1 on distinguishing a concept from the labels used to communicate that concept to users. The concept is a mental construct, and the only way we have to communicate it in a computer system is via a clear, unambiguous, logically coherent definition; for computer systems we need to use the IRI to identify the concept; in user interfaces we can use the labels for the concept that are appropriate to the users for that human interface.

I don't think this really impacts the proposed material, specimen and sampling feature type vocabularies, except that there might be alt labels for different user groups. The trick is then to somehow communicate the context in which each alt label is appropriate...
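One possible way to carry that context, sketched in Python: key each alt label by the audience it is meant for and fall back to the preferred label. The IRI, the audience names, and the alt labels here are all invented for illustration and are not taken from the iSamples vocabularies.

```python
# Hypothetical concept record; the IRI, audiences, and alt labels are invented.
concept = {
    "iri": "http://example.org/specimentype/organismpart",
    "prefLabel": "organism part",
    "altLabels": {
        "genomics": "tissue",
        "botany": "plant part",
    },
}


def display_label(concept, audience=None):
    """Pick the alt label for the given audience, else the preferred label."""
    return concept["altLabels"].get(audience, concept["prefLabel"])
```

The stored data value stays the IRI (or controlled value string) in every case; only the label shown in the interface changes with the audience.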

baskaufs commented 2 years ago

For what it's worth, I received this today. There is a lot of stuff there from many sources. Not sure how directly usable it is, but it may have aggregated something that's useful in this context:

Dear users and specialists in semantics for ecology and biodiversity,

We are pleased to announce the release of the new edition 2.0 of the Biodiversity Thesaurus. It is the result of a complete revision of the previous version 1.2 (managed with VocBench3 editor), and is still fully bilingual. The 827 concepts are grouped in 82 collections, by semantic categories, thematic fields and EBV classes (essential biodiversity variables) and concern all terrestrial and aquatic ecosystems. Version IRI: http://data.loterre.fr/ark:/67375/BLH/2.0 ; permalink (SKOS): https://www.loterre.fr/biodiversite-2/

Compared to the 2020 version, it has been enriched by 212 concepts, 1000 alternative/synonymous terms and 2450 hidden terms, with a redesigned hierarchical structure, reducing the number of top-level concepts by half. Sources of definitions have been checked and updated, with additional notes for polysemous terms or emerging concepts. Alignments (made with OnAGUI) were completed with recent versions of AGROVOC (2022) and GEMET (2021) and achieved with two new semantic resources: ENVTHES thesaurus and ENVO ontology. The thesaurus V.2.0 is freely browsable and downloadable (SKOS-RDF/XML, SKOS-Turtle, CSV, PDF) on Loterre platform (Linked open terminology resources) with multilingual Skosmos UI displaying groups/collections. It is made available (BIODIVTHES) on AgroPortal (English search only) and LifeWatch EcoPortal (English/French search), both showing hidden labels but not displaying collections. The metadata are also accessible via FAIRsharing, BARTOC and the I-ADOPT Catalogue of Terminologies. This new version has been recorded as a lexical/conceptual resource in the European Language Grid platform and added in Ortolang repository.

As a domain-specific terminology with an interdisciplinary scope, Biodiversity Thesaurus is positioned at an intermediate level between general multilingual vocabularies and specialized/monolingual databases vocabularies. For this reason, it can constitute a bridge of thematic/overarching concepts contributing towards harmonization and mapping of semantic assets on ecology and biodiversity, and thus fostering semantic annotation and data integration. Leveraging its bilingualism and synonymic richness, this could be done through alignment functionalities with other environmental terminologies in EcoPortal or AgroPortal. The hierarchical relationships between concepts are intended to be only generic, partitive or instantial (in accordance with ISO 25964), which brings it closer to a knowledge organization system or an onto-terminology.

Initially, Biodiversity Thesaurus is a linked open terminology resource facing semantic heterogeneity of biodiversity databases and information systems. It is intended to enrich data description, focusing on the basic key concepts of ecological and environmental entities or parameters related to biodiversity conservation, from gene level to global scale of Earth system. On the other hand, it is not addressing geographical, chemical, soil or taxonomical nomenclatures. An associated documentation has been archived in HAL: https://hal.archives-ouvertes.fr/hal-02907484v2 (in related file, the recently updated inventory of "Semantic and Terminology Resources in Environmental and Biodiversity Sciences").

The objective is therefore to promote semantic interoperability ("I" of FAIR) and knowledge representation through the use of a controlled vocabulary which itself complies with the FAIR principles (I2 FAIR sub-principle), as defined by the RDA I-ADOPT working group, AgroPortal (Ontology FAIRness evaluator), EcoPortal (S4BioDiv 2021) or by the eLTER network for the FAIRer description of environmental datasets. See also: https://hal.archives-ouvertes.fr/hal-03264850

Don't hesitate to send us your remarks if you notice any inconsistency or anomaly, so that we can take them into account in a future version and improve its FAIRness.

smrgeoinfo commented 2 years ago

Looks like a vocabulary worth looking at. The problem is that none of the links get me a SKOS version I can look at in Protégé. I'm not interested in an HTML interface.

Jegelewicz commented 2 years ago

The group believes this is a good start - need to propose the properties with definitions, then review and propose the vocabularies.

tdwg / material-sample

iSamples review #25