tdwg / material-sample

A Task Group of the Observations and Specimen Records (OSR) Interest Group

controlled vocabulary for materialSampleType #24

Closed baskaufs closed 2 years ago

baskaufs commented 2 years ago

As requested in the 2022-03-16 meeting, I have created a draft controlled vocabulary for the proposed materialSampleType term based on the existing specimen types. It can be viewed as a list of terms document and in tabular form.

I believe the decision was to start with the existing specimen types, with the option of adding other values if we could agree upon what they should be. The vocabulary is easy to expand by just adding more rows to the source CSV (linked above).

Additional things to be resolved:

baskaufs commented 2 years ago

reference #14

baskaufs commented 2 years ago

Ping @Jegelewicz

Jegelewicz commented 2 years ago

@baskaufs THANK YOU! We can take a deep dive tomorrow?

RogerBurkhalter commented 2 years ago

@baskaufs I will comment on the definition of FossilSpecimen as "A preserved specimen that is a fossil". I advocate removing the term "preserved". First, the process of fossilization is preservation in and of itself. It is also a "natural" process, not an anthropogenic process such as those described under most of dwc:preparations (also using the term "preserved"; many paleo workers take issue with fossils being in that term as well, since for fossils preparation has nothing to do with how the item is stored). The concept of SkinSpecimen or SkeletalSpecimen representing Skin or Skeletal preservation/preparation does not make sense.

There may be fringe instances where a recently living entity is naturally preserved, such as a desiccated mammal or bird, or a frozen Mammoth body that is not "turned to stone" in the way fossils are usually envisioned. Perhaps these fringe instances can be included in PreservedSpecimen with preparation type as naturally desiccated or naturally frozen. They do not represent fossils (although some would argue the Mammoth is a fossil).

Speaking of fossil, the examples include items that are not fossils but are instead examples of behavior. These are coprolites, gastroliths, and ichnofossils. These are all forms of ichnofossils and as such cannot (usually) be directly attributed to any particular species, and should perhaps be a form of specimens of a "preserved observation"?

cboelling commented 2 years ago

I would like to understand why, in order to specify values for dwc:MaterialSampleType a new concept scheme with newly defined resources (a.k.a terms) is preferred (including concept scheme infrastructure like name spaces, IRIs). At first glance it seems that what is informative about the newly minted resources can also be expressed with the existing terms, e.g. using http://rs.tdwg.org/dwc/terms/version/LivingSpecimen-2018-09-06 or its associated label ("Living Specimen") or adaptations thereof. Couldn't those be used as values?

Jegelewicz commented 2 years ago

Perhaps add a requested term?

environmentalSample

Jegelewicz commented 2 years ago

Also, GGBN will be concerned looking for "tissue".....

Jegelewicz commented 2 years ago

I will comment on the definition of FossilSpecimen as "A preserved specimen that is a fossil". I advocate removing the term "preserved".

I agree with removing preserved from this definition.

However, won't we have some FossilSpecimens that are also PreservedSpecimen? Like these fossil scutes prepared as thin sections? https://arctos.database.museum/guid/NMMNH:Paleo:16545

I guess this also means there will need to be a whole other term for the description of the material "scute"? Should we be looking at that here or passing that down to the next task group?

Jegelewicz commented 2 years ago

the examples include items that are not fossils but are instead examples of behavior. These are coprolites, gastroliths, and ichnofossils. These are all forms of ichnofossils and as such cannot (usually) be directly attributed to any particular species, and should perhaps be a form of specimens of a "preserved observation"?

I suggest we probably need a type for this in controlled vocabulary, but I also suggest "trace" rather than "PreservedObservation" which seems like it could also be used for a photograph. Trace could also cover things like scat, molds of footprints and such.

deepreef commented 2 years ago

However, won't we have some FossilSpecimens that are also PreservedSpecimen

I think there's a subtle distinction between "preserved" and "prepared" (or "curated"). When you think about it, every physical object is in some way preserved. There are varying degrees of the duration of preservation. Fossils, through mineralization, are preserved for millions of years. Specimens treated with formaldehyde and/or alcohol are preserved for (potentially) centuries. Tissue samples stored in DMSO are preserved for decades(?) A fresh carcass in an air-conditioned room is preserved for days or perhaps weeks.

I guess my point is that the word "preserved" is a bit meaningless and/or implied by the word "specimen". What I think people are actually interested in is the current state and history of "preparations". That is, what sorts of actions have been performed on a physical thing? Some of these actions are intended to extend the duration of preservation (e.g., formalin, alcohol, etc.). Some of them are intended to allow examination or analysis (e.g., thin sections, tissue extractions for DNA, etc.)

I get that we want to distinguish "Preserved" Specimens from "Living" Specimens, but it would seem to me that the alternative of "Living" is actually "Dead", not "Preserved". Yeah, I know what we mean by "Preserved" specimen; but given that we're taking the time to completely restructure how these terms are applied, perhaps now is a good time to rethink how we parse out the different kinds of MaterialSample instances?

One option is to do as we have been doing, which is sort of "overload" a basic term like materialSampleType to capture clues about preservation method, living vs. dead, mineralized vs. actual biological material, whole organism vs. part of organism vs. aggregate of multiple organisms, vs. which part of an organism it is, etc. I worry that trying to capture all these disparate properties in some controlled vocabulary of terms squeezed into materialSampleType might make things more complicated, rather than more simple.

Maybe a good topic of discussion for today's chat would be "What parameters are we trying to represent in the values of materialSampleType?"

Jegelewicz commented 2 years ago

Maybe a good topic of discussion for today's chat would be "What parameters are we trying to represent in the values of materialSampleType?"

Added to the agenda!

smrgeoinfo commented 2 years ago

see https://github.com/baskaufs/msc/issues/1#issue-1210233616

smrgeoinfo commented 2 years ago

https://github.com/tdwg/material-sample/issues/24#issuecomment-1104068355 's question 'What parameters are we trying to represent in the values of materialSampleType?' is important. For a controlled vocabulary, it is very useful to have a clear definition of the use case for the vocabulary, its scope (biological samples, any material sample, Earth Materials....), what are the criteria for differentiating the terms, are the terms hierarchical, do the terms cover the scope (covering), can terms have overlapping meaning (Unique, unambiguous).

albenson-usgs commented 2 years ago

Perhaps add a requested term? environmentalSample

I think this is too broad. I would like to use the examples from https://github.com/tdwg/dwc/issues/40 e.g. Examples: envo:soil, envo:sediment, envo:saline water

I think being able to distinguish a soil sample vs. a saline water sample vs. a freshwater sample will be important to eDNA data providers.

Jegelewicz commented 2 years ago

'What parameters are we trying to represent in the values of materialSampleType?' is important. For a controlled vocabulary, it is very useful to have a clear definition of the use case for the vocabulary, its scope (biological samples, any material sample, Earth Materials....), what are the criteria for differentiating the terms, are the terms hierarchical, do the terms cover the scope (covering), can terms have overlapping meaning (Unique, unambiguous).

In the second meeting yesterday, we discussed this. Those present could see the need for thinking beyond the currently used "GBIF basisOfRecord" terms and @albenson-usgs suggested that we take a step back and start by creating a list of terms we think we might find or want to place in materialSampleType. So, I have started a Google Sheet and I would like everyone to think about what they might place in this vocabulary. Just add your terms to the bottom of "suggested vocabulary". We can then deduplicate the list and start categorizing to see if we can build a broader and more useful vocabulary. In addition, I think it would be helpful for each of us to think about the quote above. What do we expect from the vocabulary for this term?
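
The deduplication step could be as simple as the following Python sketch. It is only an illustration: the function name and the sample terms are made up, and it assumes that terms differing only in letter case or spacing count as duplicates.

```python
def deduplicate(terms):
    """Drop repeats that differ only in case or whitespace, keeping first-seen order."""
    seen = {}
    for term in terms:
        cleaned = " ".join(term.split())   # trim and collapse internal whitespace
        key = cleaned.casefold()           # case-insensitive comparison key
        if key and key not in seen:
            seen[key] = cleaned            # keep the first spelling encountered
    return list(seen.values())


# Hypothetical suggestions, as they might be typed into a shared sheet.
suggested = ["tissue", "Tissue ", "whole organism", "whole  organism", "soil"]
unique_terms = deduplicate(suggested)
```

Near-synonyms (for example "scat" and "feces") would still need a human decision; this only removes trivial repeats before categorizing.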

baskaufs commented 2 years ago

Refer to existing draft controlled vocabulary for organism parts here and organized by organism group here. The terms are intended to be used as values for ac:subjectPart, which indicates the part of the organism being photographed, but it could generally refer to organism parts in other contexts.

Jegelewicz commented 2 years ago

Refer to existing draft controlled vocabulary for organism parts here and organized by organism group here.

Added to Google Sheet

baskaufs commented 2 years ago

@cboelling To respond to your question

I would like to understand why, in order to specify values for dwc:MaterialSampleType a new concept scheme with newly defined resources (a.k.a terms) is preferred (including concept scheme infrastructure like name spaces, IRIs). At first glance it seems that what is informative about the newly minted resources can also be expressed with the existing terms, e.g. using http://rs.tdwg.org/dwc/terms/version/LivingSpecimen-2018-09-06 or its associated label ("Living Specimen") or adaptations thereof. Couldn't those be used as values?

The controlled vocabulary as I generated it follows the conventions that have been established within TDWG for ratified controlled vocabularies. One of the goals of that system is to eliminate longstanding confusion between term labels, IRI local names, and the controlled value strings that people should use in spreadsheets or tables. These three things have been badly conflated in the past. That's a problem because TDWG is an international organization and labels are (or should be) available in many languages, whereas there should be a single controlled value string used by everyone as a value for the property. You can see examples under the three existing controlled vocabularies within Darwin Core (for establishmentMeans, pathway, and degreeOfEstablishment), available from the top navigation bar on the Darwin Core website. The intent is for this vocabulary to follow the same pattern. These controlled vocabularies now have some label translations available at https://tdwg.github.io/rs.tdwg.org/ .

The IRI local names are intentionally opaque so that no one is tempted to try to use them as controlled value strings. But since there are IRIs and JSON-LD using them, one can encode SKOS relationships among concepts (such as skos:broader) in a machine-readable way. See https://tdwg.github.io/rs.tdwg.org/cvJson/pathway.json for example.
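
To make the three-way distinction concrete, here is a minimal Python sketch. The namespace, the local names `m000` and `m001`, the top concept "Specimen", and the Spanish label are all made up for illustration; none of them come from a TDWG vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical namespace, for illustration only.
NS = "http://example.org/materialSampleType/values/"


@dataclass
class Concept:
    iri: str                     # opaque identifier; not meant to be typed into data
    controlled_value: str        # the one string everyone uses as the property value
    labels: Dict[str, str] = field(default_factory=dict)  # language tag -> label
    broader: Optional[str] = None  # IRI of the skos:broader concept, if any


CONCEPTS = [
    Concept(NS + "m000", "Specimen", {"en": "Specimen"}),
    Concept(
        NS + "m001",
        "FossilSpecimen",
        {"en": "Fossil Specimen", "es": "Espécimen fósil"},
        broader=NS + "m000",
    ),
]

# Data tables carry the controlled value string; everything else is looked up.
BY_VALUE = {c.controlled_value: c for c in CONCEPTS}


def label_for(controlled_value, lang="en"):
    """Return the label in the requested language, falling back to English."""
    concept = BY_VALUE[controlled_value]
    return concept.labels.get(lang, concept.labels["en"])
```

A spreadsheet would then hold `FossilSpecimen` regardless of the user's language, while a Spanish interface could display `label_for("FossilSpecimen", "es")`.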

jbstatgen commented 2 years ago

Coming from GRSciColl and working on describing "Institutions" and "Collections", I added a couple of terms to the end of the list, as well as an additional sheet with the two existing vocabularies for the fields/properties describing "Collection": "Content types" and "Preparation types".

Neither input field works. That is, a CSV download of the information stored in GRSciColl shows that both fields are generally empty, or that users add information that doesn't make a lot of sense when compared with the rest of the entered information. Obviously they need to be redesigned. Nevertheless, they can provide an idea and perspective about dimensions associated with describing MaterialSampleType and granularity.

For further background, since there is a bit of overlap too, this is my proposal for how to describe "Institution": GRSciColl_Vocabs. Comments are very much welcome, though since they are out of scope here, please send them to me directly.

jbstatgen commented 2 years ago

Refer to existing draft controlled vocabulary for organism parts here and organized by organism group here. The terms are intended to be used as values for ac:subjectPart, which indicates the part of the organism being photographed, but it could generally refer to organism parts in other contexts.

@baskaufs ... no fungi ... (eg. thallus, fruiting body, vegetative reproductive structure, mycelium, symbiont)

baskaufs commented 2 years ago

@jbstatgen

no fungi ... (eg. thallus, fruiting body, vegetative reproductive structure, mycelium, symbiont)

We begged people to participate in this task group and no fungi experts joined. So we only have values for organism groups where someone suggested them.

The controlled vocabulary is intended to be extensible, so we'd be happy to add fungi if someone will suggest the terms, test with images, etc.

baskaufs commented 2 years ago

@dr-shorthair

I'd suggest being more clear about which strings are keys, in what context; and which strings are being stored as 'annotations' related to some prior context.

I don't understand what you are saying. Please refer to the governing specification, Sections 3.3.3.1 ("Controlled value") and 4.5.4 and offer suggestions on how they need to be clarified.

The approach taken there was a compromise between how concept metadata are described in "pure" SKOS thesauri and the actual practice within TDWG of simply using a certain plain text string as a value from a "controlled vocabulary".

dr-shorthair commented 2 years ago

Apologies - my comment was intended to be in the context of IDs. I'll try to find the thread I thought I was responding to. We can delete these bits of this conversation so that this issue does not have a confusing sub-thread.

jbstatgen commented 2 years ago

... no fungi ... (eg. thallus, fruiting body, vegetative reproductive structure, mycelium, symbiont) ... The controlled vocabulary is intended to be extensible, so we'd be happy to add fungi if someone will suggest the terms, test with images, etc.

@baskaufs What would it take to add the above terms to your vocabulary?

A) If it is a matter of the amount of information present in this overview and the first two links in your initial post, I could provide this for the above terms and learn along the way about how to construct and publish vocabularies correctly.

B) Though, there wouldn't be any testing and community agreement supporting the contributed terms. For that, the vocabularies need the mycologists and lichenologists eg. from the citizen science initiatives for fungi.

C) This is the Task Group you were mentioning. Your report for 2021 suggests that you are wrapping up and might not want to reopen the process.

Not sure where the balance in all of this is right now.

Jegelewicz commented 2 years ago

@dr-shorthair no worries - just copy and repost wherever you want to comment!

smrgeoinfo commented 2 years ago

I spent some time studying the draft controlled vocabulary (tabular form), and have some thoughts.... First, as a geologist and engineer, I don't know what a lot of the terms mean and didn't have time to look them all up, so this analysis is based on terms I think I understand.

Perhaps a next step here is looking for some more general categories to lump categories into a vocabulary with a manageable number of classes, say on the order of a 100 or so. And make them hierarchical. Maybe something like Organism > plant organism > plant organism part along one branch.

Factoring specimen type along the lines of, say, object type, material type, sampled feature, taxonomic class, anatomic class... would allow defining a smaller set of categories, and then allow users to build detailed vocabularies that map into combinations of those high-level categories.

RogerBurkhalter commented 2 years ago

@smrgeoinfo many of the terms you cite as adjectives are indeed individual bones (angular, articular, basibranchial, basioccipital, exoccipital, frontal) and may be important, especially for vertebrate paleontology where a complete skeleton is not found or only isolated bones are known. I do agree the list is painfully long but, as is, still incomplete with respect to all of the possible terms. The list I use in my CMS is hierarchical and has a "modifier" to handle adjectives like anterior partial, left lateral partial, etc., because not everything is complete in the paleo realm.

smrgeoinfo commented 2 years ago

So in a hierarchical vocabulary, one might have something like: whole organism > vertebrate organism > vertebrate body part > vertebrate bone > endochondral bone > basibranchial bone > Gymnura micrura basibranchial medial plate. For a TDWG materialSampleType vocabulary, the question is what is the useful level of granularity in this hierarchy; more detailed categorization would then fall in some free text field, or use a local, more granular vocabulary specific to some sub-community.
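
As a rough illustration of the granularity question, the chain above can be stored as child-to-parent links and a detailed local term rolled up to whatever depth a shared vocabulary settles on. This is a Python sketch; the function names and the choice of depth are assumptions, not something agreed in the task group.

```python
# Child -> parent links for the example chain given above.
PARENT = {
    "vertebrate organism": "whole organism",
    "vertebrate body part": "vertebrate organism",
    "vertebrate bone": "vertebrate body part",
    "endochondral bone": "vertebrate bone",
    "basibranchial bone": "endochondral bone",
    "Gymnura micrura basibranchial medial plate": "basibranchial bone",
}


def path_from_root(term):
    """Return the list of terms from the root down to the given term."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


def generalize(term, max_depth):
    """Return the ancestor at most max_depth levels below the root (root is depth 0)."""
    path = path_from_root(term)
    return path[min(max_depth, len(path) - 1)]


# A local, very granular value rolled up to a hypothetical shared depth of 3.
shared_value = generalize("Gymnura micrura basibranchial medial plate", 3)
```

The granular term itself could stay in a free-text field or a sub-community vocabulary, with only the rolled-up value going into materialSampleType.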

tucotuco commented 2 years ago

Rather than try to build the vocabulary for anatomical parts, I would recommend the use of a SKOS-ified version of UBERON, the construction of which could be scripted and updated at any time.
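
A rough idea of what such a script might look like, in Python. Everything here is a stand-in: the rows, the identifiers, and the `ex:` namespace are hypothetical, and reading the actual UBERON release files is not shown. The sketch only demonstrates turning (id, label, parent) rows into SKOS concepts serialized as Turtle text.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
EX_PREFIX = "@prefix ex: <http://example.org/anatomy/> ."   # hypothetical namespace

# Stand-in rows: (local id, English label, parent local id or None).
ROWS = [
    ("A0001", "skeletal element", None),
    ("A0002", "bone element", "A0001"),
]


def to_skos_turtle(rows):
    """Serialize rows as SKOS concepts in Turtle; labels must not contain quotes."""
    lines = [SKOS_PREFIX, EX_PREFIX, ""]
    for local_id, label, parent_id in rows:
        statements = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
        if parent_id is not None:
            statements.append(f"skos:broader ex:{parent_id}")
        lines.append(f"ex:{local_id} " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines)


turtle_text = to_skos_turtle(ROWS)
```

Because the output is regenerated from the source rows each time, rerunning the script after an upstream update would refresh the SKOS version.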

RogerBurkhalter commented 2 years ago

@tucotuco UBERON works for the living, not so well for the fossil groups. It is a great start and I will explore further.

albenson-usgs commented 2 years ago

Perhaps a next step here is looking for some more general categories to lump categories into a vocabulary with a manageable number of classes, say on the order of a 100 or so.

I want to make clear that when I suggested this task that is what I had intended would happen. In the How Did It Die Task Group this is what we did to come up with the vocabulary for causeOfDeath, see here where we have a full slate of what's currently in some of the databases for cause of death and then the lumping categories of Natural - abiotic, Natural - biotic, Anthropogenic, Unknown. I would hope we could get to a lumped list of 10 or so personally :-) We are going to overwhelm data providers if we make the list too long.
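
The lumping approach amounts to a lookup from whatever databases currently hold to a short list of categories. A Python sketch follows; the four lumped categories are the ones named above, but the raw values and their assignments are hypothetical examples, not the task group's actual mapping.

```python
from collections import Counter

# Hypothetical raw values mapped to the four lumped categories named above.
LUMP = {
    "predation": "Natural - biotic",
    "disease": "Natural - biotic",
    "storm": "Natural - abiotic",
    "vehicle collision": "Anthropogenic",
}


def lump(raw_value):
    """Map a raw database value to its lumped category; unmapped values are Unknown."""
    return LUMP.get(raw_value.strip().lower(), "Unknown")


# Values as they might appear in source databases (made up for this example).
raw_values = ["Predation", "vehicle collision", "storm ", "found dead", "disease"]
summary = Counter(lump(value) for value in raw_values)
```

Data providers would only ever choose among the short list, while the detailed source values remain available for anyone who needs them.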

baskaufs commented 2 years ago

@jbstatgen I've started a new issue https://github.com/tdwg/ac/issues/240 in the Audubon Core repository regarding fungal parts to avoid getting this one off the track. We can continue the discussion there.

Jegelewicz commented 2 years ago

The categories from GRSciColl Collection ContentType seem broad and relevant. Could these terms also be used as materialSampleType?

That may seem repetitive, but any given collection probably includes more than one of the ContentType(s), so allowing the addition of this "tag" to every record would seem potentially useful. However, they still seem oddly specific in some cases. How about the broader categorical terms?

Archaeological Biological Human Derived Earth Planetary Paleontological Record

Really, it seems like the broader terms belong with the collection description and the more detailed values with the individual records, but I could see it going either way...

smrgeoinfo commented 2 years ago

This mapping includes the GRSciColl terms.

jbstatgen commented 2 years ago

Really, it seems like the broader terms belong with the collection description and the more detailed values with the individual records, but I could see it going either way...

Wouldn't this be the perfect situation for an ontology, ie. a hierarchical classification? In that way one could automatically generate the aggregate of a collection's contents at any level.
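
As a small Python sketch of that kind of aggregation: if each record carries its path through the hierarchy, a collection summary at any level is just a count. The record paths below are hypothetical and reuse category names suggested earlier in this thread.

```python
from collections import Counter

# Hypothetical records, each tagged with its path from the top of the hierarchy.
RECORDS = [
    ("organism", "plant organism", "plant organism part"),
    ("organism", "plant organism"),
    ("organism", "vertebrate organism", "vertebrate body part"),
    ("organism", "vertebrate organism", "vertebrate body part"),
]


def summarize(records, level):
    """Count records by their category at the given level (0 is the top)."""
    return Counter(path[level] for path in records if len(path) > level)


by_level_1 = summarize(RECORDS, 1)
by_level_2 = summarize(RECORDS, 2)
```

Records that are not classified as deep as the requested level are simply left out of that level's counts.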

Archaeological Biological Human Derived Earth Planetary Paleontological Record

I like this high-level approach, though there are a couple of reasons why I would like to see the list of terms modified.

  1. In our field we are dealing mostly with things "Biological". Thus, basically any record could get a tag "Biological", which then isn't informative anymore. Should we go that high, the list would be, it seems:
     - Geological
     - Biological
     - Anthropogenic
     - [Record (what does "Record" refer to? Is that a subclass of Anthropogenic?)]

     In a hierarchical approach this could be Level 1, with Level 0 being "material sample" vs. "information artifact".

  2. Level 2 within "Biological" will be most informative for many of our use cases. Here I am suggesting:
     - Virology
     - Microbiology _(Would one want to split Bacteriology from Microbiology? That is, Bacteria, Archaebacteria versus the rest of all those evolutionary dispersed lineages of microorganisms?)_
     - Mycology
     - Zoology _(How important is an immediate split into invertebrates - vertebrates?)_
     - Botany
     - Paleontology _(human remains go into Anthropology, right?)_
     - Biomedical _(or any term referring to human biology - and yes, actually this is Zoology)_

  3. Level 2 within "Geology": a distinction between planetary vs. extraterrestrial seems to be of interest, though I'm not familiar with what the correct/widely used terms might be. For example:
     - Planetary/Terrestrial/Earth
     - Extraterrestrial, with WithinSolarSystem vs. ExtrasolarSystem

     Alternatively, would it be "Geology" vs. "Astronomy"?

  4. Level 2 within "Anthropogenic": for me this is anything made by humans. Also, the distinction between archaeology and anthropology doesn't seem to be clear-cut. E.g. is the https://en.wikipedia.org/wiki/Ahrensburg_culture down the road just outside town "Anthropology" or "Archaeology"? It's "a bunch of rocks in a circle and a couple of arrow tips" (Archaeology or cultural anthropology?). I'm not sure how many human remains/bones were found (Anthropology?), if any - though that seems to be dependent mostly on chance. Terms for a vocabulary might include:
     - Anthropology/Archaeology
     - Cultural Artifacts
     - Library/Literature

  5. "Record": Would users understand something like "Cultural Artifact" or rather "Information Artifact/digital object" under this term? If this refers to a digital object, then it should be removed here and moved as a subclass into "Information Artifact" - Digital Objects/DES records would go into "Information Artifacts", together with images, audio/video recordings, etc.?

jbstatgen commented 2 years ago

This mapping includes the GRSciColl terms.

@smrgeoinfo Could you please change the share settings for the file? Currently I can't access it and might not be the only one. Thanks a lot, Jutta

smrgeoinfo commented 2 years ago

Jutta, sorry! Permissions updated. Anyone with the link should be able to comment.

Jegelewicz commented 2 years ago

@smrgeoinfo can we just add this to the original file? I'd prefer to just have one.

smrgeoinfo commented 2 years ago

Done

albenson-usgs commented 2 years ago

I would like to add saline water, non-saline water?, soil, and sediment but I'm not sure where to add them to the document? They aren't necessarily database uses but I would see them as materialSampleTypes that eDNA collectors would want to use. Should I add them to both the database uses tab and the iSamples mapping tab?

Jegelewicz commented 2 years ago

I don't think it matters that they aren't currently in use - just add them to the database uses tab.

Jegelewicz commented 2 years ago

Level 2 within "Biological" will be most informative for many of our use cases. Here I am suggesting

Virology
Microbiology _(Would one want to split Bacteriology from Microbiology? That is, Bacteria, Archaebacteria versus the rest of all those evolutionary dispersed lineages of microorganisms?)_
Mycology
Zoology _(How important is an immediate split into invertebrates - vertebrates?)_
Botany
Paleontology _(human remains go into Anthropology, right?)_
Biomedical _(or any term referring to human biology - and yes, actually this is Zoology)_

But aren't these things really part of identification (with the exception of "Paleontology")? Would we be duplicating whatever is held in dwc:higherClassification?

A list (concatenated and separated) of taxa names terminating at the rank immediately superior to the taxon referenced in the taxon record.

While the terms in the list will not be found exactly in dwc:higherClassification, they can be inferred from there. Or are we to assume that any given dwc:MaterialSample may not have an associated dwc:Identification? If they do, how would this list be more informative than dwc:Identification plus dwc:higherClassification?
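
To illustrate the "can be inferred" point, here is a Python sketch that derives a discipline-style value from a dwc:higherClassification string. The kingdom-to-discipline table and the sample string are assumptions made up for the example, and it assumes the list is pipe-separated.

```python
# Hypothetical mapping from kingdom names to the discipline-style terms above.
KINGDOM_TO_DISCIPLINE = {
    "Plantae": "Botany",
    "Animalia": "Zoology",
    "Fungi": "Mycology",
}


def infer_discipline(higher_classification):
    """Return the discipline implied by the first recognized kingdom, or None."""
    names = [part.strip() for part in higher_classification.split("|")]
    for name in names:
        if name in KINGDOM_TO_DISCIPLINE:
            return KINGDOM_TO_DISCIPLINE[name]
    return None


# Illustrative value for dwc:higherClassification.
example = "Plantae | Tracheophyta | Magnoliopsida | Ranunculales | Ranunculaceae"
discipline = infer_discipline(example)
```

If this kind of inference is reliable, a separate discipline-level materialSampleType value adds little for identified material; it would mainly matter for samples with no dwc:Identification.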

Jegelewicz commented 2 years ago

Some other vocabs to consider:

ggbn:materialSampleType - https://rs.gbif.org/extension/ggbn/materialsample.xml

dwc:preparations - https://dwc.tdwg.org/terms/#dwc:preparations

ABCD KindOfUnit - https://terms.tdwg.org/wiki/abcd2:KindOfUnit (504 Gateway Time-out)

Jegelewicz commented 2 years ago

Closing as discussion has now moved to #26 #27 and #28