obi-ontology / obi

The Ontology for Biomedical Investigations
http://obi-ontology.org
Creative Commons Attribution 4.0 International

Extensions for Value Specifications #818

Open GullyAPCBurns opened 7 years ago

GullyAPCBurns commented 7 years ago

Based on the discussion on the call, I propose to add the following additional terms for value specifications in OBI based on previous work in OoEVV (ontology of experimental variables and values).

I propose the following additional children for 'value specification' (OBI_0001933)

  1. 'boolean value specification' - values are simply true / false
  2. 'file value specification' - values are data files (subtypes can specify file formats)
  3. 'tree value specification' - range of possible values form a hierarchical tree structure (e.g. taxon)
  4. 'ordinal value specification' - values are ranked scores (either named or unnamed)
  5. 'composite value specification' - values have defined substructure (e.g. a value specification for blood pressure would need both diastolic and systolic subfields)
  6. 'natural language value specification' - values are written in long form natural language (e.g. doctors notes on a diagnosis)

Possible subtypes for categorical value specification

  1. 'constrained value specification' - values are drawn from a set.
  2. 'name' - values denote a string-based name of an entity (this is so general, it should be a class)
  3. 'string identifier' - values denote a string-based identifier
  4. 'presence' - values denote presence or absence of an effect

Possible subtypes for ordinal value specification

  1. 'ranks with maximum ordinal value specification' - values are assigned with a defined maximum.

A subtype of this would be 1a. 'named rank ordinal value specification' - ranked values are given names and additional semantics. This will be very widely used. I think Dan's example hinted at this.

Possible subtypes for scalar value specification (note that measurements based on these values often do not have units).

  1. percentage - values are fractions expressed as percentages
  2. numeric ratio - values are derived from evaluating comparisons between two other data elements

Public-Health-Bioinformatics commented 7 years ago

Thanks for suggesting this ... sounds good. I note that these value specifications would/will/should fall in line with the types of variables at work. STATO did a pretty good job of outlining them under 'variable type':

[Screenshot, 2017-04-11: STATO 'variable type' class hierarchy]

I like that "ordinal variable" is a subclass of categorical variable. It seems to me that we can upgrade a set of categorical choices into an ordinal value specification by having the "a before b" "b before c" ... order relations defined between them. All those Likert scale surveys OBI could then describe!

A "boolean value specification", being true/false, would be about a dichotomous variable datum. But more generally, a "binary value specification", also about dichotomous variable datums, would describe any two-valued thing, long/short, hot/cold, sunny/shady etc. Indeed these can be mapped to boolean; but these may be mapped to some other system's ordinal variable (e.g. hot/lukewarm/cool/cold) depending on the analysis at hand.

One other case: a "count value specification", which isn't continuous in terms of real-world observations, so that suggests a "discrete variable", different from a continuous variable. But as mentioned elsewhere (discussion on APOLLO_SV I recall) a count may be estimated or averaged via continuous variable modelling (e.g. "I estimate 2.5 babies will be observed per couple") so they are connected mathematically.

Gully, your "Tree value specification" matches what I threw into GenEpiO: a "categorical tree specification" precisely so I could point to some branch of an ontology like NCBI Taxon. In GenEpiO's case this can also cover flat lists (a class and its immediate subclasses). I would say that ontology "is a" hierarchies are actually (if done well) giant categorical trees!

Lastly, I might step out on a limb and say that any tree value specification choice, considered on its own as a datum about some entity (in a survey, say: Did you eat seafood? Did you eat a crustacean? Did you eat shellfish?), is actually a feature, aka your "presence", or a binary value specification, in addition to its wider tree value specification context.

About numeric ratio - this came up over on PCO, discussing population density. In my opinion, to describe a ratio is to describe its numerator and denominator, literally, and this supports a description of its concept. Population density = population count (census) / designated area (survey). (Contrast this with ecological footprint: Designated area / population count). So I would formalize "has numerator" and "has denominator" relations to support this.
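As an illustration of the numerator/denominator description above, here is a minimal Python sketch. The class names `DataItem` and `Ratio`, their fields, and the numbers are all hypothetical; none of them are OBI or PCO terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    """A labelled quantity, e.g. a census count or a surveyed area."""
    label: str
    value: float
    unit: str


@dataclass(frozen=True)
class Ratio:
    """A ratio described by its parts (the 'has numerator' / 'has denominator' idea)."""
    label: str
    numerator: DataItem
    denominator: DataItem

    def value(self) -> float:
        return self.numerator.value / self.denominator.value

    def unit(self) -> str:
        return f"{self.numerator.unit}/{self.denominator.unit}"


# Made-up numbers, purely for illustration.
population = DataItem("population count (census)", 50_000, "person")
area = DataItem("designated area (survey)", 200.0, "km2")

density = Ratio("population density", numerator=population, denominator=area)
footprint = Ratio("ecological footprint", numerator=area, denominator=population)

print(density.label, density.value(), density.unit())
print(footprint.label, footprint.value(), footprint.unit())
```

Swapping the two parts is all that distinguishes the two concepts here, which is the point of naming the numerator and denominator explicitly.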

GullyAPCBurns commented 7 years ago

OK. Thanks for the feedback. I'm operating from a logical model for these elements derived from this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4474486/

In it, we make some important distinctions which I think are valid for discussion of how we should think about representing data in OBI.

  1. OBI doesn't talk about variables except in a confined scope:

    • study design dependent variable
    • study design controlled variable
    • study design independent variable

     All of these are defined in the context of a study design. I don't feel that this is cleanly defined, and it needs more work (which I'm hoping to try to do here).
  2. I differentiate between the following elements:
     (A) a quality that is a property of some entity in the world (as defined by PATO or some other ontology);
     (B) a variable, which can measure some value associated with that property (with specified units etc.), where the values are determined by
     (C) a scale, which describes the kinds of values that a specific variable can take.

I think this idea maps onto OBI by saying that Qualities are like any PATO property or any class that you want to assign a measurement to. OBI uses Measurement_Datum subclasses as variables and value_specification classes as scales.
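The quality / variable / scale distinction can be sketched roughly as three small Python types. This is only an illustration of the distinction: the class and field names are hypothetical, and the height example with a centimetre unit is an assumption rather than anything defined in OBI or PATO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quality:
    """(A) A property of some entity in the world, e.g. a PATO quality."""
    label: str


@dataclass(frozen=True)
class Scale:
    """(C) The kinds of values a variable can take (a value specification, in OBI terms)."""
    label: str
    kind: str                    # e.g. "scalar", "categorical", "ordinal"
    unit: Optional[str] = None   # many scales have no unit


@dataclass(frozen=True)
class Variable:
    """(B) Measures a value associated with a quality, constrained by a scale."""
    label: str
    measures: Quality
    scale: Scale


# Hypothetical example: a height variable on a scalar scale.
height = Quality("height")
length_cm = Scale("length in centimetres", kind="scalar", unit="cm")
height_var = Variable("measured height", measures=height, scale=length_cm)

print(height_var.label, "->", height_var.measures.label, "on", height_var.scale.label)
```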

I agree on the whole with the distinctions you make. I'm not opposed to making 'ordinal value specification' a subclass of 'categorical value specification'. I think you do need two classes for binary and boolean values ('male/female' is not the same as 'true/false'). I agree that it's probably important to make the distinction between continuous or discrete values (a little hard to tell)... and yes, trees come up in data all the time, so we should have a value specification class to describe it.

I would argue that your example of a question in a survey ('Do you eat shellfish?') is actually an example of a 'composite value specification' with a subquestion that is described by a boolean value specification. Questionnaires are often quite well defined with different sections so having a way of grouping these elements together is important.
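A rough Python sketch of that composite idea follows: a questionnaire section is a composite whose subfields each carry their own value specification. The class names, the subfield keys, and the validation rule are assumptions made for illustration, not a proposed OBI design.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class LeafValueSpec:
    """A simple value specification that accepts one Python type."""
    label: str
    python_type: type

    def accepts(self, value: Any) -> bool:
        return isinstance(value, self.python_type)


@dataclass
class CompositeValueSpec:
    """A value specification with named subfields, each with its own specification."""
    label: str
    subfields: Dict[str, Any] = field(default_factory=dict)

    def accepts(self, value: Any) -> bool:
        # The value must be a dict with exactly the declared subfields.
        if not isinstance(value, dict) or set(value) != set(self.subfields):
            return False
        return all(spec.accepts(value[name]) for name, spec in self.subfields.items())


boolean = LeafValueSpec("boolean value specification", bool)

# Hypothetical survey section built from the questions quoted in the thread.
seafood_section = CompositeValueSpec(
    "seafood section",
    {"ate_seafood": boolean, "ate_crustacean": boolean, "ate_shellfish": boolean},
)

answer = {"ate_seafood": True, "ate_crustacean": False, "ate_shellfish": True}
print(seafood_section.accepts(answer))
```

Because a `CompositeValueSpec` also has an `accepts` method, it can itself be used as a subfield, which is one way to group questionnaire sections.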

I like your specification of a ratio computation. Perhaps we might be able to describe equations / formula in this way? That might be pushing the limits of the use case though.

Let's keep chatting. I'll work up an example and suggest it to the main OBI group.

Public-Health-Bioinformatics commented 7 years ago

About 'Did you eat seafood?', I see your point that those subquestions have dependencies that suggest a composite structure. Alternately, all the more detailed questions could, if using a Food Ontology (e.g. FoodOn) be represented as selections within one tree value specification having a "seafood" root that details edible subclasses - a single hierarchic picklist.
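The single hierarchic picklist can be sketched in Python as a small tree with a chosen root. The labels below are hypothetical placeholders, not real FoodOn classes or IRIs, and `implied_answers` assumes the choice lies under the given root.

```python
from typing import Dict, List, Set

# Hypothetical parent -> children labels; not real FoodOn terms.
SUBCLASSES: Dict[str, List[str]] = {
    "seafood": ["fish", "shellfish"],
    "shellfish": ["crustacean", "mollusc"],
    "crustacean": ["shrimp", "crab"],
}


def allowed_values(root: str) -> Set[str]:
    """Return the root and every descendant: the picklist of a tree value specification."""
    allowed = {root}
    stack = [root]
    while stack:
        for child in SUBCLASSES.get(stack.pop(), []):
            if child not in allowed:
                allowed.add(child)
                stack.append(child)
    return allowed


def implied_answers(choice: str, root: str) -> List[str]:
    """Walk up from a choice to the root; each step is a 'yes' to a broader question."""
    parent_of = {child: parent for parent, kids in SUBCLASSES.items() for child in kids}
    path = [choice]
    while path[-1] != root:
        path.append(parent_of[path[-1]])
    return path


print(sorted(allowed_values("seafood")))
print(implied_answers("crab", "seafood"))
```

Selecting one node answers the broader questions along its path to the root, so the separate yes/no subquestions fall out of a single selection.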

About equations, I bet equations in a number of cases could be detailed by numerator/denominator elucidation. A paper that explores this is Hajo Rijgersberg's Ontology of Units of Measure and Related Concepts (http://www.semantic-web-journal.net/sites/default/files/swj177_7.pdf), specifically his "Use Case 4" (UC4), representing and checking formulas. But going further in this direction does start to look like a merger of OBI and an ontology of dimensional analysis. There all derived units have base units taken to some positive or negative power, i.e. as numerator or denominator.
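To make the "base units taken to some positive or negative power" idea concrete, here is a minimal Python sketch of formula checking by exponent bookkeeping. It is not the representation used in the cited paper; the function names are hypothetical, and treating "person" as a base unit is an assumption made only so the population density example works.

```python
from typing import Dict

Dimension = Dict[str, int]  # base unit -> exponent (negative means denominator)


def multiply(a: Dimension, b: Dimension) -> Dimension:
    """Multiply two quantities by adding exponents; drop units that cancel."""
    result = dict(a)
    for unit, power in b.items():
        result[unit] = result.get(unit, 0) + power
    return {unit: power for unit, power in result.items() if power != 0}


def invert(a: Dimension) -> Dimension:
    """Move every unit across the fraction bar."""
    return {unit: -power for unit, power in a.items()}


def divide(a: Dimension, b: Dimension) -> Dimension:
    return multiply(a, invert(b))


PERSON: Dimension = {"person": 1}   # assumption: a count treated as a base unit
AREA: Dimension = {"m": 2}

density = divide(PERSON, AREA)      # person per square metre
footprint = divide(AREA, PERSON)    # square metres per person

# Checking a formula: density * area should have the dimension of a count.
print(density, multiply(density, AREA) == PERSON)
```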

GullyAPCBurns commented 7 years ago

  1. On composite scales: The goal is to capture the structure of how the measurements are presented to the subject since that's how instruments like questionnaires are developed.

  2. There's a lot of complexity in these issues which scares me in terms of making something practical.

Thanks for the references!

GullyAPCBurns commented 7 years ago

Mark Miller has developed a model of ordinal value specification described here:

https://github.com/turbomam/ontology_collaboration/blob/master/OBI/ordinal_gleason/gleason_merged_reduced.owl

GullyAPCBurns commented 7 years ago

Here's the screenshot of the model

GullyAPCBurns commented 7 years ago

Following the previous discussions from James on modeling value specifications here:

https://docs.google.com/document/d/10Mt3zb73iGhM6j1pGbP7rGE4CPSKiQBGG_TM9hJcczU

With a couple of images:

GullyAPCBurns commented 7 years ago

Following this diagrammatic style, I suggest that we should perhaps use a simple extension of James' model to represent ordinal data with defined data properties to denote the characteristics of values. Thus, we can model a single ordinal value like this:

Note that, consistent with James' earlier model of height, we specify the instance value as specific to John's prostate's Gleason score, not the Gleason score in general. This means that we just specify Gleason scores in general as being 'ordinal value specifications' with possible rankings from 1-3, with a maximum score of 3, that are about a phenotypic quality pertaining to cancer.

In this approach, we keep things simple and we don't model the Gleason scoring methodology as a measurement scale (but see below).
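A minimal Python sketch of this simple approach, assuming the 1-3 ranking described in this comment: the specification holds the general facts (range, maximum, what it is about) and each value is specific to one subject. The class names, field names, and the rank of 2 are hypothetical and are not taken from the diagrams.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrdinalValueSpecification:
    """Describes the scale in general: allowed ranks and what it is about."""
    label: str
    min_rank: int
    max_rank: int
    is_about: str


@dataclass(frozen=True)
class OrdinalValue:
    """One value for one subject, checked against its specification."""
    subject: str
    rank: int
    specification: OrdinalValueSpecification

    def __post_init__(self) -> None:
        spec = self.specification
        if not spec.min_rank <= self.rank <= spec.max_rank:
            raise ValueError(f"rank {self.rank} outside {spec.min_rank}-{spec.max_rank}")


score_spec = OrdinalValueSpecification(
    label="Gleason score (as simplified in this comment)",
    min_rank=1,
    max_rank=3,
    is_about="phenotypic quality pertaining to cancer",
)

# The rank below is a made-up illustrative value.
johns_score = OrdinalValue("John's prostate", rank=2, specification=score_spec)
print(johns_score.rank, "of", johns_score.specification.max_rank)
```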

turbomam commented 7 years ago

Here's a visualization showing a revised model of histology value specifications. I'll be eclipse-watching on Monday, but I'm looking forward to discussing this more later in the week or on the 28th.

I recommend right-clicking to open a large rendering in a separate window.

What's new?

Not addressed here:

histo_grade_cat_class

cstoeckert commented 7 years ago

Discussed further on Aug. 21, 2017 call. OK with having an IRI for an instance of a value (e.g., a Gleason tumor grade) that can be re-used. More complicated is the scalar case, where values are drawn from all numbers. We don't want to make reusable IRIs for all numbers. Rules for comparing categorical values will be different from rules for comparing scalar values.

cstoeckert commented 7 years ago

For ordinal value comparisons, how do we encode ranks? Put them in a data property or an annotation property? Damion gave an example where adjacency may be needed.

Public-Health-Bioinformatics commented 7 years ago

The above diagram reminds me of one SIDE question I had - hopefully someone has a quick answer to this. I've wanted to use a relation like "derives from" in a number of situations where a specimen was a part of something but had been extracted from it (as above "OrganSectionJohnsColon" shows) so no longer is a part in the direct physical sense. But I avoided "derives from" since the RO definition is:

a relation between two distinct material entities, the new entity and the old entity, in which the new entity begins to exist when the old entity ceases to exist, and the new entity inherits the significant portion of the matter of the old entity

Should this definition be loosened up a bit to not imply that the "old entity ceases to exist" (in this case, "John")? How else to describe biopsies and removed organs? Or do we need a new relation: "extracted from"?

Public-Health-Bioinformatics commented 7 years ago

For the record here's the ordinal value mapping problem diagram (I'll note that the mapping between the two schemas is entirely my guesswork). As Gully mentioned, these could be treated as two separate ordinal scales with ranks separately assigned to each list's elements, and the mapping between them becomes a separate problem.

[Screenshot, 2017-08-21: ordinal value mapping problem diagram]

We'll just need to anticipate that a number of ordinal value specifications will change over time as new versions of standards are published. An ordinal value with rank 5 today might become rank 6 tomorrow as a new value is inserted into a standard. A "hard coding" of a value's rank probably has to include the version of the ordinal value specification that the rank pertains to. The attractive thing about including order relations between adjacent ordinal values is that a ranking can be automatically calculated/updated from that.
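The point about computing ranks from adjacency can be sketched in Python as follows. The function name, the version labels "v1" and "v2", and the temperature values are hypothetical; the sketch assumes each version's order relations form a single unbranched chain.

```python
from typing import Dict, List, Tuple


def ranks_from_adjacency(pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Derive 1-based ranks from (lower, next_higher) pairs forming one chain."""
    next_of = dict(pairs)
    has_predecessor = set(next_of.values())
    starts = [value for value in next_of if value not in has_predecessor]
    if len(starts) != 1:
        raise ValueError("pairs must form exactly one chain")
    ranks: Dict[str, int] = {}
    current = starts[0]
    while current is not None and current not in ranks:
        ranks[current] = len(ranks) + 1
        current = next_of.get(current)
    if len(ranks) != len(pairs) + 1:
        raise ValueError("pairs must form exactly one chain")
    return ranks


# Two versions of a made-up scale; the second inserts "lukewarm".
ADJACENCY_BY_VERSION = {
    "v1": [("cold", "cool"), ("cool", "hot")],
    "v2": [("cold", "cool"), ("cool", "lukewarm"), ("lukewarm", "hot")],
}

for version, pairs in ADJACENCY_BY_VERSION.items():
    print(version, ranks_from_adjacency(pairs))
```

Keying the derived ranks by version keeps a stored rank unambiguous when a later version of the standard inserts a value.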

cstoeckert commented 7 years ago

" How else to describe biopsies and removed organs?" Note that we use 'derives from' for specimen because that part of the original anatomical entity no longer exists as such. A lung specimen is not a type of lung but it does "begins to exist when the old entity ceases to exist" as a result of the specimen collection process, and it "inherits the significant portion of the matter of the old entity."

Public-Health-Bioinformatics commented 7 years ago

What is throwing me is that above "derives from" references John as a whole in its range, and since it is a relation "between two distinct material entities" in which the latter ceases to exist as a whole, not just in part, then John ceases to exist. Either "derives from" definition needs to be changed, or we need some other relation that doesn't make John cease.

cstoeckert commented 7 years ago

Good catch. That was an error in modeling on our part by leaving out an intermediate. The organ section derives from some material anatomical entity (part of colon) which was part of John. My inclination is not to change the 'derives from' relation but either to use it correctly or to use the specified output of a particular process if we want to just talk about a piece of something (like extraction).

Public-Health-Bioinformatics commented 7 years ago

I'll follow up with a separate side conversation with you about that "material anatomical entity" stuff as it pertains to GenEpiO modelling I'm doing right now. (I'm starting to envision the utility of enabling data structures to be represented in ontology - the what and where view - without necessarily referencing process entities at all; and instead have a separate "explanatory" why/how enhanced-view process ontology layer that rides on top of the data structure view. Presumably others have played with this distinction?)

GullyAPCBurns commented 7 years ago

This is very interesting. I think this is a common issue for people doing information integration work and I'll ask some people here who work in that field what they would recommend doing. In the meantime, do you have a good source reference that describes each scale?

We should write up this work on Value Specifications for a research paper and this could serve as one of several examples.

Best

Gully

On Aug 21, 2017, at 11:32 AM, Damion Dooley wrote: [quoted copy of the ordinal value mapping comment above]

Public-Health-Bioinformatics commented 7 years ago

It is definitely worthy of a paper when all the dust has settled! About "medical condition scale", our GenEpiO needed a subject or patient general health status, and I realized there can be quite a spectrum to that besides "dead, pathological, or alive" suggested by one standard. Source of these two scales is a Wikipedia writeup: https://en.wikipedia.org/wiki/Medical_state . I foresee an epidemic line list report that would derive trends from a categorical variable of this general nature from patient records or syndromic surveillance reports coming in from hospitals globally, rather than particular patient record diagnosis details.

GullyAPCBurns commented 7 years ago

I think it makes sense to incorporate existing coding schemes that are used as instruments in existing systems as much as possible.

Thanks for the feedback!

G

On Aug 22, 2017, at 11:05 AM, Damion Dooley wrote: [quoted copy of the "medical condition scale" comment above]

turbomam commented 7 years ago

Posted https://github.com/turbomam/ontology_collaboration/blob/master/OBI/ordinal_proposal_20170922/ovs_merged.owl for discussion on Monday September 25th

turbomam commented 4 years ago

I would like to revisit creating an 'ordinal value specification' class.