obi-ontology / obi

The Ontology for Biomedical Investigations
http://obi-ontology.org
Creative Commons Attribution 4.0 International

Value Specification as datatype #868

Closed Aqua1ung closed 8 months ago

Aqua1ung commented 6 years ago

I have been trying to understand the motivation behind positing the Value Specification (VS) class as it is currently defined in OBI. I think it is important to come clean about this before work on VS piles up to the point that VS becomes "too big to fail"--which constitutes, in my opinion at least, a very dangerous attitude (namely, refraining from radically restructuring or abandoning a modeling route just because too much work has been done on it by too many respectable people). The current definition of VS as recorded in the official OBI release does little to allow a clean demarcation of VS from some of the related ICE classes. Be that as it may, my main observation with respect to the way in which OBI attempts to define VS is the following:

Modeling value specifications as entities requires, at the very least, the capacity to model the set of real numbers (“the power of the continuum”) as entities.

This fact obviously constitutes a very powerful formal argument against treating VSs as entities, and hence effectively kills the VS-as-entities route. Note that I have made no mention of the ICE aspect of the matter: if anything, slamming the brakes on the VS ICE enthusiasm emerges as an added bonus of disposing of VS-as-entities.

Aside from the formal aspect of the issue, attempting to model the elements of the continuum as entities is a dead giveaway that poor modeling decisions have happened somewhere along the way, and that, quite likely, the modeling philosophy behind one's modeling work needs serious reassessment (to put it mildly). In particular, one must have done something wrong if one is compelled to use VSs as triple subjects. A healthy modeling endeavor should never lead one to attempt to model the continuum using the standard discrete tools of a modeling language like OWL. Standard OWL resources such as classes, properties, and individuals have emphatically not been designed with this aim in mind. There are, indeed, tools in OWL that do allow one to represent infinite sets, though these tend to be more obscure and, as such, less utilized even by experienced ontologists. These tools, however, do not represent infinite sets as regular classes of individuals.

It is, thus, my opinion that whoever introduced the VS class was actually looking to use it in a manner that is characteristic of Datatypes, though he/she was quite possibly unaware that OWL 2 allows users to define their custom datatypes.

In conclusion, I strongly recommend that VS be replaced with a datatype (be it pre-existing or custom-designed).
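For readers unfamiliar with the feature being invoked, OWL 2 datatype definitions let one carve a custom datatype out of a built-in one by facet restriction. A minimal sketch, with all names (ex:massInGrams, the minimum bound) purely illustrative and not drawn from OBI:

```turtle
@prefix ex:   <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# A custom datatype for masses expressed in grams: non-negative decimals.
ex:massInGrams a rdfs:Datatype ;
    owl:equivalentClass [
        a rdfs:Datatype ;
        owl:onDatatype xsd:decimal ;
        owl:withRestrictions ( [ xsd:minInclusive "0"^^xsd:decimal ] )
    ] .
```

A data property range can then be set to ex:massInGrams, so that values remain literals rather than individuals.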

bpeters42 commented 6 years ago

Briefly: OBI started with modeling information as entities over 10 years ago, with explicit approval from Barry (which took a while). The original motivation was that we have to routinely deal with 'data items' that are generated as outputs of experiments. Originally, Barry took your stance that we should only model truth, so we should not have to worry about information and should rather model 'what is real'. But the whole point of OBI is to model the reality of how investigations are performed. And one crucial element of investigations is that different experiments can generate conflicting data; that data is transformed (averaged, outliers removed); and that there is a step of going from instance-level 'data items' to class-level 'conclusion statements about reality'.

I will be the first to admit that OBI is a long way off from doing these things perfectly, and there is definitely a problem in that what is in OBI proper has not been completely updated to reflect the overall goals that we have outlined.

I am hoping this is useful. Without wanting to stifle discussion, I am worried about how many resources you and we are spending explaining something that, in its current form, is not documented to the degree that it is completely consistent. If you are frustrated by this response and by our unwillingness to reconsider modeling decisions (which I would very much understand), I would ask you to allow us time to clean the model up into a consistent state before asking for your feedback again.

Thank you for your input,

Bjoern

We will not fundamentally question over 10 years of work. Especially as it se


--
Bjoern Peters
Associate Professor
La Jolla Institute for Allergy and Immunology
9420 Athena Circle, La Jolla, CA 92037, USA
Tel: 858/752-6914 Fax: 858/752-6987
http://www.liai.org/pages/faculty-peters

Aqua1ung commented 6 years ago

Hi Bjoern, see my reply inline please.

On 10/14/2017 11:48, bpeters42 wrote:

Briefly: OBI started with modeling information as entities over 10 years ago, with explicit approval from Barry (which took a while). The original motivation was that we have to routinely deal with 'data items' that are generated as outputs of experiments.

No argument there. If you read my postings carefully, you'll see that I am explicitly arguing that representing ICEs is unavoidable, especially in situations in which measurements can only be deemed approximate. Many ICEs, however, are dispensable in favor of whatever they represent--basically in situations in which the representation is 100% accurate, with absolutely no chance of misrepresentation. My worry is that, caught in the fever of building the ICE scaffolding, OBI is in danger of losing sight of the fact that a great many ICEs are completely (and easily) dispensable in favor of their real counterparts. The basic situation that comes to mind is assays whose results take values in a discrete set, as opposed to assays whose values lie in a set that has the power of the continuum (this distinction may or may not coincide with what you call "categorical" vs. "scalar"). The latter situation cannot avoid appeal to ICEs, while the former can afford to point directly to the real stuff, without the ICE middleman. (Yes, the former may have to use ICEs as well, in case the result of the measurement needs an ID that must be preserved in some official record, etc.--again, I tried to explain these options in my previous postings.)

  • we are not planning to model real numbers as entities; we are using a relation 'has value' from instance to xsd:float or whatever the OWL formalism was to allow for numbers etc.

While you may not be planning to model real numbers as entities, you will effectively have to. Value Specifications (VS) as they are currently defined, and as they have been presented to me by Chris among others (you yourself take this stance in the paragraph below), include potential results of measurement processes. As such, OBI will, in effect, have to allocate an IRI for every (VS "containing" a) real number! If 10.1 g is a potential measurement result, OBI will have to have an IRI in stock for it. If 10.0002584 g is a potential measurement result (and I do not see why not), OBI will have to give it an IRI. And so on. Every conceivable potential measurement result will have to have an IRI in OBI right off the bat! Not only that, but OBI will also have to have unique IRIs ready for the same numbers on every other scale (length, volume, amperes, volts, etc.)--say, millimeters (mm): 10.1 mm, 10.0002584 mm, 10.0002585 mm, and so on. But wait, it gets better! What if someone wants to use meters instead of millimeters? Not only will you have to have a continuum-power infinity of IRIs to represent values in millimeters, but also another continuum-power infinity of IRIs to capture values in meters. And so on and so forth.

Needless to say, not only is that impossible using the standard discrete tools of OWL, but this is not even how these standard resources (the OWL entities) have been designed to be used! Using OWL entities to "model" this constitutes patent misuse of OWL resources. But there is hope, so rejoice: OWL 2 fortunately does include resources whose aim is precisely to capture the stuff that OBI has so far been trying to cram and shoehorn into the terribly inappropriate framework of OWL entities. Those resources are called datatypes--more precisely, custom, user-designed datatypes. Datatypes have been purposely designed to capture potential values of ... anything. Datatypes are the OWL equivalent of the world of potentialities. While OWL entities (have been designed to) represent the actualia, OWL datatypes (have been designed to) represent the potentia.

Now, I anticipate a response along the following lines: well, while it is true that the VS class encompasses all values of potential measurements, we will not have to represent all these values right at the outset, but instead we will add them as they "happen," or "as we need them." Here are two reasons why this would be wrong:

  1. The stronger reason: This defeats the purpose of having a class of potential measurement values. In OWL, if an entity is known to exist, it needs to be represented (today, tomorrow, next year, etc.). On the other hand, all the "entities" that represent potential measurement values are known to "exist." How will you ever represent these "entities," knowing that it is logically impossible to represent them all? Granted, if OBI had no choice but to proceed as it has so far, I would not even bother raising this issue: if OBI can only be a hack job, then hack job it is! My point, however, is that it does not have to be a hack job! There is a perfectly reasonable, elegant, and purposely designed solution for dealing with these "value specifications"; ontologists only need to (a) be made aware that it exists, and (b) have the will (and openness) to embrace it.
  2. The weaker reason: It also looks to me that this is roughly what the Measurement Datum class was intended to capture--namely measurement results of actually performed assays--hence the class of Value Specifications emerges as a duplicate: you either have the class of Value Specifications fully populated with known potential measurement values (which is physically and logically impossible--see #1 above), or you have the VS class that achieves largely the same objectives as the Measurement Datum class. Either way, the VS class is not needed.
  • the point of 'value specification' is that we want to compare for example the value "10 g" when it is used in data items (such as the outputs from experiments e.g. "the mouse weighed 10 g" that have links to existing physical instances) to when it is used in experimental protocols (such as "Add 10 g of sugar to the solution"), or predictions ("after drug treatment, we predict that the mouse will weigh less than 10g")

See comments above. The only minor inconvenience that I see in using datatypes to represent VS is that datatypes cannot be used in subject place. That should easily be fixable by making sure that never happens. All the examples you just mentioned can easily be rephrased so as to avoid having "10g" in subject position. Problem solved.
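To make the rephrasing concrete, here is a minimal sketch of the three "10 g" examples from the quoted paragraph with the value kept as a datatyped literal in object position. Every class and property name here (ex:has_mass_in_grams, etc.) is a hypothetical illustration, not an actual OBI term:

```turtle
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# "The mouse weighed 10 g" -- measured output of an assay
ex:weighing1   ex:has_mass_in_grams        "10.0"^^xsd:decimal .

# "Add 10 g of sugar to the solution" -- protocol step
ex:protocolStep3 ex:specifies_mass_in_grams "10.0"^^xsd:decimal .

# "We predict the mouse will weigh less than 10 g" -- prediction
ex:prediction7 ex:predicts_mass_below_grams "10.0"^^xsd:decimal .
```

In each triple the number is a plain literal, so no IRI for "10 g" is ever minted and nothing forces the value into subject position.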

I am hoping this is useful. Without wanting to stiffle discussion, I am worried about how much resources you and we are spending explaining something that in its current form is not documented to the degree that it is completely consistent.

I usually do my homework pretty thoroughly before attempting to change people's minds. Not only that, but I usually prefer to err on the side of letting sleeping dogs lie if I don't think I have a very acute issue to raise. In short, I don't usually speak without a damn good reason :-) I know how much academics prize consensus, and I very much hate to be on the dissenting side. I do not relish the posture of dissenting party.

If you are frustrated by this response and by our unwillingness to reconsider modeling decisions (which I would very much understand), I would ask you to allow us time to clean up a consistent modeling before asking for your feedback again.

Yes, I certainly understand the burden of an evolving model, though the trouble is that I actually have to work with actual concrete data sets that need to be captured in this mold. As it is right now, I am afraid that I will not be able to, and that may impact very concrete deadlines.

Thanks,

C

cstoeckert commented 6 years ago

Cristian, sorry, but I am not persuaded by your arguments for doing away with all value specifications because of the points raised about numbers. I do agree that it needs more work, but I don't see datatypes working for the categorical value specifications that I need for the tumor TNM classifications (and need to get into OBI now!). I also don't see these pointing directly to real stuff, as TNM stages are defined (by the pathologists who use them) as combinations of T, N, and M values (see for example: https://staging.seer.cancer.gov/tnm/input/1.0/ovary/path_stage_group_direct/). The values for T, N, and M are conditional on different scenarios (see pT2 for example in https://staging.seer.cancer.gov/tnm/input/1.0/lung/path_t/). Each and every one of these can and should have an IRI. This is essentially what I am proposing in #856.

Thanks,
Chris

Aqua1ung commented 6 years ago

Hi Chris, I can see that I have failed to make myself understood, and I can only blame myself for that. I will try to keep my arguments extremely brief, as I know you guys are awfully busy. Please read below inline.

Chris: I am not persuaded by your arguments for doing away with all value specifications

Christian: I have never proposed "doing away" with Value Specifications. All I am proposing is to represent them using the proper representation techniques. Entities have not been designed for what you are trying to use them for. Datatypes, on the other hand, have. That is precisely why datatypes have been added to OWL, so people do not have to add classes that are, in effect, duplicates of (or isomorphic to) standard mathematical objects.

Chris: I don't see data types working for categorical value specifications that I need for the tumor TNM classifications (and need to get in OBI now!).

Christian: TNM was one of the focal points of our work at IFOMIS (just ask Mathias). As such, I happen to possess some good insight into how TNM entities can be captured in a very much Barry Smith-approved, ICE-free, datatype-free, real-Independent-Continuant manner. (There was no ICE/IAO in those days.) Not only that, but this can be done pretty quickly--no longer, in fact, than it would take you to capture them as VS/ICE.

Chris: I also don't see these pointing directly to real stuff as TNM stages are defined (by the pathologists who use them) as combinations of T, N, and M values (see for example: https://staging.seer.cancer.gov/tnm/input/1.0/ovary/path_stage_group_direct/). The values for T, N, and M are conditional on different scenarios (see pT2 for example in https://staging.seer.cancer.gov/tnm/input/1.0/lung/path_t/). Each and every one of these can and should have an IRI. This is essentially what I am proposing in #856

Christian: Yes, TNM entities will have an IRI each, though they will not be value specifications, nor will they be datatypes either. (OWL did not allow custom user-designed datatypes at the time, nor did we feel that we needed them for TNM.) Also, as I mentioned in my previous post, datatypes are useful mostly for representing infinite sets. Feel free to ask me how to capture TNM entities as entities under the Independent Continuant umbrella.

Christian: This being said, I realize that pushing this angle can be counterproductive, hence this has been my last intervention on any matter pertaining to ICEs, Value Specifications, and Datatypes--barring, of course, explicit requests that I continue. I thank you and Bjoern for considering my proposals, and for replying to my posts.

C

Public-Health-Bioinformatics commented 6 years ago

As a relative newcomer to OBO/OBI, I find these discussions interesting and am willing to learn the issues this way (though I am short on time too). However, is there background reference material (on OBI's side or in general philosophy) where OBO/OBI's position on VS and real numbers is stated? If it doesn't exist, a summary of the position and the decision on this topic would be good for all newcomers.

jamesaoverton commented 6 years ago

Yes, we use RDF literals with the appropriate datatype to represent numbers, so for a scalar value specification X we could have a triple X 'has specified numeric value' "70.0"^^xsd:real and another triple for the units. We can create such value specifications as needed, identifying them with new IRIs or blank nodes as required. We can compare two scalar value specifications by their units and numerical values. All this is easy and common in OWL, RDF, or SPARQL, and has been sufficient for all my modelling needs since we developed the approach a few years ago.
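A minimal Turtle sketch of the pattern described above. The OBI/IAO/UO IDs used here (OBI_0001931 'scalar value specification', OBI_0001937 'has specified numeric value', IAO_0000039 'has measurement unit label', UO_0000021 'gram') are believed correct but should be verified against the current release; xsd:decimal stands in for the xsd:real of the text, since RDF literals must use a standard XSD datatype:

```turtle
@prefix ex:  <http://example.org/> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# A scalar value specification for "70.0 gram"
ex:vs1 a obo:OBI_0001931 ;                  # scalar value specification
    obo:OBI_0001937 "70.0"^^xsd:decimal ;   # has specified numeric value
    obo:IAO_0000039 obo:UO_0000021 .        # has measurement unit label: gram
```

Because ex:vs1 is an individual with its own IRI, further triples (precision, tolerance, provenance) can be attached to it directly.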

I might want to write a lot of triples with value specification X as the subject. In particular, I foresee that we will want to add information about the precision of X, either as measured or as a required tolerance for a setting. I haven’t run across a case where I need a number to be a subject, but even then I don’t see the necessity of giving IRIs to numbers. Literal numbers are fine.

From an ontological perspective, BFO carefully avoids mention of abstracta such as numbers. Other upper ontologies do include abstracta, but they are difficult to handle, and I don't expect BFO to include them any time soon. Following BFO, in IAO and OBI we talk about concrete representations of numbers (in writing, in RAM) without talking about numbers in the abstract. Again, RDF literals suit this purpose well.

Public-Health-Bioinformatics commented 6 years ago

ok, thanks for explaining BFO/OBI & RDF literals.

Aqua1ung commented 6 years ago

I intend to do a little demonstration on Monday during my chairing of the OBO meeting, on how custom-designed datatypes work. The (very) short answer is: they work no differently than any of the built-in datatypes (xsd:string, xsd:float, etc.). Until Monday however, I will endeavor to answer some of the issues raised on the #879 thread.

[Digression] [Short version] Being compelled to use value specifications in subject position is very possibly the result of a questionable modeling decision that has boxed one into this corner.[/short version]

[Long version] I have to confess that I have not been able to figure out a way to represent ordered pairs--i.e. value specifications made up of two or more literals ((5, kg), (21, mg), (37, degree Celsius), etc.)--as datatypes. I toyed around with the idea of making a datatype out of lists (rdf:List) of literals, though it turns out that you cannot, at least not in the current OWL incarnation: the usual Kuratowski definition has thus not yet been assimilated into OWL. However, while the desideratum of having datatypes made out of ordered pairs (of literals) may be a legitimate concern, the puzzling question remains: why would anyone need that? Why would anyone need bi-dimensional datatypes anyway? I can, as a matter of fact, imagine situations where ordered pairs of literals might be required, though my impression is that, at least as far as Value Specifications are concerned, if you've boxed yourself into a corner where appeal to either entities or multi-dimensional datatypes appears to be the only way out, you must have done something wrong on the way there. There must have been some "less fortunate" modeling decision made somewhere in the past that has led to "having to" represent outcomes of measurement processes as entities or multi-dimensional datatypes. One such decision that comes to mind is the idea of capturing/modeling speech about units of measure in OBI, as opposed to handling that in the software, somewhere "outside." Nevertheless, should you be hell-bent on capturing speech about measurement units within OBI (which, again, I strongly advise against), one can think of different ways to handle measurement units that do not require representing value specifications as entities (or, horribile dictu, ordered pairs of literals, or God knows what other funky contraption), such as attaching measurement units to the measurement process itself.
As a physicist, this one seems to me pretty reasonable: once you've decided to carry out an experiment, surely you must've settled on a measurement unit to express your results as preparation for said experiment. I know I would. At the very least, your tools must have been calibrated in some unit or other. Again, I find it pretty natural to think of the measurement unit as a property of the experimental setup (and hence of the assay itself), and derivatively, of the output. In case one does not like the idea of speaking about assays as being characterized by a measurement unit, (and, again, I, for one, can't see why one would not), one should be free to move on to the next target, the measurement datum. Speak about the measurement datum as being characterized by a measurement unit. No need to push it further along, hence no need to turn value specifications into entities. Let value specifications be strings, numbers, or whatever other literals there may be.[/long version] [/digression]
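The alternative suggested above can be sketched as follows: the measurement unit is recorded on the assay (or, alternatively, on the measurement datum), and the value itself stays a plain literal. Every property and class name here (ex:uses_measurement_unit, etc.) is a hypothetical illustration; only UO_0000021 (gram) is an actual ontology IRI, and even that should be double-checked:

```turtle
@prefix ex:  <http://example.org/> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Unit attached to the assay: the setup was calibrated in grams
ex:assay1 a ex:MassSpectrometryAssay ;
    ex:uses_measurement_unit obo:UO_0000021 ;   # gram (UO:0000021)
    ex:has_specified_output  ex:md1 .

# The output carries only a literal value; its unit is inherited from the assay
ex:md1 ex:has_value "10.1"^^xsd:decimal .
```

Under this scheme no value specification individual is needed: the unit travels with the process description, and the number remains a literal in object position.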

For example, to pull assay results out of such a model, a query along these lines suffices (assuming the default prefix : has been declared for the ontology's namespace):

SELECT DISTINCT ?msa ?vs
WHERE {
  ?msa a :MassSpectrometryAssay ;
       :has_specified_output ?md .
  ?md :has_value_specification ?vs .
}

You'll get a table with mass spectrometry assay IRIs in one column and numbers (or strings) in the other. The one rule of thumb is: as long as you don't require value specifications in subject position (and why would anyone want that?), you should be safe. If, on the other hand, one feels compelled to use value specifications in subject position, this, in my experience, is the likely result of a questionable modeling decision made somewhere else in the model--a decision that has boxed you into this corner. (About that, see more in the "digression" above.)

GullyAPCBurns commented 6 years ago

Christian,

After looking into this for myself, I think we can find a compromise. I agree that we probably don't want to overload data too much with this sort of representation. However, in terms of describing the classes of data that are likely to be generated by experiments, Value Specifications are likely to be useful.

G