Problem of Over-generalization

johnbeve commented 4 years ago

Werner has raised a concern worth reflection: '90% of the terms in the IDO: BacterialSpeciesX ontology, are identical to 90% of the terms in IDO: ‘BacterialSpeciesY’ with X replaced by Y.'

Barry is aware of the issue: 'This is indeed a puzzle that we have faced for some time. When we were building the lattice of staph. aureus IDOs it became clear that many ontologies will be needed just for staph. aureus, given that there are so many strains, and that cows and other non-humans can serve as hosts. '

I'm opening this up so we can explore the lattice proposal Barry, Alex, and Lindsay worked out as a partial response to this issue here: http://ontology.buffalo.edu/smith//articles/ICBO2012/IDO_Lattice.pdf

PhiBabs935 commented 4 years ago

The following are the seeds of our reply in the IDO paper (based on suggestions from Barry). If anyone watching has thoughts on what else we could say, please let me know:

"There is a worry that our approach leads to an unneeded combinatorial explosion of ontologies. Indeed, in the era of personalized medicine, it may be that every single patient will have their own ontology. Certainly, there will be many more combinations of disease phenotypes that need to be dealt with, as time goes by. But that this is happening is not a matter of combinatorial explosion, but of growth dictated by the combinations which actually exist. When we were building the lattice of staph. aureus infectious disease ontologies [xx] it became clear that many ontologies will be needed just for staph. aureus, given that there are so many strains, and that cows and other non-humans can serve as hosts. The justification for this in the case of SARS-CoV-2 is that having an ontology for SARS-CoV-2 will give us a baseline for dealing with SARS-CoV-3 when it comes along."

I did add the following paragraph as one of the issues in the Limitations section of the IDO paper (this text incorporates a comment from Werner):

"As discussed in the foregoing, a notable feature of this approach is that it facilitates the inference of a lattice or network of infectious disease ontologies. While we do not believe that this leads to an unnecessary combinatorial explosion of ontologies, some tough questions do still remain. What is the principle for stopping somewhere? Each individual strain of virus or bacterium? Most antibacterial drugs work only against certain organisms, many organisms have resistance to certain bacterial agents. Where are these specific, yet overlapping, entities going to be represented? The bacterial [GENUS?] level is too high; the species level and below is too low. In that case we need something in-between. If we leave it at a level too low, there is no guarantee that distinct but similar entities that could be represented using the same template will actually be represented in a similar way. We leave addressing these important issues for future work."
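To make the template point concrete for myself, here is a minimal sketch in plain Python. It is not real IDO tooling: the label patterns, the species names, and the extra "toxin Q" term are all made up for illustration. It only shows how two species ontologies generated from one template end up sharing most of their terms once X is replaced by Y, which is the pattern Werner describes.

```python
# Hypothetical label templates; none of these are actual IDO terms.
TEMPLATES = [
    "{sp} infection",
    "{sp} infectious disease",
    "{sp} host",
    "{sp} virulence factor",
    "antibacterial resistance of {sp}",
]


def instantiate(species, extra_terms=()):
    """Fill every template with one species name and add species-specific terms."""
    return {t.format(sp=species) for t in TEMPLATES} | set(extra_terms)


def shared_fraction(species_a, terms_a, species_b, terms_b):
    """Fraction of A's terms found in B after swapping A's species name for B's."""
    renamed = {t.replace(species_a, species_b) for t in terms_a}
    return len(renamed & terms_b) / len(terms_a)


# One made-up species-specific term for X; Y uses only the shared template.
x_terms = instantiate("BacterialSpeciesX", ["BacterialSpeciesX toxin Q"])
y_terms = instantiate("BacterialSpeciesY")
print(shared_fraction("BacterialSpeciesX", x_terms, "BacterialSpeciesY", y_terms))
```

If the terms really are generated from one shared template, similar entities are guaranteed to be represented in a similar way; the open question is at which level the template should live.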

johnbeve commented 4 years ago

I'm still not sure I really see the issue here. Let me say why as I offer a tentative response.

First, I think there's a natural stopping point for building ontologies: don't descend to tokens, stay at types. This helps with the personalized medicine example. Suppose we have an ontology for John. This ontology will presumably be composed of all and only terms from existing ontologies. They may be imported directly and used, they may be used to define new terms (which is just a way to abbreviate old terms), but I don't imagine introducing new primitives for John. The only reason I can see for thinking that'd be needed would trade on conflating types and tokens. Since we're not conflating them, the result is a new ontology that's entirely built out of existing ontologies. Moreover, there's a clear way to characterize the overlap (lattices aside) between existing ontologies and the personalized John ontology, namely, via the notions of definitional (and conservative) extension.

The virus strain example requires new terms, and there will likely be many needed. However, the only primitive terms I see needed for strains are terms for the specific strain. Everything else can be constructed based on existing ontologies and shown to be definitional and conservative extensions of those ontologies. Suppose we need a SARS-CoV-3-focused ontology. We have IDO, VIDO, CIDO, OBI, etc. We import and define what we can. We introduce the virus SARS-CoV-3 as a primitive subclass of coronavirus (strictly speaking, this would just be added to CIDO; ignore that for a moment). What we have then is an ontology largely composed of existing ontologies, with a proper part composed of a new primitive combined with existing terms. Those are the only new ontological entities. Now, we can ask whether it's justified to add them, and note the answer is obviously yes, since by assumption, we need to represent SARS-CoV-3 data.
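Here's a rough sketch of that bookkeeping in plain Python, just to show the idea. Everything in it is an assumption for illustration: the dictionaries are toy stand-ins for IDO and CIDO (not their actual contents), and `build_extension` is a hypothetical helper, not part of any ontology tool. It separates imported terms, new primitives, and defined terms, and rejects anything that isn't anchored in what already exists.

```python
# Toy stand-ins for imported ontologies: term -> parent (None for a root).
# The term names are illustrative, not the actual contents of IDO or CIDO.
IDO = {"infectious disease": None, "pathogen": None, "virus": "pathogen"}
CIDO = {"coronavirus": "virus", "SARS-CoV-2": "coronavirus"}


def build_extension(imports, primitives, definitions):
    """Merge imported ontologies, then add new primitives and defined terms.

    primitives:  new term -> existing parent term
    definitions: new term -> set of already known terms it abbreviates
    Returns (imported terms, new primitive terms, defined terms).
    """
    terms = {}
    for onto in imports:
        terms.update(onto)
    imported = set(terms)

    # A new primitive must be placed under a term we already have.
    for term, parent in primitives.items():
        if parent not in terms:
            raise ValueError(f"{term}: unknown parent {parent}")
        terms[term] = parent

    # A defined term may only use imported terms or the new primitives.
    for term, parts in definitions.items():
        missing = parts - set(terms)
        if missing:
            raise ValueError(f"{term}: defined from unknown terms {missing}")

    return imported, set(primitives), set(definitions)


imported, new_primitives, defined = build_extension(
    imports=[IDO, CIDO],
    primitives={"SARS-CoV-3": "coronavirus"},
    definitions={
        "SARS-CoV-3 infectious disease": {"infectious disease", "SARS-CoV-3"},
    },
)
print(new_primitives)
```

In this toy case the only genuinely new entity is the single strain primitive; the defined term is just an abbreviation over terms we already had.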

So, if the worry is we'll have a bunch of new ontologies, that seems a feature rather than a bug, and any new stuff will likely be justified. I'm also inclined to say we should be as specific as researchers want, while not confusing types and tokens (and correcting their use if they do).

PhiBabs935 commented 4 years ago

Good stuff. Honestly, I put that stuff in the limitations section because I was trying to get the paper ready to go back to Werner for a second round of review (Barry has been pressuring me to) but I didn't really know what else to say myself. (Of course the need to revise the pathogenic disposition stuff that arose this week delayed sending it off again anyway!!!)

I will think today about how to incorporate this, along with the points from the limitations section, into the main text. This was one of the biggest issues that I believe Werner wants us to address/defend against, so I am glad we have more to say.

johnbeve commented 4 years ago

Me too, I think if he'd been right that'd be a huge problem!

On Thu, Jul 30, 2020 at 1:01 PM Shane Babcock notifications@github.com wrote:

Good stuff. Honestly, I put that stuff in the limitations section because I was trying to get the paper ready to go back to him for a second round (Barry has been pressuring me to) but I didn't really know what else to say myself. (Of course the need to revise the pathogenic disposition stuff that arose this week delayed sending it off again anyway!!!)

I will think today about how to incorporate this, along with the points from the limitations section, into the main text. This was one of the biggest issues that I believe Werner wants us to address/defend against, so I am glad we have more to say.


PhiBabs935 commented 4 years ago

I know Barry suggested having an ontology for patients. What is the exact benefit of it? And how would it go? Does it involve just importing whatever terms are needed to represent the person's patient data, asserting OWL axioms connecting the individual patient to those terms (e.g. 'Bob participant in COVID-19 disease course', etc)?
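To show roughly what I have in mind, here is a minimal sketch in plain Python, using triples rather than real OWL syntax. All labels are illustrative, and I am assuming (I am not sure this is what Barry intends) that Bob's disease course is modeled as its own individual that instantiates the imported class.

```python
# Terms imported from existing ontologies (labels are illustrative).
IMPORTED_CLASSES = {"human being", "COVID-19 disease course"}
IMPORTED_RELATIONS = {"instance of", "participant in"}

# A hypothetical patient ontology: individuals plus assertions about them.
patient_axioms = [
    ("Bob", "instance of", "human being"),
    ("Bob's disease course", "instance of", "COVID-19 disease course"),
    ("Bob", "participant in", "Bob's disease course"),
]


def new_vocabulary(axioms):
    """Return (classes, relations, untyped names) that were not imported."""
    individuals = {s for s, p, _ in axioms if p == "instance of"}
    used_relations = {p for _, p, _ in axioms}
    used_classes = {o for _, p, o in axioms if p == "instance of"}
    # Names that are neither typed individuals nor imported classes.
    untyped = {
        x
        for s, _, o in axioms
        for x in (s, o)
        if x not in individuals and x not in IMPORTED_CLASSES
    }
    return (
        used_classes - IMPORTED_CLASSES,
        used_relations - IMPORTED_RELATIONS,
        untyped,
    )


print(new_vocabulary(patient_axioms))
```

If that is the right picture, the patient ontology adds individuals and assertions but no new classes or relations.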

PhiBabs935 commented 4 years ago

"Most antibacterial drugs work only against certain organisms, many organisms have resistance to certain bacterial agents. Where are these specific, yet overlapping, entities going to be represented?"

Now that I reread this, I am not sure I know what he is referring to as the specific yet overlapping entities. Is it about many organisms having resistance to the same antibacterial?

Either way, do you see what you wrote above as addressing this worry?

johnbeve commented 4 years ago

For patients, as I understand, Barry has in mind ontologies specific for individuals, since - despite our commonalities - each of us differs in various biological ways. I think it's easiest to see if you consider toy examples: most mammals have hair, lots have brown hair, fewer have specific shades; all have mass, fewer have specific mass X; and so on. When we get specific on determinable properties, we'll have unique ontologies reflecting particular individuals. Broadly speaking, these ontologies will overlap at the level of determinables, but not always at the level of determinates, and the latter is crucial for fine-grained patient diagnoses.
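A toy sketch of the determinable/determinate point, in plain Python. The two patient profiles and their values are invented for illustration, and `compare` is just a hypothetical helper.

```python
# Hypothetical patient profiles: determinable -> determinate value.
john = {"hair color": "brown", "mass": "82 kg", "blood type": "O+"}
mary = {"hair color": "brown", "mass": "61 kg", "blood type": "A-"}


def compare(a, b):
    """Return (shared determinables, determinables whose determinate values match)."""
    shared_determinables = set(a) & set(b)
    shared_determinates = {d for d in shared_determinables if a[d] == b[d]}
    return shared_determinables, shared_determinates


determinables, determinates = compare(john, mary)
print(sorted(determinables))
print(sorted(determinates))
```

Here the two profiles overlap on every determinable but agree on only one determinate, which is the sense in which the individual ontologies overlap without being identical.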

I don't think I understand Werner's comment about antibacterial drugs. If the concern is with the specificity of the ontologies, then I think the above response addresses it. If his concern (perhaps signaled by 'where...going to be represented') is whether there will be a specific ontology for these entities, then I'd say 'sure, if researchers want one.' The only way I can understand this as a worry is by assuming new ontologies built to cover such things will generate entirely disconnected ontologies covering the same domain as existing ontologies. But as I mentioned above, we can determine overlap in ontologies, and distinguish (justified) primitives from the rest, so I'd reject that underlying assumption.

PhiBabs935 commented 4 years ago

Cool. I will try to finish my work on the IDO paper before you get back to it later this week.

As I said, I began incorporating this stuff into the section on the IDO partitioning and lattices. But since this stuff is from your mind originally, can I assign the task of incorporating it to you?

infectious-disease-ontology-extensions / Fork-From-IDOCore

Problem of Over-generalization #14