What is the relationship between microbiome, taxa, body site and health?

kbeckenrode commented 4 years ago

@wdduncan @lwaldron @lgeistlinger @cmungall

From our conversation on Thursday (thank you btw @wdduncan, it was really helpful), we discussed how to structure the relationship between the microbiome, taxa, body site, and health/dysbiosis. Here is a summary of the main discussion:

Do we add subclasses of microbiomes based on body site, with taxa as children of each body-site microbiome (e.g., 'E. coli' is contained in the 'skin microbiome', which is part of the 'human microbiome')? Or do we list all taxa and record the body site in an annotation property, such as a comment? The latter option allows for more logical flexibility.
Even in healthy people, not all bacteria are present in everyone's microbiome, so there needs to be flexibility in the logic when building the relationship between taxa and their body-site location. This relationship is complicated even further by dysbiosis (an impaired microbiome, defined by an imbalance in the abundance and composition of its population). How do we build relationships for bacteria associated with health and/or dysbiosis?

Thanks for your thoughts :)

lgeistlinger commented 4 years ago

I would model this again after the GO cellular component ontology.

If modeled accordingly, the result would be a clone of the Uberon anatomy (body-site) ontology, with microbes annotated to each body site of the ontology.
Similar considerations apply to gene products, which might or might not be present in cellular components depending on health status. I think the GO solution would again be to reflect this via the reference and evidence for each microbe annotated to a body site. In general, I think the GO cellular component annotations are based on observations in healthy individuals, but you could follow up on whether there is more to it.

kbeckenrode commented 4 years ago

That is similar to what Bill had suggested. I agree that annotating body site with microbes would be a good way to model it. This seems like an impossible thing to do by hand. From the cMD_metaphlan microbe list @lgeistlinger provided for me, can we extract body site information attached to that?
So reference and evidence would denote health status, e.g. "Staphylococcus aureus is part of normal skin flora (citation), and is a part of dysbiosis in cancer patients (citation)"?

lgeistlinger commented 4 years ago

Systematically defining sets of microbes associated with specific body sites is something that we should discuss with Nicola and Curtis during my presentation on Oct 01. This is essential for defining an appropriate background for competitive gene set tests, so you can assign this task to me. @lwaldron had suggested obtaining microbe sets for a handful of major body sites represented in cMD, and I think we could use that as a good first approximation.
Almost. For a term "skin" (or "skin microbiome") you would annotate: "Staphylococcus aureus has been detected with 16S metagenomic sequencing [= experimental evidence] in normal skin flora [citation]". If you wanted to model certain dysbiosis / disease states, then these would need to become individual ontology terms that you annotate specific microbes to. E.g., a term "skin dysbiosis in cancer", with an annotation "Staphylococcus aureus has been shown [with evidence type XX] to be a part of dysbiosis in cancer patients [citation]". -- Note: microbial annotations to disease and dysbiosis are collected in BugSigDB. I don't know whether we would eventually like to organize them as an ontology, but we could if we want. To keep things simple in the beginning, however, I think it makes sense to keep this separate, just as you keep the GO and GeneSigDB separate.
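
As a rough illustration of the annotation scheme described above, here is a minimal Python sketch of a GO-style annotation record. The class name, field names, and the `taxa_for_term` helper are assumptions made for this example and do not come from GO, BugSigDB, or any existing tool; the evidence labels and citations are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicrobeAnnotation:
    """One GO-style annotation: a taxon annotated to an ontology term."""
    taxon: str      # e.g. "Staphylococcus aureus"
    term: str       # ontology term, e.g. "skin microbiome"
    evidence: str   # placeholder evidence label, not an official code
    reference: str  # citation placeholder


def taxa_for_term(annotations, term):
    """Return the sorted, de-duplicated taxa annotated to `term`."""
    return sorted({a.taxon for a in annotations if a.term == term})


# Healthy and dysbiosis states are separate terms, each with its own
# evidence and reference, as suggested in the comment above.
annotations = [
    MicrobeAnnotation("Staphylococcus aureus", "skin microbiome",
                      "16S metagenomic sequencing", "[citation]"),
    MicrobeAnnotation("Staphylococcus aureus", "skin dysbiosis in cancer",
                      "evidence type XX", "[citation]"),
]

print(taxa_for_term(annotations, "skin microbiome"))
```

Keeping one record per (taxon, term, evidence, reference) combination means redundant or conflicting annotations can coexist and be filtered later.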

lwaldron commented 4 years ago

I'm thinking about how "normal skin microbiome" or "skin dysbiosis in cancer" will be annotated in the presence of a lot of low-quality evidence. We can utilize cMD, and maybe also QIITA for a lot more 16S data, but will have the usual problem of sensitivity/specificity tradeoff and false positives from individual studies. There are some decisions to be made about how to make those tradeoffs. I suppose that as long as redundant and sometimes conflicting annotations can be made, then these can be used as-is, aggregated, or ignored depending on the analysis they're being used for, and decisions on what evidence types to use can be modified as a sensitivity analysis.

For determining the background for gut microbiome enrichment analysis, do you want to count a species that has been observed once in ten thousand gut specimens? If not, how do we decide on that threshold? Maybe it doesn't matter that much because the collector's curve levels off and provides a clear distinction between the recurring gut microbiome and contamination or freak appearances, but I don't know.
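
To make the threshold question concrete, here is a minimal Python sketch that computes per-taxon prevalence across specimens and keeps only taxa above a cutoff. The function names, the `min_prevalence` parameter, and the toy data are all assumptions for illustration; no particular cutoff is implied.

```python
from collections import Counter


def prevalence(samples):
    """Fraction of specimens in which each taxon was detected.

    `samples` is a list of sets, one set of detected taxa per specimen.
    """
    counts = Counter(taxon for sample in samples for taxon in sample)
    n = len(samples)
    return {taxon: count / n for taxon, count in counts.items()}


def background(samples, min_prevalence):
    """Taxa detected in at least `min_prevalence` of the specimens."""
    return {t for t, p in prevalence(samples).items() if p >= min_prevalence}


# Toy data invented for illustration, not real measurements.
gut_samples = [
    {"taxon A", "taxon B"},
    {"taxon A"},
    {"taxon A", "taxon B"},
    {"taxon A", "taxon C"},
]

print(sorted(background(gut_samples, min_prevalence=0.5)))
```

Rerunning `background` over a range of cutoffs is one simple way to carry out the sensitivity analysis mentioned above.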

wdduncan commented 4 years ago

There are many directions to go here. From a modeling standpoint you can:

Use custom annotations (i.e., annotation properties) to associate a microbe with an anatomical site and/or disease.
Use RDF reification to make meta-statements about microbes.
Use N-ary relations for this.
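
To illustrate how the last two options differ in shape, here is a minimal sketch that builds the triples for each as plain Python tuples. The `rdf:Statement`, `rdf:subject`, `rdf:predicate`, and `rdf:object` terms are the standard RDF reification vocabulary; everything under the `ex:` prefix (property names, class name, identifiers) is hypothetical and chosen only for this example.

```python
def reify(triple, statement_id, **meta):
    """RDF reification: describe a triple as an rdf:Statement resource
    and attach meta-statements (evidence, source, ...) to that resource."""
    subject, predicate, obj = triple
    triples = [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]
    # Each keyword argument becomes a hypothetical ex: property.
    triples += [(statement_id, f"ex:{key}", value) for key, value in meta.items()]
    return triples


def nary(relation_id, **roles):
    """N-ary relation: an intermediate node links all participants."""
    triples = [(relation_id, "rdf:type", "ex:TaxonSiteObservation")]
    triples += [(relation_id, f"ex:{role}", value) for role, value in roles.items()]
    return triples


base = ("ex:Staphylococcus_aureus", "ex:foundIn", "ex:skin")
reified = reify(base, "ex:stmt1", evidence="16S sequencing", source="[citation]")
observation = nary("ex:obs1", taxon="ex:Staphylococcus_aureus",
                   site="ex:skin", state="ex:dysbiosis")

for t in reified + observation:
    print(t)
```

Reification keeps the original binary statement and talks about it, while the N-ary pattern replaces it with a node that can carry any number of participants such as site, health state, and evidence.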

Ontologically, you need to be clear about what you are modelling. Some examples of entities that you seem to want to model are:

Disease: In OHMI, disease is a type of disposition. This is a somewhat idiosyncratic representation of disease, which I think we would normally classify as a process. If you mean to represent the presentation of a disease, you may want to make the distinction between disease-as-disposition and disease-as-process (sometimes called a 'disease course').
Qualities, such as dysbiosis, that characterize the microbiome. I think dysbiosis would be a kind of PATO quality. But I assume there are other qualities of interest.
Articles/information that present evidence for the presence/abundance/lack of some microbe being associated with a disease.
Lab tests and procedures that produce information about the microbiome. Here you'll have inputs and outputs.

Does this help or am I muddying the waters?

kbeckenrode commented 4 years ago

@wdduncan this certainly does help to clearly lay out the options.

From our discussions here and previously, I'm humbled by the complexity of the work, so @lwaldron and I discussed how it would be fruitful to build this ontology in versions. Even a simple ontology in which bacteria are associated with physiologies would be very useful, and we can then build upon these versions. We are thinking:

Version 1: Build relationships between physiologies and taxa
Version 2: Add sample body sites to taxa
Version 3: Add dysbiosis relationships

So, the goal in the very short term is to associate taxa with physiological attributes. And then build from there.

@wdduncan @lgeistlinger Do you have systematic strategies for building relationships between physiological terms and ~1200 taxa? I have a data model with taxa and some physiological properties, so I think this could be useful. https://docs.google.com/spreadsheets/d/1Vp5uVi_WhX-f33sR-I7azWcSnC5bkka75SnRsMfZf3U/edit#gid=1045368624

lgeistlinger commented 4 years ago

I think the versioning makes good sense; the first version alone is not a small task. Also, I wouldn't let perfect be the enemy of good: the GO has kept improving for 20 years and is still far from perfect.

Nevertheless, regarding your question: what can we learn from the publications of the Gene Ontology consortium?

Deriving annotations from high-throughput data / a lot of low-quality evidence is not a new problem, and strategies / solutions for deciding on, e.g., thresholds must have been described in these papers. Those solutions are typically simpler than you might think, and often ad hoc, based on empirical standards (two-fold change, FDR < 0.1, ...).

When it comes to systematic strategies for annotation, there are typically two types: text mining (manual and automated) and data mining (such as from existing 16S and WMS datasets). We are doing both already to some extent (Bergey's, cMD, ...); these are mostly case/resource-driven, and I am not aware of a general recipe for easily scaling this up besides going through it resource by resource.

lgeistlinger commented 4 years ago

This might help for importing annotations organized in a spreadsheet (csv, tsv, ...) to Protege (owl): https://protege.stanford.edu/conference/2009/slides/ImportingDataProtegeConference2009.pdf https://protegewiki.stanford.edu/wiki/Excel_Import
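
Separately from the Protege import tools linked above, here is a minimal sketch of turning a taxon spreadsheet into Turtle text using only the Python standard library. The column names (`taxon`, `body_site`), the `ex:` namespace, and the `ex:body_site` annotation property are assumptions made for this example; the real spreadsheet layout may differ.

```python
import csv
import io

# Hypothetical namespace and annotation property, for illustration only.
HEADER = (
    "@prefix ex: <http://example.org/microbiome#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "\n"
    "ex:body_site a owl:AnnotationProperty .\n"
)


def local_name(label):
    """Turn a label such as 'Escherichia coli' into a safe local name."""
    return "".join(ch if ch.isalnum() else "_" for ch in label.strip())


def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_turtle(csv_text):
    """Emit one owl:Class per row, with the body site as an annotation."""
    lines = [HEADER]
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.append(f"ex:{local_name(row['taxon'])} a owl:Class ;")
        lines.append(f'    rdfs:label "{escape_literal(row["taxon"])}" ;')
        lines.append(f'    ex:body_site "{escape_literal(row["body_site"])}" .')
    return "\n".join(lines)


example_csv = "taxon,body_site\nEscherichia coli,gut\n"
print(rows_to_turtle(example_csv))
```

The resulting text can be saved with a `.ttl` extension; whether it loads cleanly into a given ontology editor should be checked rather than assumed.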

wdduncan commented 4 years ago

@kbeckenrode There are many ways to convert your spreadsheet to OWL. Here are two:

Are you wanting to model:

1. Information (e.g., articles) that is about a taxon?
2. Results of lab assays?
3. The taxa themselves (e.g., add annotations to the taxon classes)?

Each has its advantages/disadvantages. In the information model (#1), you say that such-and-such article found some taxon to be in a particular anatomical site and the microbiome was in a state of dysbiosis.

In the assay model (#2), you can represent a particular sample as having been collected from a particular body site, along with the pertinent characteristics of the sample and site.

The taxon model (#3) is probably the easiest, but you may end up making strong statements that aren't always true of the taxon.
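
To show how the three models differ in what they record, here is a minimal Python sketch. All class names, field names, and the example values are hypothetical; this is one possible reading of the three options, not an established schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleFinding:
    """Information model (#1): what an article reports about a taxon."""
    article: str
    taxon: str
    body_site: str
    dysbiosis: bool


@dataclass(frozen=True)
class AssayResult:
    """Assay model (#2): what was measured in a particular sample."""
    sample_id: str
    body_site: str
    taxon: str
    detected: bool


def to_taxon_annotations(findings):
    """Taxon model (#3): collapse findings into taxon -> set of body sites.

    The source article and health state are dropped here, which is why
    the result can read as a stronger claim than the evidence supports.
    """
    sites = {}
    for finding in findings:
        sites.setdefault(finding.taxon, set()).add(finding.body_site)
    return sites


# Placeholder records for illustration only.
findings = [
    ArticleFinding("[citation 1]", "Staphylococcus aureus", "skin", False),
    ArticleFinding("[citation 2]", "Staphylococcus aureus", "skin", True),
]

print(to_taxon_annotations(findings))
```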

waldronlab / MicrobiomeOntology #8