Open kbeckenrode opened 4 years ago
I would model this again after the GO cellular component ontology.
If modeled accordingly,
Systematically defining sets of microbes associated with specific body sites is something that we should discuss with Nicola and Curtis during my presentation on Oct 01. This is essential for defining an appropriate background for competitive gene set tests and you can thus assign this task to me. @lwaldron had suggested to obtain microbe sets for a handful of major body sites represented in cMD, and I think we could use that as a good first approximation.
Almost. For a term "skin" (or "skin microbiome") you would annotate: "Staphylococcus aureus has been detected with 16S metagenomic sequencing [= experimental evidence] in normal skin flora [citation]". If you wanted to model certain dysbiosis / disease states, then these would need to become individual ontology terms that you annotate specific microbes to. E.g. a term "skin dysbiosis in cancer", with an annotation "Staphylococcus aureus has been shown [with evidence type XX] to be a part of dysbiosis in cancer patients [citation]". -- Note: microbial annotations to disease and dysbiosis are collected in BugSigDB. I don't know whether we would eventually like to organize them as an ontology, but we could if we want. To keep things simple in the beginning, however, I think it makes sense to keep this separate, just as you keep the GO and GeneSigDB separate.
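A minimal sketch of the annotation structure described above, assuming one flat record per annotation. The class name `Annotation`, its field names, and the placeholder evidence and citation strings are hypothetical, chosen only for illustration; they are not taken from GO or BugSigDB.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One taxon-to-term annotation, loosely analogous to a GO annotation."""
    taxon: str     # annotated microbe
    term: str      # ontology term, e.g. a body-site microbiome
    evidence: str  # evidence type (hypothetical label)
    citation: str  # placeholder; a real record would hold a PMID or DOI


# Example records mirroring the two annotations in the comment above.
annotations = [
    Annotation("Staphylococcus aureus", "skin microbiome",
               "16S metagenomic sequencing", "citation-placeholder-1"),
    Annotation("Staphylococcus aureus", "skin dysbiosis in cancer",
               "evidence-type-XX", "citation-placeholder-2"),
]


def taxa_by_term(records):
    """Group annotated taxa into one microbe set per ontology term."""
    sets = defaultdict(set)
    for rec in records:
        sets[rec.term].add(rec.taxon)
    return dict(sets)


microbe_sets = taxa_by_term(annotations)
print(microbe_sets["skin microbiome"])
```

Keeping dysbiosis terms as separate entries in the same structure means they can be dropped or kept in a separate resource without touching the body-site terms.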
I'm thinking about how "normal skin microbiome" or "skin dysbiosis in cancer" will be annotated in the presence of a lot of low-quality evidence. We can utilize cMD, and maybe also QIITA for a lot more 16S data, but will have the usual problem of sensitivity/specificity tradeoff and false positives from individual studies. There are some decisions to be made about how to make those tradeoffs. I suppose that as long as redundant and sometimes conflicting annotations can be made, then these can be used as-is, aggregated, or ignored depending on the analysis they're being used for, and decisions on what evidence types to use can be varied as a sensitivity analysis.
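One way to picture "used as-is, aggregated, or ignored" is a filter over raw annotation records. This is only a sketch: the tuple layout, the function name `supported_taxa`, the `min_studies` parameter, and all example values are assumptions made for illustration.

```python
from collections import Counter

# (taxon, term, evidence_type, study_id): all values are made-up examples.
records = [
    ("Taxon A", "normal skin microbiome", "16S", "study1"),
    ("Taxon A", "normal skin microbiome", "WMS", "study2"),
    ("Taxon B", "normal skin microbiome", "16S", "study1"),
    ("Taxon B", "normal skin microbiome", "16S", "study1"),  # redundant record
]


def supported_taxa(records, term, allowed_evidence, min_studies=1):
    """Return taxa annotated to `term` by at least `min_studies` distinct
    studies, counting only records with an allowed evidence type."""
    # Deduplicate so a study that reports the same taxon twice counts once.
    seen = {(taxon, study) for taxon, t, ev, study in records
            if t == term and ev in allowed_evidence}
    counts = Counter(taxon for taxon, _ in seen)
    return {taxon for taxon, n in counts.items() if n >= min_studies}


term = "normal skin microbiome"
lenient = supported_taxa(records, term, {"16S", "WMS"})
strict = supported_taxa(records, term, {"16S", "WMS"}, min_studies=2)
print(lenient, strict)
```

Rerunning an analysis with different `allowed_evidence` and `min_studies` settings is the sensitivity analysis mentioned above.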
For determining the background for gut microbiome enrichment analysis, do you want to count a species that has been observed once in ten thousand gut specimens? If not, how do we decide on that threshold? Maybe it doesn't matter that much because the collector's curve levels off and provides a clear distinction between the recurring gut microbiome and contamination or freak appearances, but I don't know.
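The threshold question can be made concrete with a small sketch. It assumes each specimen is represented as a set of detected taxa; the function names, the `min_prevalence` parameter, and the toy data are hypothetical, and the cutoff value itself remains the open decision.

```python
def prevalence(samples):
    """Fraction of samples in which each taxon was observed."""
    counts = {}
    for taxa in samples:
        for taxon in set(taxa):
            counts[taxon] = counts.get(taxon, 0) + 1
    return {taxon: n / len(samples) for taxon, n in counts.items()}


def background(samples, min_prevalence):
    """Taxa observed in at least `min_prevalence` of the samples."""
    return {taxon for taxon, p in prevalence(samples).items()
            if p >= min_prevalence}


def collectors_curve(samples):
    """Cumulative number of distinct taxa after each sample, in given order."""
    seen, curve = set(), []
    for taxa in samples:
        seen.update(taxa)
        curve.append(len(seen))
    return curve


# Toy data: four specimens, each a set of detected taxa.
gut_samples = [
    {"Taxon A", "Taxon B"},
    {"Taxon A", "Taxon B"},
    {"Taxon A", "Taxon C"},
    {"Taxon A", "Taxon B"},
]

print(background(gut_samples, min_prevalence=0.5))
print(collectors_curve(gut_samples))
```

With real data the curve depends on sample order, so it would typically be averaged over random permutations before judging where it levels off.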
There are many directions to go here. From a modeling standpoint you can:
Ontologically, you need to be clear about what you are modelling. Some examples of entities that you seem to want to model are:
Does this help or am I muddying the waters?
@wdduncan this certainly does help to clearly lay out the options.
From our discussions here and previously, I'm humbled by the complexity of the work, so @lwaldron and I discussed how it would be fruitful to build this ontology in versions. Even a simple ontology where we have bacteria associated with physiologies would be very useful. And then we can build upon these versions. We are thinking:
So, the goal in the very short term is to associate taxa with physiological attributes. And then build from there.
@wdduncan @lgeistlinger Do you have systematic strategies for building relationships with physiological terms together with ~1200 taxa? I have a data model with taxa and some physiological properties, so I think this could be useful. https://docs.google.com/spreadsheets/d/1Vp5uVi_WhX-f33sR-I7azWcSnC5bkka75SnRsMfZf3U/edit#gid=1045368624
I think the versioning makes good sense - already the first version is not a small task. Also, I wouldn't let perfect be the enemy of good: the GO has kept improving for 20 years and is still far from perfect.
Nevertheless, regarding your question: what can we learn from the publications of the Gene Ontology consortium?
Deriving annotations from high-throughput data / a lot of low-quality evidence is not a new problem, and strategies / solutions for deciding on e.g. thresholds must have been described in these papers. Those solutions are typically simpler than you might think, and often ad hoc, based on empirical standards (two-fold change, FDR < 0.1, ...).
When it comes to systematic strategies for annotation there are typically two types: text mining (manual and automated) and data mining (such as from existing 16S and WMS datasets). We are doing both already to some extent (Bergey's, cMD, ...). These efforts are mostly case/resource-driven, and I am not aware of a general recipe for easily scaling this up besides going through it resource by resource.
This might help for importing annotations organized in a spreadsheet (csv, tsv, ...) to Protege (owl): https://protege.stanford.edu/conference/2009/slides/ImportingDataProtegeConference2009.pdf https://protegewiki.stanford.edu/wiki/Excel_Import
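As a plain-text alternative to the import tools linked above, a spreadsheet exported to CSV can also be turned into Turtle that Protege can open. This is only a sketch, not how those tools work: the `ex:` namespace, the `ex:physiology` annotation property, the column names `taxon` and `physiology`, and the sample row are all assumptions for illustration.

```python
import csv
import io

PREFIXES = (
    "@prefix ex: <http://example.org/microbe#> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)


def local_name(label):
    """Turn a label into a safe local name (non-alphanumerics become '_')."""
    return "".join(ch if ch.isalnum() else "_" for ch in label.strip())


def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def csv_to_turtle(csv_text):
    """Emit one OWL class per row, with a label and one annotation value."""
    blocks = [PREFIXES, "ex:physiology a owl:AnnotationProperty .\n"]
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = local_name(row["taxon"])
        blocks.append(
            f'ex:{name} a owl:Class ;\n'
            f'    rdfs:label "{escape(row["taxon"])}" ;\n'
            f'    ex:physiology "{escape(row["physiology"])}" .\n'
        )
    return "\n".join(blocks)


# Hypothetical two-column export of the spreadsheet.
SAMPLE = "taxon,physiology\nEscherichia coli,facultative anaerobe\n"
turtle = csv_to_turtle(SAMPLE)
print(turtle)
```

Storing the physiology as an annotation value keeps the first version free of logical commitments; it could later be replaced by links to proper ontology terms.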
@kbeckenrode There are many ways to convert your spreadsheet to OWL. Here are two:
Are you wanting to model:
Both have their advantages/disadvantages. In the information model (#1), you say that such-and-such article found some taxon to be in a particular anatomical site and the microbiome was in a state of dysbiosis.
In the assay model (#2), you can represent a particular sample as having been collected from a particular body site, along with the pertinent characteristics of the sample and site.
The taxon model (#3) is probably the easiest, but you may end up making strong statements that aren't always true of the taxon.
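The difference between the three models can be sketched as three record types. The class and field names are hypothetical and the values are placeholders; the point is only which information each model carries.

```python
from dataclasses import dataclass, field


@dataclass
class ArticleFinding:
    """Information model (#1): what an article reports."""
    article: str           # placeholder identifier
    taxon: str
    body_site: str
    microbiome_state: str  # e.g. "healthy" or "dysbiosis"


@dataclass
class Sample:
    """Assay model (#2): one sample collected from a body site."""
    sample_id: str
    body_site: str
    characteristics: dict = field(default_factory=dict)
    taxa_detected: set = field(default_factory=set)


@dataclass(frozen=True)
class TaxonSite:
    """Taxon model (#3): a direct statement about the taxon itself."""
    taxon: str
    body_site: str


def to_taxon_model(findings):
    """Collapse article-level findings into direct taxon-site statements.

    The article and the microbiome state are dropped, which is why the
    taxon model can assert more than any single source supports.
    """
    return {TaxonSite(f.taxon, f.body_site) for f in findings}


findings = [
    ArticleFinding("article-1", "Taxon A", "skin", "healthy"),
    ArticleFinding("article-2", "Taxon A", "skin", "dysbiosis"),
]
collapsed = to_taxon_model(findings)
print(collapsed)
```

Here two findings with different microbiome states collapse into a single unqualified statement, which illustrates the "strong statements" caveat for the taxon model.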
@wdduncan @lwaldron @lgeistlinger @cmungall
From our conversation on Thursday (thank you btw @wdduncan, it was really helpful), we discussed how to structure the relationship between the microbiome, taxa, body site, and health/dysbiosis. Here is a summary of the main discussion:
Do we add subclasses of microbiomes based on body site, with taxa as children of each body-site microbiome ('E. coli' is contained in the 'skin microbiome', which is a part of the 'human microbiome')? Or do we list all taxa and add a body site in the annotation properties as a comment? The latter option allows for more logical flexibility.
Even in healthy people, not all bacteria are present in everyone's microbiome. There needs to be flexibility in the logic when building the relationship between taxa and their body site location. This relationship is complicated even further by dysbiosis (an impaired microbiome, defined by an imbalance in the abundance and composition of the population). How do we build relationships between bacteria associated with health and/or dysbiosis?
Thanks for your thoughts :)