Taxon pages (was: Microbial annotations)

lwaldron commented 4 years ago

How do we begin annotating the microbial taxonomy with morphological and physiological properties? For example:

We would want to be able to annotate individual taxa or hierarchical clades of the taxonomy. This is the high-priority item.

Eventually we would want to be able to export and analyze these like signatures, at any level of the taxonomy. This is lower-priority.

tosfos commented 4 years ago

There's some tension here. On the one hand, we want to tie our data to NCBI. On the other hand, we want to be flexible enough to be able to add info for taxa that they don't know about (I think, right). We'll have to find a delicate middle, so this has to be done carefully.

Is any of this information already stored somewhere like NCBI?

I'm confused regarding annotating "hierarchical clades". Isn't it only individual microbes that have these properties? Are we also tracking properties that are presented by a group of microbes together? And if a group has a certain property, does that mean that they only present that property when they are combined or that the group has the property but we haven't determined which particular organism is causing it?

How does this data relate to a Signature?

My assumption right now is that we would tag an individual taxon with a certain property. If we tag a Genus, that would mean that to the best of our knowledge everything belonging to that Genus also has that attribute. If we would end up discovering that a certain Species in that Genus does not have it, we would remove the tag from the Genus and apply it to only the Species that do have it. Please correct me if I'm missing something.

lwaldron commented 4 years ago

There's some tension here.

Right. There's also the possibility of additions and corrections to the taxonomy in future versions of NCBI.

In the current "signatures", when a genus (or higher-order level of the taxonomy) is there, it means that genus was observed as a whole to be differentially abundant (ie the sum of all its members), but not necessarily all of the species individually. In fact, it could be just one species of the genus driving the differential abundance, but we don't know that.

annotating "hierarchical clades"

You're right that it is individual species that have the properties, but sometimes these properties are shared by all members even of a phylum (e.g. the Synergistetes phylum are all obligate anaerobes, Gram-negative staining, and have rod/vibrioid cell shape). It would be useful to annotate this phylum, as well as all taxonomic levels below it, just by annotating the phylum.

How does this data relate to a Signature?

As an example use case, we found that Tobacco exposure is associated with oral microbiota oxygen utilization from a dataset that provided only genus-level resolution, by creating signatures of aerobic and anaerobic genera, and finding that the aerobic genera were less abundant in the oral microbiomes of smokers and anaerobic genera were more abundant. Given a higher-resolution dataset, we could repeat this analysis at the species level. Creating "signatures" at different levels of the taxonomy based on these annotations is lower priority and could be a special data export procedure.

My assumption right now is that we would tag an individual taxon with a certain property. If we tag a Genus, that would mean that to the best of our knowledge everything belonging to that Genus also has that attribute. If we would end up discovering that a certain Species in that Genus does not have it, we would remove the tag from the Genus and apply it to only the Species that do have it. Please correct me if I'm missing something.

That makes sense.

Is any of this information already stored somewhere like NCBI?

I need to research this. NCBI may have some, but for sure this is going to be a big, ongoing curation job involving pulling different pieces of information from multiple sources.

tosfos commented 4 years ago

In the current "signatures", when a genus (or higher-order level of the taxonomy) is there, it means that genus was observed as a whole to be differentially abundant (ie the sum of all its members), but not necessarily all of the species individually.

Would it make more sense to store a score indicating a level of differential abundance instead of just a simple yes/no?

lwaldron commented 4 years ago

Would it make more sense to store a score indicating a level of differential abundance instead of just a simple yes/no?

The kind and reporting of such scores is so inconsistent that we just record yes/no. This has precedent in MSigDB and GeneSigDB, and the statistical methods for gene set enrichment analysis only employ presence/absence in the signatures, so I thought it would be too much additional curation work without clear benefit to record actual scores like p-values, fold-change, log fold-change, or LDA scores. As a compromise, we instead record the thresholds used (p-value/q-value, LDA score) in the "Experiment".

However, if the option were there in the signatures data structure, such information could be easy to enter as part of a bulk signature entry for new experiments in the future (as opposed to those extracted from published literature), and maybe then we or someone would find a use for them.

tosfos commented 4 years ago

It would be practical to annotate this phylum, as well as all taxonomic levels below it, with these properties.

Did you mean to say "impractical" here?

tosfos commented 4 years ago

As an example use case, we found that Tobacco exposure is associated with oral microbiota oxygen utilization from a dataset that provided only genus-level resolution, by creating signatures of aerobic and anaerobic genera, and finding that the aerobic genera were less abundant in the oral microbiomes of smokers and anaerobic genera were more abundant. Given a higher-resolution dataset, we could repeat this analysis at the species level. Creating "signatures" at different levels of the taxonomy based on these annotations is lower priority and could be a special data export procedure.

I'm assuming that what we'll need to do is create a completely separate Taxon (is that the best title?) data structure for each taxon (not every taxon in existence, just the ones we have data on). That will:

Store and display properties about that taxon which are differentially abundant
Store and display the source data for how we know that information.
Query for (display) its taxonomical hierarchy based on the NCBI data we already have.
(I assume) Query for and display all the Studies that have this taxon in one of its Signatures.

Please let me know what you think about this. Also if there is a set list of properties that will be allowed or suggested, please send it. Also, this could get a bit complex. I'm assuming every microbe is either aerobic or anaerobic, so maybe this should be more than just a free-form list of properties.

What I'm wondering more is how we are going to show this data on a Study/Experiment/Signature page, if at all. I'm assuming that each NCBI item in each Signature should link to a Taxon page with information about that taxon, and maybe we should even remove the tooltip if we're doing that. Please advise.

Also, do we want the Signature page to actually do data analysis automatically? Like if, say, 75% of its taxons are aerobic, should it say that somewhere?

tosfos commented 4 years ago

As a compromise, we instead record the thresholds used (p-value/q-value, LDA score) in the "Experiment".

However, if the option were there in the signatures data structure, such information could be easy to enter as part of a bulk signature entry for new experiments in the future (as opposed to those extracted from published literature), and maybe then we or someone would find a use for them.

I'm confused about this. The morphological and physiological properties are not related to a Signature but rather an individual Taxon. Are you suggesting that we'll store this information about a Signature - as in "this Signature is differentially abundant regarding X"? Would it make more sense to store that information on the individual taxons' pages and then automatically do a data analysis as I mentioned in the previous comment?

lwaldron commented 4 years ago

It would be practical to annotate this phylum, as well as all taxonomic levels below it, with these properties.

Did you mean to say "impractical" here?

It was a vague sentence, so I re-worded it to "It would be useful to annotate this phylum, as well as all taxonomic levels below it, just by annotating the phylum." Yes, it would be impractical to manually annotate every individual species for a property that is shared by an entire phylum (or even by most of the phylum, as in the example you gave where an exceptional species needs to be corrected).

I'm assuming that what we'll need to do is create a completely separate Taxon (is that the best title?) data structure for each taxon (not every taxon in existence, just the ones we have data on). That will:

Store and display properties about that taxon which are differentially abundant

Store and display the source data for how we know that information.

Query for (display) its taxonomical hierarchy based on the NCBI data we already have.

(I assume) Query for and display all the Studies that have this taxon in one of its Signatures.

This makes sense, and yes, taxon is probably the right title. Additionally, it would 5. Store and display physiological and morphological properties of that taxon, along with 6. how we know that information. Item 6 would not necessarily come from studies/experiments, unless we repurposed how those are used, because information like oxygen utilization will be compiled manually from sources like Bergey's Manual and in bulk from the Microbe Directory (paper, web site, and example of a species Enterococcus pallens) and the Microbial Life Database.

Please let me know what you think about this. Also if there is a set list of properties that will be allowed or suggested, please send it. Also, this could get a bit complex. I'm assuming every microbe is either aerobic or anaerobic, so maybe this should be more than just a free-form list of properties.

You're right, there are a limited number of possible options here. I had imagined a curator sitting down to identify all aerobic taxa, then annotating all anaerobic taxa, then annotating all facultative taxa, etc. These physiological and morphological annotations would be selected from an existing ontology.

What I'm wondering more is how we are going to show this data on a Study/Experiment/Signature page, if at all. I'm assuming that each NCBI item in each Signature should link to a Taxon page with information about that taxon, and maybe we should even remove the tooltip if we're doing that. Please advise.

Just linking to a Taxon page seems like the right thing to me, maybe even instead of linking to NCBI (with NCBI link on the Taxon page instead). I have a hard time imagining filling up the Study/Experiment/Signature with information about individual taxa, other than possibly taxonomic level.

Also, do we want the Signature page to actually do data analysis automatically? Like if, say, 75% of its taxons are aerobic, should it say that somewhere?

Down the road, this would be great - an automatic identification of other signatures with high overlap, and of enrichment of aerobic taxa for example. These would probably involve setting a threshold of enrichment and displaying taxonomic properties and signatures that meet that threshold (for example, this signature is enriched for aerobic taxa, and has high similarity to these other signatures).
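As a rough illustration of the "high overlap" idea, here is a minimal sketch that scores signatures against each other with Jaccard similarity. The choice of Jaccard, the 0.5 threshold, the function names, and the representation of a signature as a set of taxon IDs are all assumptions for illustration; nothing here reflects how the wiki actually stores signatures.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity: shared taxa divided by all taxa in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_signatures(
    query: Set[int], others: Dict[str, Set[int]], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best first.

    The threshold is a hypothetical cut-off that a curator would tune.
    """
    scored = [(name, jaccard(query, taxa)) for name, taxa in others.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Made-up taxon IDs, for illustration only.
query_signature = {1, 2, 3, 4}
other_signatures = {"Signature A": {1, 2, 3}, "Signature B": {9}}
print(similar_signatures(query_signature, other_signatures))
```

Because the score depends only on which taxa are present, it fits the yes/no signature model discussed earlier in the thread.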

However, if the option were there in the signatures data structure, such information could be easy to enter as part of a bulk signature entry for new experiments in the future (as opposed to those extracted from published literature), and maybe then we or someone would find a use for them.

I'm confused about this. The morphological and physiological properties are not related to a Signature but rather an individual Taxon.

I'm a little confused about what you have in mind too. For the moment let's drop the idea of storing differential abundance scores, since we don't have them in our current spreadsheet and aren't sure how to use them anyways. Then my above comment is moot. But there may be additional confusion, because while you are correct that "morphological and physiological properties are not related to a Signature but rather an individual Taxon", and it makes sense to represent them as such in the data structure (and not as signatures), I do envision assembling groups of taxa that share a particular morphological or physiological property, in effect creating a "signature" of like taxa for the purpose of data analysis.

Are you suggesting that we'll store this information about a Signature - as in "this Signature is differentially abundant regarding X"?

"Differentially abundant" isn't the right terminology here, it would be "enriched", ie this group of taxa are enriched for aerobic genera, or this group of taxa are enriched for taxa from another signature. Meaning (for example) the fraction of aerobic genera is significantly greater than the fraction of aerobic genera overall, or the fraction of taxa in common with Signature 23 is greater than we would expect from a random sample of taxa. This isn't a top priority, but providing some automatic analyses like this will provide a reward to other researchers to enter their data into the wiki.
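One common way to make "significantly greater than the fraction overall" concrete is a one-sided hypergeometric test. The sketch below assumes that test and uses made-up counts; the function name and parameters are hypothetical, and the choice of background (for example, all taxa found at the same body site) is left to the caller.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: number of taxa in the background
    K: background taxa that have the property (e.g. aerobic)
    n: number of taxa in the signature
    k: signature taxa that have the property
    """
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Made-up counts: 10 background taxa, 5 aerobic; a signature of 2 taxa, both aerobic.
print(enrichment_p_value(k=2, n=2, K=5, N=10))
```

The same function would apply to overlap with another signature: treat membership in that signature as the "property".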

Would it make more sense to store that information on the individual taxons' pages and then automatically do a data analysis as I mentioned in the previous comment?

You mean the information "X", like whether it is aerobic? Yes, I think it makes sense to store that on the individual taxons' pages and then automatically do a data analysis. These will be lightweight analyses, and any meaningful amount of time they take would just be for any data aggregation that might need to take place (for example, whether a signature is statistically significantly enriched for aerobic taxa requires knowing which of those taxa are aerobic, and a list of all taxa known to be found in the same body site). A compromise could be made by instead only showing fractions of aerobic (etc) taxa and highly-overlapping signatures, without using any statistical test that requires accounting for the taxa not present.

But to just directly answer your question, an automatic data analysis sounds more desirable than storing these results as information about a signature. Especially since the analysis should reflect the up-to-date data in the wiki.
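The fractions-only compromise needs nothing beyond the signature's own taxa and their annotations. A minimal sketch, assuming annotations arrive as a dictionary from taxon ID to a single value (the function name, the "unknown" label, and the IDs are illustrative, not part of the wiki):

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def property_fractions(
    taxa: Iterable[int], annotations: Dict[int, Optional[str]]
) -> Dict[str, float]:
    """Fraction of a signature's taxa carrying each annotated value.

    Taxa without an annotation are counted under "unknown" so that the
    fractions always sum to 1.
    """
    values = [annotations.get(taxon) or "unknown" for taxon in taxa]
    total = len(values)
    if total == 0:
        return {}
    return {value: count / total for value, count in Counter(values).items()}


# Made-up example matching the "75% aerobic" scenario raised earlier.
signature = [1, 2, 3, 4]
oxygen = {1: "aerobic", 2: "aerobic", 3: "aerobic", 4: "anaerobic"}
print(property_fractions(signature, oxygen))
```

Since this is computed on demand from the current annotations, the displayed numbers would stay in step with the wiki as curation proceeds.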

tosfos commented 4 years ago

Additionally, it would 5. Store and display physiological and morphological properties of that taxon, along with 6. how we know that information.

This is exactly what I meant with my items 1 & 2. Please explain.

tosfos commented 4 years ago

I had imagined a curator sitting down to identify all aerobic taxa, then annotating all anaerobic taxa, then annotating all facultative taxa, etc. These physiological and morphological annotations would be selected from an existing ontology.

So then it might make sense to have individual fields for some of these properties. Like instead of allowing tagging a taxon as "aerobic", etc, would it be better to separate these into fields? Like have a dropdown for selecting aerobic or anaerobic, and maybe some other fields that are boolean/binary choices. It wouldn't make so much sense to allow, for example, tagging a taxon as both aerobic and anaerobic.

tosfos commented 4 years ago

I do envision assembling groups of taxa that share a particular morphological or physiological property, in effect creating a "signature" of like taxa for the purpose of data analysis.

I don't think this affects the Taxon data structure. If this assembly is done as a full-fledged Study then it will just use that data model. If not, we'll probably need to create a separate data model for this. In truth, it seems like this is not something that would be stored as data at all, but rather queried based on the Taxon structure we're creating.

tosfos commented 4 years ago

I think we're mostly on the same page here. The main question will be budgeting to fill out all these features, but we can at least begin by creating the Taxon data structure, template and form. And then we will sort-of link the current NCBI field to this new structure.

If you have a spreadsheet of the data that will be stored in the new data structure, please send it or create it. Also, please send any ontologies we'll be following.

lwaldron commented 4 years ago

Additionally, it would 5. Store and display physiological and morphological properties of that taxon, along with 6. how we know that information.

This is exactly what I meant with my items 1 & 2. Please explain.

I was distinguishing between differential abundance in 1 & 2 (something we know because the Taxon appears in a Signature) and physiological or morphological properties in 5 & 6 (something that the Taxon has been annotated with individually or by inheritance).

Will write more on your other questions later today. We don't have much Taxon information right now, but we can put together a spreadsheet of at least some exemplary data to work with.

tosfos commented 4 years ago

Got it. So it seems like both the differential abundance and the physiological or morphological properties will be applied to a certain Taxon and it will be assumed that all members of that Taxon inherit that property. Is that correct?

lwaldron commented 4 years ago

Got it. So it seems like both the differential abundance and the physiological or morphological properties will be applied to a certain Taxon and it will be assumed that all members of that Taxon inherit that property. Is that correct?

For physiological and morphological properties that is the assumption, but for differential abundance signatures it is not. Experimental differential abundance of a genus does not imply that all species of that genus are also differentially abundant under the given group 0 - group 1 contrast.

lwaldron commented 4 years ago

Here is a demo section of the Bergey's reference manual on the Firmicutes phylum, to provide some more detailed background on the kind of information we're talking about. The grant proposes to have a PhD student in Curtis Huttenhower's group at Harvard working to curate/systematize this information for 4 years starting in Y2.

I've also been organizing other potential sources of lower-hanging fruit on this site's wiki. Here's a snapshot of what's currently there. I think we can use the Microbe Directory or Microbial fatty acid compositions for initial data to work with. I think the only thing we need to do is make a spreadsheet that 1) uses NCBI taxID, and 2) maps properties to an appropriate ontology.

Microbe Directory web site, data source on GitHub (has sql, csv, and json), and publication. Uses MetaPhlan2 names, need to map to NCBI taxID.
Microbial fatty acid compositions, provides two csv files using NCBI taxID.
Microbial Life Database - site is mostly broken, sent request for annotation data (Levi, March 26)
Microbe Wiki (MediaWiki site). See its taxonomy index - does not seem to be semantic, so I'm not sure there's anything we can use in bulk download.
PATRIC has a huge amount of information derived from genome analysis, such as metabolic products, metabolic pathways, and "specialty genes". For example, pathways associated with Fusobacterium nucleatum. Something to think about the feasibility of importing down the road. Organized by NCBI taxID.
Ludwig's curation efforts and README, some of which are listed above.
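For sources that use MetaPhlAn-style names, the first step toward NCBI taxIDs is splitting each lineage string into ranked names. The sketch below assumes the usual `k__...|p__...` layout with `|` between ranks and `_` standing in for spaces; the name-to-taxID table is a placeholder that would have to be built from an NCBI taxonomy dump, and the ID in it is made up.

```python
from typing import Dict, List, Optional, Tuple

# Rank prefixes used in MetaPhlAn-style lineage strings.
RANKS = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_lineage(lineage: str) -> List[Tuple[str, str]]:
    """Split 'k__Bacteria|p__Firmicutes|...' into (rank, name) pairs."""
    parsed = []
    for part in lineage.split("|"):
        prefix, separator, name = part.partition("__")
        if not separator or prefix not in RANKS:
            raise ValueError(f"unrecognised lineage element: {part!r}")
        parsed.append((RANKS[prefix], name.replace("_", " ")))
    return parsed


def lookup_taxid(lineage: str, name_to_taxid: Dict[str, int]) -> Optional[int]:
    """Look up the most specific name in a caller-supplied name table."""
    _rank, name = parse_lineage(lineage)[-1]
    return name_to_taxid.get(name)


# Hypothetical lookup table with a made-up ID, for illustration only.
table = {"Enterococcus pallens": 1}
print(lookup_taxid("k__Bacteria|g__Enterococcus|s__Enterococcus_pallens", table))
```

Names that return `None` would be the ones left for manual curation.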

tosfos commented 4 years ago

Hmmmm. I thought I had commented here but I don't see it.

I think we should start with this: https://github.com/microbe-directory/microbe-directory/blob/master/data/microbe-directory.csv

We'd just use the Species column and remove Genus..Kingdom. We don't need the NCBI ID to start, though it would be nice. We could always add it later. Please check the columns in that CSV. Are these sufficient? Or are there additional properties that you would like to store too?

I think it is worth dropping differential abundance for now.

tosfos commented 4 years ago

Actually I was able to match most of these to NCBI IDs

tosfos commented 4 years ago

The Microbe directory has 7678 Species. Do we need a subset only? Only Bacteria?

lwaldron commented 4 years ago

Hmmmm. I thought I had commented here but I don't see it.

I think we should start with this: https://github.com/microbe-directory/microbe-directory/blob/master/data/microbe-directory.csv

We'd just use the Species column and remove Genus..Kingdom. We don't need the NCBI ID to start, though it would be nice. We could always add it later. Please check the columns in that CSV. Are these sufficient? Or are there additional properties that you would like to store too?

I think it is worth dropping differential abundance for now.

I agree the Microbe Directory seems like a good starting point, with a good variety of data. I think we'll gain stability and up-front correctness in the long run from using an ontology for our terminology - prokaryotic quality from the Ontology of Prokaryotic Phenotypic and Metabolic Characters looks like a good choice. We could provide you with a flat list of terms, and translate the Microbe Directory columns to use this ontology?

The Microbe directory has 7678 Species. Do we need a subset only? Only Bacteria?

Our current work is focused on Bacteria and that's where the best microbiome measurement technology is currently, but in the future I'll be inclined to keep all kingdoms because there is microbiome research being done focusing on viruses and fungi (eukaryotes) too.

tosfos commented 4 years ago

microbe-directory.minimal.ncbi.zip I attached the work we performed so far on the Microbe Directory CSV. Column C is for the Page title, which will be important. It should come from either Column D (what NCBI calls this species) or Column E (what the Microbe Directory calls this species.) Also you may want to fill in any missing NCBI IDs that we were not able to automatically look up.

lwaldron commented 4 years ago

@kbeckenrode have you looked at this yet? I'd like to discuss before we do a full switch-over to the wiki.

kbeckenrode commented 4 years ago

@lwaldron yes, and you read my mind because I'm ready to discuss how to add attributes to the wiki.

kbeckenrode commented 4 years ago

Hi @tosfos @lwaldron

This is a toy model spreadsheet that I've made using the Microbe Directory V2. I described one physiology: human associated or not. There are 5 columns:

taxon (all defined to species level, but this will not be the case moving forward. Formatted using the metaphlan format)
attribute (physiology being described)
value (TRUE/FALSE)
context (Empty for now, but will ultimately allow for relevant context-dependent biological information)
source (citation)

There are 2,743 species added in the sheet. 770 are human associated and the other 1,973 are not. Tried to keep this simple to start. I want to add physiologies that have varied taxon inheritance, but that will come a bit later.

microbe_phys_DB.xlsx

lwaldron commented 4 years ago

Very cool @kbeckenrode ! @tosfos - @kbeckenrode @lgeistlinger and I met and agreed this data model and import format should be adequately flexible for any taxonomic characteristic we'll want to curate.

kbeckenrode commented 4 years ago

@tosfos @lwaldron @lgeistlinger

You can go ahead and disregard the previous spreadsheet I shared. I made some big adjustments that will be more representative of the microbe physiology database going forward.

Here we have three physiologies: human associated, Gram stain and oxygen utilization. Most importantly, I added physiologies at different taxon levels (from phylum to species). A taxon described at the phylum level can have all taxa below it inherit that physiology.

taxon (Formatted using the metaphlan format)
attribute (physiology being described)
value (there are a variety of types of values, like TRUE/FALSE, and descriptive text)
context (There are a few examples of context-dependent information and exceptions)
source (citation)

MBphys_20200616.xlsx

tosfos commented 4 years ago

If a Taxon has an attribute we'll assume that all its descendants have the same attribute unless indicated otherwise by the child Taxon.
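The rule above amounts to walking up the taxonomy until some ancestor has a value. A minimal sketch, assuming the taxonomy is available as a child-to-parent mapping and annotations as nested dictionaries (both structures, the function name, and the IDs are illustrative assumptions):

```python
from typing import Dict, Optional


def resolve_attribute(
    taxon: int,
    attribute: str,
    parent_of: Dict[int, int],
    annotations: Dict[int, Dict[str, str]],
) -> Optional[str]:
    """Return the nearest value for `attribute`, starting at `taxon`.

    A value set directly on a taxon overrides anything inherited from its
    ancestors. Assumes `parent_of` describes a tree (no cycles).
    """
    current: Optional[int] = taxon
    while current is not None:
        value = annotations.get(current, {}).get(attribute)
        if value is not None:
            return value
        current = parent_of.get(current)
    return None


# Made-up IDs: 1 = a phylum, 2 = a genus within it, 3 = an exceptional species.
parent_of = {3: 2, 2: 1}
annotations = {
    1: {"oxygen utilization": "anaerobic"},
    3: {"oxygen utilization": "aerobic"},
}
print(resolve_attribute(2, "oxygen utilization", parent_of, annotations))
print(resolve_attribute(3, "oxygen utilization", parent_of, annotations))
```

With this approach an exceptional species only needs its own annotation; the parent's annotation does not have to be removed.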

tosfos commented 4 years ago

@kbeckenrode Should we support multiple context notes, each one with a different source?

tosfos commented 4 years ago

For row 66:

_k_Bacteria_p_Gemmatimonadetes | Oxygen utilization | Both

What does "both" mean? Is it "Facultative anaerobic"? Or does it mean that its descendants can be one or the other?

kbeckenrode commented 4 years ago

@tosfos

I did a little more digging on row 66 (_k_Bacteria_p_Gemmatimonadetes), and let's call it facultative anaerobic. I was parsing through some confusing text.

I think we should only need to support one source per context.

Thanks!

lwaldron commented 4 years ago

@kbeckenrode, can you add a column of NCBI IDs? You could ask for help from Jonathan...

kbeckenrode commented 4 years ago

@tosfos @lwaldron

Here is an updated sheet with the NCBI number column. There are three rows without an ID. Thanks to Jonathan for the good work!

MBphys_NCBI_ID .xlsx

tosfos commented 4 years ago

Please review the attached spreadsheet. Do the column headers and order make sense scientifically? Also, please see how we combined taxa that had multiple rows. Would it be easy for you to fill in the attached spreadsheet using this format? Taxon.xlsx

Regarding the "Context" field, is it serving 2 different purposes? Sometimes it gives a location like "body site" and sometimes it mentions "Except..." Should it be renamed "Notes"?

kbeckenrode commented 4 years ago

The column heading and order make sense.

@lwaldron, @lgeistlinger and I had discussed the pros and cons of combining attributes in the same taxon row. The main problem is that inheritance may not always hold for all the taxa. For example, an entire phylum may be anaerobic, but not all of its species have the same cell shape. This is why we thought splitting each property into separate rows would help avoid this issue.

lwaldron commented 4 years ago

I think @tosfos’s “wide format” table contains exactly the same information as what you had, just with data for one taxid collected onto a single row. I don’t see any limitations in this format for annotating exceptions to inherited properties.

A couple questions/comments:

it looks like there is no requirement that “Taxon[Attribute name]” remains the same down the column? Ie, cell C2 is “Gram Stain” in the first row for taxid 1090, but then “Gram Stain” could be recorded in column H3 for the next taxid 1117.
we should be able to repeat an attribute with different values and different contexts for the same Taxon. For example, a taxon that is aerobic in one context and anaerobic in another.

I think the “Context” field should remain context (analogous to the “condition” column in BugSigDB) and not used as a free-form notes entry. We should use an ontology (like EFO) for this.
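To show that the long and wide layouts carry the same information, here is a small sketch that collects long-format rows into one record per taxid while keeping repeated attributes with different contexts. The field names follow the columns described earlier in the thread; the function name and the row values are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LongRow = Dict[str, str]
WideTable = Dict[str, Dict[str, List[Tuple[str, str]]]]


def long_to_wide(rows: List[LongRow]) -> WideTable:
    """Group long-format rows by taxid, then by attribute.

    Each attribute maps to a list of (value, context) pairs, so the same
    attribute can appear more than once for a taxon.
    """
    wide = defaultdict(lambda: defaultdict(list))
    for row in rows:
        wide[row["taxid"]][row["attribute"]].append((row["value"], row["context"]))
    return {taxid: dict(attributes) for taxid, attributes in wide.items()}


# Made-up rows: one taxon with a context-dependent attribute.
rows = [
    {"taxid": "1", "attribute": "oxygen utilization", "value": "aerobic", "context": "context A"},
    {"taxid": "1", "attribute": "oxygen utilization", "value": "anaerobic", "context": "context B"},
    {"taxid": "1", "attribute": "gram stain", "value": "negative", "context": ""},
]
print(long_to_wide(rows))
```

Keying by attribute name rather than by column position also sidesteps the question of whether an attribute stays in the same column from row to row.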

seandavi commented 4 years ago

I think we may be trying to do too much in the BugSigDB database as currently implemented. I'd separate concerns a bit (which I think @tosfos has hinted at) to isolate the creation of signatures, which mainly rely on getting taxa identifiers, and taxon annotation tasks. The taxon annotation tasks will benefit from a richer data ecosystem that includes multiple data sources (each of which may have their own data modeling concept) and require a translation process that includes logic more complex than what can be included in the columns of a spreadsheet.

In terms of BugSigDB, I suspect that there will need to be a many-to-many relationship between taxa and characteristics. The data underlying that many-to-many relationship can be derived from a richer data resource (a separate working database) that focuses on integrating interrelated microbial characteristics to appropriate taxa.

ERD

This is just a suggestion and doesn't have to be immediately implemented, but the information content of many-to-many and "context" seems important. That context may even need to be study-specific, so adding a study id to the bridging table may be necessary.
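A minimal sketch of that many-to-many layout, using Python's built-in `sqlite3` module with an in-memory database. The table and column names follow the ERD; the optional `study_id` column, the inserted rows, and the choice of SQLite are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE taxon (
        taxon_id   INTEGER PRIMARY KEY,
        taxon_name TEXT NOT NULL
    );
    CREATE TABLE characteristic (
        characteristic_id INTEGER PRIMARY KEY,
        characteristic    TEXT NOT NULL,
        ontology_id       TEXT
    );
    -- Bridging table: one row per (taxon, characteristic, context).
    CREATE TABLE taxon_to_characteristic (
        taxon_id          INTEGER NOT NULL REFERENCES taxon(taxon_id),
        characteristic_id INTEGER NOT NULL REFERENCES characteristic(characteristic_id),
        context           TEXT,
        study_id          TEXT  -- hypothetical, for study-specific context
    );
    """
)

# Made-up rows: one taxon carrying two characteristics in different contexts.
conn.execute("INSERT INTO taxon VALUES (1, 'Taxon A')")
conn.executemany(
    "INSERT INTO characteristic VALUES (?, ?, ?)",
    [(1, "aerobic", None), (2, "anaerobic", None)],
)
conn.executemany(
    "INSERT INTO taxon_to_characteristic VALUES (?, ?, ?, ?)",
    [(1, 1, "context X", None), (1, 2, "context Y", None)],
)

rows = conn.execute(
    """
    SELECT t.taxon_name, c.characteristic, b.context
    FROM taxon_to_characteristic AS b
    JOIN taxon AS t ON t.taxon_id = b.taxon_id
    JOIN characteristic AS c ON c.characteristic_id = b.characteristic_id
    ORDER BY c.characteristic
    """
).fetchall()
print(rows)
```

Because context lives on the bridging row, the same taxon can be linked to seemingly conflicting characteristics without either table changing shape.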

kbeckenrode commented 4 years ago

@seandavi wow thanks for this explanation. what would be an example of a characteristic id? Is this an arbitrary number ID we assign?

seandavi commented 4 years ago

@seandavi wow thanks for this explanation. what would be an example of a characteristic id? Is this an arbitrary number ID we assign?

The "id" columns are "primary keys" in the database; they must be unique. There are roughly two types of primary keys:

Natural keys
Surrogate keys

Natural keys are things that occur naturally in the record that are unique; these could be external IDs or a name that is always unique. A surrogate key is one that is arbitrarily assigned to the record to be unique and allow the record to be identified uniquely across the database system.

A couple of general guidelines that I often try to follow (but these are not hard-and-fast rules) when designing a database:

Database primary keys should generally be the surrogate variety since we cannot control uniqueness of other things (e.g., the US Social Security Number is NOT unique!!!)
Database primary keys should generally not carry "information" such as names/locations, etc., since these can also change over time, etc.

Note that what I am describing above is the "physical" model on the database system and not the "conceptual" model. The conceptual model and the physical model are not always identical. For example, the conceptual model does not necessarily need a surrogate primary key, but the physical model does. That said, the conceptual model does need a primary key in the sense that one needs to be able to uniquely identify members of each conceptual entity (like "sample" or "taxon" or "characteristic").
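To make the surrogate/natural distinction concrete, here is a small sketch using Python's built-in `sqlite3` module. The surrogate key is an arbitrary integer assigned by the database, while the NCBI taxID is kept as a separate unique column that may be empty. The table layout, the names, and the taxID value are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE taxon (
        taxon_id   INTEGER PRIMARY KEY,  -- surrogate key, assigned by the database
        ncbi_taxid INTEGER UNIQUE,       -- natural key, may be missing or change
        taxon_name TEXT NOT NULL
    );
    CREATE TABLE annotation (
        annotation_id INTEGER PRIMARY KEY,
        taxon_id      INTEGER NOT NULL REFERENCES taxon(taxon_id),
        note          TEXT
    );
    """
)

# A taxon that has no NCBI ID yet still gets a usable primary key.
cursor = conn.execute(
    "INSERT INTO taxon (ncbi_taxid, taxon_name) VALUES (?, ?)",
    (None, "Unplaced taxon"),
)
taxon_pk = cursor.lastrowid
conn.execute(
    "INSERT INTO annotation (taxon_id, note) VALUES (?, ?)", (taxon_pk, "example note")
)

# Later the taxon is assigned an ID and renamed; the annotation is untouched.
conn.execute(
    "UPDATE taxon SET ncbi_taxid = ?, taxon_name = ? WHERE taxon_id = ?",
    (999999, "Renamed taxon", taxon_pk),
)
row = conn.execute(
    "SELECT t.taxon_name, a.note FROM annotation AS a JOIN taxon AS t USING (taxon_id)"
).fetchone()
print(row)
```

This also covers the earlier concern about taxa that NCBI does not know about, or whose names change between taxonomy versions.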

The code for producing the plot above is written in "graphviz dot" language. Save the following as bugsigdb.dot, install graphviz, and then run the command line below.

digraph structs {
    node [shape=record];
    struct1 [label="TAXON | <pk> taxon id | taxon name"];
    struct2 [label="CHARACTERISTICS | <pk> characteristic id | characteristic | ontology id"];
    struct3 [label="TAXON_TO_CHARACTERISTICS | <fk2> characteristic id | <fk1> taxon id | context"];
    struct1:pk -> struct3:fk1;
    struct2:pk -> struct3:fk2;
    rankdir = LR;
}

dot -o bugsigdb.png -Tpng bugsigdb.dot

kbeckenrode commented 4 years ago

Hi @tosfos, from a conversation with @lwaldron and @lgeistlinger about the context column, we think it's best to proceed by adding a write-in option with automated fill-ins. I can provide you with a list of terms for the option, if that is helpful.

kbeckenrode commented 3 years ago

@tosfos @lwaldron

Here is a revised and expanded data model. This new model is based on your suggestions and some other feedback. We have 1,616 microbes at the genus level annotated for Gram stain, respiration, size and cell shape.

I added two new column types: evidence (yes means lab based evidence, no means no lab based evidence. This will be more important later when we start adding more predicted annotations and for unknown species) and inheritance (yes means annotation inherits downward to the species, no means species do not have the trait). Evidence and inheritance are columns that are associated with each attribute.

Take a look and let me know if you have any questions. Taxon_annotations_20200820.xlsx

kbeckenrode commented 3 years ago

Thanks to @lgeistlinger, we are going to incorporate evidence codes in our evidence column. This means the evidence column will have a drop-down menu of these acronyms:

I'll send the updated file today.

kbeckenrode commented 3 years ago

@tosfos @lwaldron Updated columns with data validation drop-downs

Taxon_annotations_datavalidation.xlsx

lwaldron commented 3 years ago

Excellent. A few notes / questions for @kbeckenrode and @tosfos :

1. Almost all of these include an NCBI ID, so I think the "genus" column can be ignored, as can any rows that don't have an NCBI ID.
2. I'm unconvinced about having columns like "Taxon[Attribute name]", where the name of the attribute is specified, followed by the column "gram_type", which provides the attribute values. I would have called the second column "Taxon[Attribute value]", keeping with generic column names and informative row entries.
3. It would be advantageous for us to be able to define allowed vocabulary for name-value combinations (for example, Oxygen Utilization can be facultatively anaerobic, obligately aerobic, or aerobic) through an administrative page. I'm not sure what to make of an entry like the one on row 40, "Organisms are aerobic, microaerobic, facultatively anaerobic, or chemoorganotrophic, having both respiratory and fermentative types of metabolism".
4. @kbeckenrode it would be nice to have at least one example (even a made-up, incorrect one to be fixed later) with an evidence type other than EXP, and inheritance other than Yes, in order to discuss and test how these are handled in the wiki.
5. @tosfos since we have a study data model, should we somehow use that for the Taxon[Attribute source] values? However, these may not have an associated PMID.

kbeckenrode commented 3 years ago

@lwaldron @tosfos

1. For manual curation, I think having the taxon ID written out could be helpful. But, maybe we re-phrase that column to [Name] instead.
2. The only time I can see the "Taxon[Attribute name]" being helpful is when there are many attributes to choose from, like, shape or even oxygen utilization. But, you're right, the two columns can probably be merged.
3. I'll have to look at row 40 more closely, but my guess is that there is a lot of variability among the species. Will we be able to choose more than one value in some columns, or do we have to provide every combination (yikes!)?
4. Attached to this comment are (real) examples of a different evidence type (secretion system prediction in rhizobium bacteria) and of an attribute that does not inherit downwards (human-associated!).

Taxon_annotations_20200828.xlsx

tosfos commented 3 years ago

@lgeistlinger

1. Almost all of these include NCBI ID, so I think the "genus" column can be ignored
Makes sense

and any rows that don't have an NCBI ID.

So just not import those rows? Won't we be missing that data then?

2. I'm unconvinced about having columns like "Taxon[Attribute name]", where the name of the attribute is specified, followed by the column "gram_type", which provides the attribute values. I would have called the second column "Taxon[Attribute value]", keeping with generic column names and informative row entries.

Definitely.

3. It would be advantageous for us to be able to define allowed vocabulary for name-value combinations (for example, Oxygen Utilization can be facultatively anaerobic, obligately aerobic, or aerobic) through an administrative page. I'm not sure what to make of an entry like on row 40, "Organisms are aerobic, microaerobic, facultatively anaerobic, or chemoorganotrophic, having both respiratory and fermentative types of metabolism"

We can use the existing system for allowed values, like https://bugsigdb.org/Help:Admin . You can either provide that vocabulary, or modify the spreadsheet for additional data validation dropdowns.
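
As a rough illustration of what checking name-value combinations against an allowed vocabulary could look like, here is a minimal sketch. The dictionary layout and the function name invalid_values are hypothetical; only the three Oxygen Utilization values come from this thread, and nothing here reflects how the wiki's allowed-values system is actually implemented.

```python
# Hypothetical controlled vocabulary, keyed by attribute name.
# The Oxygen Utilization values are the ones mentioned in this thread;
# the structure itself is an assumption for illustration.
ALLOWED_VALUES = {
    "Oxygen Utilization": {
        "facultatively anaerobic",
        "obligately aerobic",
        "aerobic",
    },
}


def invalid_values(attribute_name, values):
    """Return the values that are not allowed for the given attribute."""
    allowed = ALLOWED_VALUES.get(attribute_name)
    if allowed is None:
        # Unknown attribute name: treat every value as invalid.
        return list(values)
    return [value for value in values if value not in allowed]


# "microaerobic" is not in the example vocabulary, so it is flagged.
print(invalid_values("Oxygen Utilization", ["aerobic", "microaerobic"]))
```

A free-text entry such as the one on row 40 would fail this kind of check as a whole, which is one argument for splitting it into separate structured values first.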

5. @tosfos since we have a study data model, should we somehow use that for the Taxon[Attribute source] values? However, these may not have an associated PMID.

We can certainly link to specific Study pages. If we want to keep the data as not bugsigdb-specific as possible, I guess the Attribute source can instead be set to a PMID, DOI, etc (whatever is available) and then if there is a Study that shares the same ID we can automatically link the two.
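
The automatic linking idea could work roughly as follows. This is only a sketch under stated assumptions: the studies mapping, the page titles, and the identifiers are all made-up placeholders, and the functions normalize_id and link_source_to_study are hypothetical names, not part of the existing wiki code.

```python
def normalize_id(identifier):
    """Normalize an identifier (PMID, DOI, ...) so that trivially
    different spellings compare equal."""
    return identifier.strip().lower().replace(" ", "")


def link_source_to_study(attribute_source, studies):
    """Return the Study page whose identifiers include the attribute
    source, or None when no Study shares that identifier."""
    key = normalize_id(attribute_source)
    for study_page, study_ids in studies.items():
        if key in {normalize_id(i) for i in study_ids}:
            return study_page
    return None


# Placeholder Study pages and identifiers (not real records).
studies = {
    "Study 1": ["PMID:00000001", "10.0000/example.1"],
    "Study 2": ["PMID:00000002"],
}

print(link_source_to_study("PMID: 00000001", studies))   # matches Study 1
print(link_source_to_study("10.0000/other", studies))    # no match
```

Sources without a matching Study simply stay as plain identifiers, which keeps the annotation data usable outside bugsigdb.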

tosfos commented 3 years ago

@kbeckenrode

1. For manual curation, I think having the taxon ID written out could be helpful. But, maybe we re-phrase that column to [Name] instead.

I think what @lgeistlinger means is that once we have an NCBI ID column, we can look up the Genus name on the fly, so we don't need to manually include that data in the spreadsheet. Even if it's not what he means, it's true! :smile: So we don't really need the genus column. But there's no harm in leaving it in and we'll just ignore it on the import.

2. The only time I can see the "Taxon[Attribute name]" being helpful is when there are many attributes to choose from, like, shape or even oxygen utilization. But, you're right, the two columns can probably be merged.

We can leave this as two columns but just change the column heading. So instead of:

Taxon[Attribute name]	respiration
Oxygen Utilization	facultatively anaerobic
Oxygen Utilization	obligately aerobic

and

Taxon[Attribute name]	shapes
shape	cells are mainly cocci
shape	rod-shaped cells

just change to:

Taxon[Attribute name]	Taxon[Attribute value]
Oxygen Utilization	facultatively anaerobic
Oxygen Utilization	obligately aerobic

and

Taxon[Attribute name]	Taxon[Attribute value]
shape	cells are mainly cocci
shape	rod-shaped cells

keeping the first column as is, but changing the heading of all the second columns. But we can (and should) keep it as 2 columns.

3. I'll have to look at row 40 more closely, but my guess is that there is a lot of variability among the species. Will we be able to choose more than one value in some columns, or do we have to provide every combination (yikes!)?

We can definitely support more than one value, and that would be a much better system! Just separate each value with a comma or (if you want to use commas within a value) separate with a semicolon. But we'll need to be careful about cells like:

cells are coccoid, highly irregular; occurring singly almost exclusively

We should try to keep these values as structured and non-descriptive as possible. So:

cells are spherical, oval, or rod-shaped

could be:

spherical; oval; rod-shaped

I'm not sure if this is possible scientifically.

I see some cells describe the cell shape and some describe the spore shape. In that case maybe (again, I don't know if this makes scientific sense) split that into 2 different attribute groups. Like:

ncbi_id	genus	Taxon[Attribute name]	Taxon[Attribute value]	Taxon[Attribute name]	Taxon[Attribute value]
46123	Abiotrophia	Cell shape	spherical; oval; rod-shaped	Spore shape	cylindrical; rod-shaped

I hope that makes sense.
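
To make the proposed layout concrete, here is a minimal sketch of how an importer might read it. It assumes a tab-separated export, that every "Taxon[Attribute name]" column is immediately followed by its "Taxon[Attribute value]" column, and that values are separated by semicolons; the function name parse_rows is hypothetical and this is not the actual import code.

```python
import csv
import io

# The example row from above, as tab-separated text.
TSV = (
    "ncbi_id\tgenus\tTaxon[Attribute name]\tTaxon[Attribute value]\t"
    "Taxon[Attribute name]\tTaxon[Attribute value]\n"
    "46123\tAbiotrophia\tCell shape\tspherical; oval; rod-shaped\t"
    "Spore shape\tcylindrical; rod-shaped\n"
)


def parse_rows(text):
    """Parse rows with repeated attribute name/value column pairs."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    # Positions of every attribute-name column; the value column is
    # assumed to sit directly to its right.
    name_cols = [
        i for i, h in enumerate(header) if h == "Taxon[Attribute name]"
    ]
    records = []
    for row in reader:
        attributes = {}
        for i in name_cols:
            name, raw = row[i], row[i + 1]
            # Split multi-valued cells on semicolons and trim whitespace.
            attributes[name] = [
                v.strip() for v in raw.split(";") if v.strip()
            ]
        # The genus column is ignored; the NCBI ID identifies the taxon.
        records.append({"ncbi_id": int(row[0]), "attributes": attributes})
    return records


records = parse_rows(TSV)
print(records)
```

Splitting on semicolons rather than commas leaves room for commas inside a single value, as discussed above.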

lwaldron commented 3 years ago

This format makes perfect sense to me:

ncbi_id	genus	Taxon[Attribute name]	Taxon[Attribute value]	Taxon[Attribute name]	Taxon[Attribute value]
46123	Abiotrophia	Cell shape	spherical; oval; rod-shaped	Spore shape	cylindrical; rod-shaped

@kbeckenrode, unless you see a problem with this as a standard, would you revise the spreadsheet like this and repost?

kbeckenrode commented 3 years ago

Hi @tosfos @lwaldron Attached to this comment is the revised spreadsheet with the following updates:

Re-named columns
Cleaned up attribute values and separated by ;
Added cell arrangements column because this was intermingled with cell shape. Shape and arrangement are different.

Let me know if this reflects the suggestions :)

Bergey's physiologies_20200905.xlsx

lwaldron commented 3 years ago

In our last meeting with Curtis Huttenhower's lab, we came to the conclusion that an inheritance feature in bugsigdb.org is probably unnecessary, as long as we have an efficient way to create annotations programmatically e.g. through the API. This is because Ancestral State Reconstruction (ASR) is likely a better way to infer properties both up and down the hierarchy than a simple yes/no, and these could be annotated like any other property, just using ASR as an evidence type.

So perhaps this simplifies what remains to be done @tosfos?

kbeckenrode commented 3 years ago

@tosfos

I updated the data model to reflect the new ASR evidence. So we now have two evidence types: EXP (direct experimental evidence) and ASR (everything else). I also updated human-associated to host-associated.

Bugphys_edit_20200927.xlsx
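
With inheritance dropped, an annotation reduces to a taxon, an attribute name, a value, and one of the two evidence types. The sketch below shows one way such records could be built programmatically before being sent through an API. The class names, field names, and the example values are assumptions for illustration only; no real API call is shown because the API itself is not described in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    EXP = "EXP"  # direct experimental evidence
    ASR = "ASR"  # inferred by ancestral state reconstruction


@dataclass(frozen=True)
class Annotation:
    ncbi_id: int
    attribute_name: str
    attribute_value: str
    evidence: Evidence

    def to_row(self):
        """Flatten the annotation into a list of strings, e.g. for a
        spreadsheet row or an API payload."""
        return [
            str(self.ncbi_id),
            self.attribute_name,
            self.attribute_value,
            self.evidence.value,
        ]


# Made-up example values, not a curated annotation.
example = Annotation(46123, "Cell shape", "rod-shaped", Evidence.ASR)
print(example.to_row())
```

An ASR-derived annotation is stored exactly like an experimental one; only the evidence field differs.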

waldronlab / BugSigDB

Taxon pages (was: Microbial annotations) #6