waldronlab / BugSigDB

A microbial signatures database
https://bugsigdb.org

Taxon pages (was: Microbial annotations) #6

Closed lwaldron closed 3 years ago

lwaldron commented 4 years ago

How do we begin annotating the microbial taxonomy with morphological and physiological properties? For example:

We would want to be able to annotate individual taxa or hierarchical clades of the taxonomy. This is the high-priority item.

Eventually we would want to be able to export and analyze these like signatures, at any level of the taxonomy. This is lower-priority.

tosfos commented 4 years ago

So perhaps this simplifies what remains to be done @tosfos?

Yes.

I updated the data model to reflect the new ASR evidence.

Can we find a tax ID for the blank cells? Like Sulfurococcus or Bogoriella

For the "Gram stain" value field, right now many of them are "positive" or "negative", but many have notes. So that will make it difficult to semantically store this property. The result will be that if we want to query for "Gram stain" set to "negative" it may not catch all of the taxa. For example:

76632 | Thermobacillus | Gram Stain | negative; phylogenetic position is Gram-positive

The best option might be to add an "Attribute value note" field so this would look like:

76632 | Thermobacillus | Gram Stain | negative | phylogenetic position is Gram-positive

Or should the "note" data be stored in the Taxon[Attribute context] field?

Also, some cells are set to "positive or negative" and some are set to "variable". Is this the same thing?
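
A minimal sketch of the value/note split proposed above, assuming the note always follows the first semicolon. The `AttributeValue` type and the `note` field name are hypothetical, and the sketch does not settle whether "positive or negative" and "variable" mean the same thing.

```python
from typing import NamedTuple, Optional


class AttributeValue(NamedTuple):
    value: str
    note: Optional[str]  # hypothetical "Attribute value note" field


def split_value_and_note(raw: str) -> AttributeValue:
    """Split 'value; free-text note' at the first semicolon."""
    value, sep, note = raw.partition(";")
    note = note.strip()
    # Lowercase the value so "Negative" and "negative" query the same way.
    return AttributeValue(value.strip().lower(), note if sep and note else None)


parsed = split_value_and_note("negative; phylogenetic position is Gram-positive")
print(parsed.value)  # negative
print(parsed.note)   # phylogenetic position is Gram-positive
```

With the note held separately, a query for "Gram stain" equal to "negative" would also match the Thermobacillus row.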

tosfos commented 4 years ago

In the size field, some are set like:

0.3-0.4 × 3-7 µm

and some are set like:

0.6-0.8 µm wide; 1.6-3.0 µm in length

Is that 2 different representations of the same data type?

Again, if this data should be queryable, maybe the size field should be split into separate "width" and "length" attribute names.
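
As a rough sketch of how both size representations could be normalized into separate width and length ranges: the function name `parse_size` is hypothetical, and the code assumes that the "A × B" form lists width first and that all values are in µm.

```python
import re
from typing import Dict, Optional, Tuple

Range = Tuple[float, float]

# One number or a "low-high" range, e.g. "3" or "0.3-0.4" (hyphen or en dash).
_RANGE = r"(\d+(?:\.\d+)?)(?:\s*[-\u2013]\s*(\d+(?:\.\d+)?))?"
_MICRON = r"\s*[\u00b5\u03bc]m"  # micro sign or Greek mu, followed by "m"
_CROSS = re.compile(_RANGE + r"\s*[\u00d7xX]\s*" + _RANGE)
_WIDE = re.compile(_RANGE + _MICRON + r"\s+wide")
_LONG = re.compile(_RANGE + _MICRON + r"\s+(?:in length|long)")


def _to_range(low: str, high: Optional[str]) -> Range:
    # A single number becomes a degenerate range (low == high).
    return (float(low), float(high if high is not None else low))


def parse_size(raw: str) -> Dict[str, Optional[Range]]:
    """Return width and length ranges; None where a dimension is not found."""
    text = raw.strip()
    cross = _CROSS.search(text)
    if cross:
        # Assumption: "A x B" lists width first, then length.
        return {"width": _to_range(cross[1], cross[2]),
                "length": _to_range(cross[3], cross[4])}
    wide, long_ = _WIDE.search(text), _LONG.search(text)
    return {"width": _to_range(wide[1], wide[2]) if wide else None,
            "length": _to_range(long_[1], long_[2]) if long_ else None}


print(parse_size("0.6-0.8 \u00b5m wide; 1.6-3.0 \u00b5m in length"))
```

Entries that match neither pattern come back with both dimensions set to `None`, so they can be listed for manual cleanup rather than silently dropped.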

tosfos commented 4 years ago

Can you easily remove the excess whitespace? Like the leading whitespace in " round-ended rods; elongated rods; pointed ends"
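
One possible way to strip the excess whitespace in bulk, assuming multi-valued cells are separated by semicolons as in the example. `clean_value` is a hypothetical helper name.

```python
import re


def clean_value(raw: str) -> str:
    """Collapse runs of whitespace and trim each ';'-separated item."""
    items = (re.sub(r"\s+", " ", item).strip() for item in raw.split(";"))
    # Drop empty items left behind by stray or trailing semicolons.
    return "; ".join(item for item in items if item)


print(clean_value(" round-ended rods; elongated rods; pointed ends"))
# round-ended rods; elongated rods; pointed ends
```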

tosfos commented 4 years ago

Can this be more structured?

Host associated Yes (both)

In other words, which values should this field allow? Is it a simple yes/no? Or should this be something like: None/partial/all ?

tosfos commented 4 years ago

Overall, it's looking nice! I'll just note here that we can just move forward with the spreadsheet as-is, as long as it's OK for the field storage to be "dumb" in that it will just be stored as plain text, unqueryable.

kbeckenrode commented 4 years ago

@tosfos thank you for pointing out some of the messy spots of the database.

I will update the query fields that you mentioned, like Gram stain (shouldn't have any notes), extra spaces (can be removed), and adjust the measurement (length and widths) into separate columns. I'll repost with these changes soon.

Host-associated is tricky because I was thinking about "how is it associated", like whether it is a pathogen or a commensal. But I think the type of relationship is not important here. I vote for a simple yes/no.

kbeckenrode commented 4 years ago

@tosfos I updated the Gram stain column, removed extra spaces, and generally tidied things up by curating a more consistent vocabulary throughout. I started fixing the size columns, but they are messy. This will take me until next week to fix. So, I wanted to share at least a nicer version with you until then.

Host-associated: I actually added a selection for either pathogen, commensal, both, or no. I imagine this will change as we continue to develop the data model. Bugphys_edit_20201001.xlsx

lwaldron commented 4 years ago

Thanks Kelly! Updating these taxon annotations will be an ongoing process that we'll need a good workflow for. So IMO the most important thing now is for this to serve as an example for creating the taxon data model and our workflow for updates and additions. @tosfos if you have to drop some nonconforming entries it shouldn't matter, assuming we'll be making ongoing updates and additions in some programmatic way. We will also end up with some of the same attributes annotated from different sources.

kbeckenrode commented 4 years ago

@tosfos Here is the data model with cleaned up sizes. I separated the widths and lengths into different columns. Bugphys_edit_20201006.xlsx

tosfos commented 4 years ago

Updating these taxon annotations will be an ongoing process that we'll need a good workflow for

What exactly will need to be updated? The data structure or just the data entries? And do you anticipate importing data (or data structures) by using a spreadsheet like this? Or by using an in-wiki form?

tosfos commented 4 years ago

Sorry if I was unclear. The column headings should remain constant across Attributes. So instead of:

Taxon[Attribute name] | Taxon[width] | Taxon[length] | Taxon[Attribute source] | Taxon[evidence]
size | 0.5-1.5 | 2-3 | Garrity, G.M., Winters, M. & Searles, D.B. 2001a. Taxonomic Outline of the Procaryotic Genera. Bergey's Manual of Systematic Bacteriology, Second Edition. Release 1.0, Apr 2001: 1-39. | EXP

We should do:

Taxon[Attribute name] | Taxon[Attribute value] | Taxon[Attribute source] | Taxon[evidence]
width | 0.5-1.5 | Garrity, G.M., Winters, M. & Searles, D.B. 2001a. Taxonomic Outline of the Procaryotic Genera. Bergey's Manual of Systematic Bacteriology, Second Edition. Release 1.0, Apr 2001: 1-39. | EXP

and as an additional set of columns:

Taxon[Attribute name] | Taxon[Attribute value] | Taxon[Attribute source] | Taxon[evidence]
length | 2-3 | Garrity, G.M., Winters, M. & Searles, D.B. 2001a. Taxonomic Outline of the Procaryotic Genera. Bergey's Manual of Systematic Bacteriology, Second Edition. Release 1.0, Apr 2001: 1-39. | EXP

Does that make sense?
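
The reshaping described in this comment could be sketched as follows, assuming each spreadsheet row is read as a dictionary keyed by column heading. The function name is hypothetical, and the shortened citation in the example stands in for the full Garrity et al. reference.

```python
from typing import Dict, List

# Columns copied unchanged onto every generated attribute row.
SHARED = ("Taxon[Attribute source]", "Taxon[evidence]")


def size_row_to_attribute_rows(row: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one wide 'size' row into one row per measured attribute."""
    rows = []
    for name in ("width", "length"):
        value = row.get(f"Taxon[{name}]", "").strip()
        if not value:
            continue  # skip measurements that were not recorded
        long_row = {"Taxon[Attribute name]": name, "Taxon[Attribute value]": value}
        long_row.update({key: row.get(key, "") for key in SHARED})
        rows.append(long_row)
    return rows


wide = {
    "Taxon[width]": "0.5-1.5",
    "Taxon[length]": "2-3",
    "Taxon[Attribute source]": "Garrity et al. 2001a",  # abbreviated for the example
    "Taxon[evidence]": "EXP",
}
for attribute_row in size_row_to_attribute_rows(wide):
    print(attribute_row)
```

Keeping the four column headings constant means a new attribute only adds rows, never columns.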

tosfos commented 4 years ago

Gram stain (shouldn't have any notes)

I might be missing something. I'm still seeing entries like

76632 | Thermobacillus | Gram Stain | negative; phylogenetic position is Gram-positive

kbeckenrode commented 4 years ago

Hi @tosfos--thanks for the feedback. Sorry for the error in the messy entries. I corrected the Gram Stain ones you pointed out. I also fixed the length and width column attribute names. Thanks for the clarity.

We will be expanding this dataset in data entries, which we would ideally like to enter using the Wiki instead of this spreadsheet. As we expand, we will also want to add new attributes too. So, being able to add new entries and attributes is important.

Speaking of expanding, the dataset could easily approach 100,000s of entries. Will that be a problem? The updated sheet I am attaching here has many new entries (~10,000) for a new attribute (antimicrobial resistance).

Bugphys_edit_20201012.xlsx

kbeckenrode commented 4 years ago

@tosfos We have an update to the data model to include two new Attribute columns: Taxon[Attribute inferred probability] and Taxon[Attribute inheritance probability]. I am sharing the updated data model with the new columns here.

@lwaldron and @lgeistlinger discussed more about how to incorporate ASR functionality into the data model. We are looking for a more automated way (maybe through use of an API) to update the database, and I believe Levi is going to chime in to add more clarity.

But going forward, just to be clear, it is important to have the following features:

  1. Bulk upload to the wiki
  2. Ability to add single/multiple entries on the wiki
  3. Add new attribute columns

Bugphys_edit_20201013.xlsx

lwaldron commented 4 years ago

Thanks @kbeckenrode. I confirm those two columns, with Taxon[Attribute inferred probability] to be used when inferring attributes through Ancestral State Reconstruction (up the taxonomic hierarchy), and Taxon[Attribute inheritance probability] to record the probability of an attribute being inherited (down the taxonomic hierarchy). I think that how these will be calculated and used is still an open question, but this should provide enough flexibility. I would state those features like this:

  1. programmatic editing for adding new attributes and updating existing attributes in bulk
  2. manual wiki editing as per usual
  3. ability to record more than one Attribute {value, source, context, context source, inferred probability, inheritance probability} and evidence for the same Taxon[NCBI ID] / Taxon[Attribute name] combination. This reflects the existence of complementary or conflicting pieces of evidence, and resolving them will be the job of future / downstream analysis.
  4. Add a new Attribute (with all the associated data, below) to a Taxon, manually or programmatically

Taxon[Attribute name] | Taxon[Attribute value] | Taxon[Attribute source] | Taxon[evidence] | Taxon[Attribute context] | Taxon[Attribute context source] | Taxon[Attribute inferred probability] | Taxon[Attribute inheritance probability]
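
A minimal sketch of how these eight fields and requirement 3 (several records for the same Taxon[NCBI ID] / Taxon[Attribute name] pair) might fit together. The class names, the in-memory store, the "source A" / "source B" strings, and the 0.8 probability are all made up for illustration; this is not the wiki's actual storage.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List, Optional, Tuple


@dataclass(frozen=True)
class TaxonAttribute:
    name: str
    value: str
    source: str
    evidence: str
    context: Optional[str] = None
    context_source: Optional[str] = None
    inferred_probability: Optional[float] = None     # set by ASR, up the hierarchy
    inheritance_probability: Optional[float] = None  # set by inheritance, down the hierarchy


class TaxonAttributeStore:
    """Keeps every record for an (NCBI ID, attribute name) pair; nothing is overwritten."""

    def __init__(self) -> None:
        self._records: DefaultDict[Tuple[int, str], List[TaxonAttribute]] = defaultdict(list)

    def add(self, ncbi_id: int, record: TaxonAttribute) -> None:
        self._records[(ncbi_id, record.name)].append(record)

    def get(self, ncbi_id: int, name: str) -> List[TaxonAttribute]:
        return list(self._records.get((ncbi_id, name), []))


store = TaxonAttributeStore()
# Two conflicting records for the same taxon and attribute (placeholder sources).
store.add(76632, TaxonAttribute("Gram stain", "negative", "source A", "EXP"))
store.add(76632, TaxonAttribute("Gram stain", "positive", "source B", "ASR",
                                inferred_probability=0.8))
print(len(store.get(76632, "Gram stain")))  # 2
```

Both records are kept side by side, which leaves resolving the conflict to downstream analysis as described in item 3.
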

tosfos commented 4 years ago

Speaking of expanding, the dataset could easily approach 100,000s of entries. Will that be a problem?

No.

Regarding:

Taxon [other IDs name] Taxon [other ID]

I'm assuming we won't need to support many IDs. We should just combine them into a single column like:

Taxon [GenomeID] , similar to Taxon[NCBI ID]

tosfos commented 4 years ago

it is important to have the following features: Bulk upload to the wiki

Will this be via spreadsheets? Will they be modifying existing records? Or only adding new ones?

Ability to add single/multiple entries on the wiki

What do you mean by "multiple entries"? How is that different from adding a single entry a bunch of times?

Add new attribute columns

OK.

tosfos commented 4 years ago

programmatic editing for adding new attributes and updating existing attributes in bulk

I'm assuming this is the same as the "Bulk upload to the wiki" above. Will it be via spreadsheets or through an API?

Everything else sounds good.

tosfos commented 4 years ago

I'm wondering a bit about what should be the "primary key" (page title) of the Taxon data structure. I think there was some discussion about this.

Normally I'd lean toward the NCBI ID, since it's numeric. But not every row has this filled in, and also it looks like we could be supporting multiple ID fields.

Right now each Taxon row has a name (Taxon[name]). Should that be the page title? Or is a taxon name not set in stone? Could it be given a different name by different people (in which case maybe we should add an Alias field)?

My note from March (when we were working with the Microbe Directory) says:

Column C is for the Page title, which will be important. It should come from either Column D (what NCBI calls this species) or Column E (what the Microbe Directory calls this species.)

Thoughts?

lwaldron commented 4 years ago

programmatic editing for adding new attributes and updating existing attributes in bulk

I'm assuming this is the same as the "Bulk upload to the wiki" above. Will it be via spreadsheets or through an API?

Yes, we're talking about the same thing. If this is feasible through the API that would seem ideal, if it could be done from anywhere and through an authenticated user account. A few example use cases:

  1. There is a new version of a source database that annotates new taxa, adds new information about taxa that were present in the previous version, and updates the information that was available in the previous version.
  2. There is an update to the NCBI taxonomy that changes the outputs of our ancestral state reconstruction and inheritance algorithms
  3. After some manual curation of experimentally supported taxa attributes in the bugsigdb.org wiki, we want to download those and update our computationally predicted taxa attributes

So it seems like in general these do involve both adding and editing existing data. In both cases I would want to maintain versions / histories.

lwaldron commented 4 years ago

I'm wondering a bit about what should be the "primary key" (page title) of the Taxon data structure. I think there was some discussion about this.

Let me discuss with others this Weds/Thurs and get back to you, would like to get this right the first time.

lwaldron commented 3 years ago

I'm wondering a bit about what should be the "primary key" (page title) of the Taxon data structure. I think there was some discussion about this.

@tosfos following up on this, an idea that should cover all cases (from @seandavi) E.g. for row 37 in the above Bugphys_edit_20201013.xlsx:

NAMESPACE::ID.NCBI_TAXON::1082934

with alias:

NAMESPACE::ID.GenomeID::1082934.5

So wherever NCBI taxid is available we'll use that for the primary key with other IDs present as aliases, but we can also include taxa that do not have an NCBI taxid or multiple sub-species strains that all belong to the same NCBI taxid by using a different identifier as the primary key.
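
The key scheme above could be sketched like this. The `NAMESPACE::ID.` prefix is copied literally from the examples; the helper names are hypothetical, and falling back to the alphabetically first ID type when no NCBI taxid exists is an arbitrary choice made only for this sketch.

```python
from typing import Dict, List, Tuple

PREFIX = "NAMESPACE::ID."  # copied literally from the examples in this thread


def make_key(id_type: str, identifier: str) -> str:
    return f"{PREFIX}{id_type}::{identifier}"


def parse_key(key: str) -> Tuple[str, str]:
    """Split a key back into (ID type, identifier)."""
    if not key.startswith(PREFIX):
        raise ValueError(f"unrecognized key: {key!r}")
    id_type, sep, identifier = key[len(PREFIX):].partition("::")
    if not sep or not id_type or not identifier:
        raise ValueError(f"malformed key: {key!r}")
    return id_type, identifier


def primary_and_aliases(ids: Dict[str, str]) -> Tuple[str, List[str]]:
    """Prefer the NCBI taxid as primary key; every other ID becomes an alias."""
    if not ids:
        raise ValueError("a taxon needs at least one identifier")
    # Assumption: without an NCBI taxid, fall back to the first ID type by name.
    primary_type = "NCBI_TAXON" if "NCBI_TAXON" in ids else sorted(ids)[0]
    aliases = [make_key(t, v) for t, v in sorted(ids.items()) if t != primary_type]
    return make_key(primary_type, ids[primary_type]), aliases


primary, aliases = primary_and_aliases({"GenomeID": "1082934.5", "NCBI_TAXON": "1082934"})
print(primary)  # NAMESPACE::ID.NCBI_TAXON::1082934
print(aliases)  # ['NAMESPACE::ID.GenomeID::1082934.5']
```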

tosfos commented 3 years ago

1. There is a new version of a source database that annotates new taxa, adds new information about taxa that were present in the previous version, and updates the information that was available in the previous version.

This is tricky. What if there's new information about a taxon, but that taxon was separately edited in the wiki and now contains correct content that is not present in the source database? If we overwrite the taxon's page, we will lose that info.

2. There is an update to the NCBI taxonomy that changes the outputs of our ancestral state reconstruction and inheritance algorithms

We're "live" retrieving this information from NCBI and not storing it in the wiki itself. So this update already happens automatically in real-time.

3. After some manual curation of experimentally supported taxa attributes in the bugsigdb.org wiki, we want to download those and update our computationally predicted taxa attributes

I'm assuming this will be a CSV or similar. Shouldn't be an issue.
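
One way to address the overwrite concern raised for the first use case is a three-way merge against a snapshot of the last import, so that fields edited in the wiki are never replaced silently. This is only a sketch of that idea, not something agreed in this thread; `merge_taxon` and the example field values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Fields = Dict[str, Optional[str]]


def merge_taxon(base: Fields, wiki: Fields, incoming: Fields) -> Tuple[Fields, List[str]]:
    """Merge a new source version into a wiki page.

    base: field values as of the previous import
    wiki: field values currently on the wiki page
    incoming: field values in the new source version
    Returns the merged fields and the names of fields needing manual review.
    """
    merged: Fields = {}
    conflicts: List[str] = []
    for key in sorted(set(base) | set(wiki) | set(incoming)):
        b, w, i = base.get(key), wiki.get(key), incoming.get(key)
        if w == b:
            merged[key] = i  # wiki untouched since the last import: accept the source
        elif i == b or i == w:
            merged[key] = w  # only the wiki changed, or both sides agree
        else:
            merged[key] = w  # both changed differently: keep the wiki value, flag it
            conflicts.append(key)
    return {k: v for k, v in merged.items() if v is not None}, conflicts


merged, conflicts = merge_taxon(
    base={"Gram stain": "negative", "shape": "rod"},
    wiki={"Gram stain": "negative", "shape": "curved rod"},   # curator edited shape
    incoming={"Gram stain": "positive", "shape": "rod", "width": "0.5-1.5"},
)
print(merged)
print(conflicts)  # []
```

Keeping the previous import as `base` also fits the request to maintain versions and histories.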

tosfos commented 3 years ago

@tosfos following up on this, an idea that should cover all cases (from @seandavi) E.g. for row 37 in the above Bugphys_edit_20201013.xlsx:

NAMESPACE::ID.NCBI_TAXON::1082934

with alias:

NAMESPACE::ID.GenomeID::1082934.5

So wherever NCBI taxid is available we'll use that for the primary key with other IDs present as aliases, but we can also include taxa that do not have an NCBI taxid or multiple sub-species strains that all belong to the same NCBI taxid by using a different identifier as the primary key.

Sorry, I'm a bit confused about this. What happens if a taxon has neither an NCBI taxid nor a GenomeID? Or is that impossible because GenomeID is something we're making up as needed? And is the ".5" in the GenomeID representing the ID of a fifth sub-species? Would we then need sub-species of sub-species and add a decimal for each level?

kbeckenrode commented 3 years ago

@tosfos, it could be possible that we describe a taxon without either ID. NCBI taxon IDs are not the only identification standards available. GenomeID is a method of identifying strains of a species. For example, "Acinetobacter baumannii" is a genus and species name with an NCBI taxon ID of 470. In addition, many strains of the A. baumannii species have been identified with sequencing. For example, A. baumannii strain 182 has a GenomeID 470.6501. I'm not sure how the decimal numbers are decided.

In addition, I'm adding this updated spreadsheet because the size attribute columns had an error. It is corrected here. Bugphys_edit_20201026.xlsx

lwaldron commented 3 years ago

GenomeID is made up as needed, and could be anything. This wasn't a good example because NAMESPACE::ID.GenomeID::1082934.5 is a child of NAMESPACE::ID.NCBI_TAXON::1082934, because it is a substrain of the species.

A better example of true aliases would be between alternative taxonomic systems, e.g. SILVA, RDP, Greengenes, NCBI and OTT (https://www.ncbi.nlm.nih.gov/pubmed/28361695 "SILVA, RDP, Greengenes, NCBI and OTT— how do these taxonomies compare?"). These other taxonomies provide alternative names and identifiers, but many are aliases of each other.

This makes me wonder if "parent" should be a part of the taxon data model, so that the hierarchy is built into the data model rather than existing somehow separately. Potentially then "rank" would also be a part of the data model. In this approach the wiki would not be specifically tied to any taxonomy, even if our curation is based primarily on the NCBI taxonomy. We would create an initial set of "stub" pages with IDs, names, aliases, and parents based on all those above taxonomies. This approach seems to contrast against the model in the above spreadsheet, which has only taxid and name, and relies on NCBI for the taxonomy. This merits some discussion with @tosfos and others. Ike, we have a group meeting with everyone (including Curtis Huttenhower from Harvard) on Thursday at 11:30am, would you be able to attend at least the start of that to try to finalize the data model?

tosfos commented 3 years ago

Sorry I wasn't able to make this meeting. In general mornings are difficult for me. I would say that creating our own taxonomy definitely gives us more control but would require us to keep it up to date. And how would we know what the correct taxonomy is? Wouldn't we just be checking the NCBI anyway?

lwaldron commented 3 years ago

No problem @tosfos. We've had a bunch of discussion in the meantime and are leaning towards keeping taxonomic information signature-centric, and having "phenotypic signatures" that are conceptually the same as the "experimental signatures" we currently have. Also I think you're right about just sticking to the NCBI taxonomy. Here are some editable slides summarizing this thinking, that would involve creating a "phenotypic signature" data model that is nearly the same as the current "experimental signature" model, and adding support for externally-calculated similarity scores to other signatures. Would be good to get your thoughts and discuss before implementing, but if you agree with the idea then I think we should be closer to finished than we thought at least in terms of data models. https://docs.google.com/presentation/d/1UJzfb5P505Swtd9jbs3oYNWFzpLsukqGBKHvERqxje8/edit?usp=sharing

tosfos commented 3 years ago

Looks interesting. Can we schedule a call to discuss?

lwaldron commented 3 years ago

Yes - see if you can find a good time at calendly.com/lwaldron, otherwise I can free up more times.

lwaldron commented 3 years ago

Or, we have a regularly-scheduled group meeting Wednesdays at noon Eastern time if that's a good time for you.

lwaldron commented 3 years ago

We would like for users to be able to look up a specific Taxon and get links to the signatures where it exists in the database. I'm not sure whether this means just a search feature, or actual pages for taxa.

tosfos commented 3 years ago

We would like for users to be able to look up a specific Taxon and get links to the signatures where it exists in the database. I'm not sure whether this means just a search feature, or actual pages for taxa.

It depends on if you're also storing data about the individual taxa. If we're storing data, they should get their own pages. If not, we can generate a query form with a result page.

tosfos commented 3 years ago

Or, we have a regularly-scheduled group meeting Wednesdays at noon Eastern time if that's a good time for you.

How about Wednesday 11/25 at noon Eastern?

lwaldron commented 3 years ago

How about Wednesday 11/25 at noon Eastern?

That'd be great!

lwaldron commented 3 years ago

Our discussion of taxon information has changed significantly. The implementation we discussed yesterday will provide "fake" taxon pages, without a taxon data model, that display:

  1. NCBI ID, name, and link to NCBI
  2. full lineage, with links to bugsigdb.org
  3. links to signatures containing that taxon
    • there is a question of whether to display signatures containing children or parents of that taxon, or only exactly that taxon.
    • It sounds like this is not a difficult decision to change later, so I am inclined just to take the simplest option, then consider changes later based on user feedback.

tosfos commented 3 years ago

We estimate around 8 hours for this, including some styling. Should we go ahead?

lwaldron commented 3 years ago

Yes, you can go ahead.

tosfos commented 3 years ago

After thinking about this some more, I should confirm something. We're not actually going to import the entire NCBI taxonomy, correct? All we're displaying is data that's already in the wiki because the wiki contains a signature with this NCBI entry (or a parent/grandparent of a signature's NCBI entry).

Is this correct?
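
The set of taxa that would get a "fake page" under this reading can be sketched as every taxon named in a signature plus all of its ancestors. The `parent_of` mapping is a hypothetical stand-in for whatever lineage lookup the wiki uses, and taxon names replace NCBI IDs to keep the example short.

```python
from typing import Dict, Iterable, Optional, Set


def taxa_needing_pages(signature_taxa: Iterable[str],
                       parent_of: Dict[str, Optional[str]]) -> Set[str]:
    """Every taxon named in a signature, plus all of its ancestors."""
    pages: Set[str] = set()
    for taxon in signature_taxa:
        current: Optional[str] = taxon
        # Stop early once we reach a taxon whose ancestors were already added.
        while current is not None and current not in pages:
            pages.add(current)
            current = parent_of.get(current)
    return pages


parent_of = {
    "Bacteria": None,
    "Firmicutes": "Bacteria",
    "Clostridia": "Firmicutes",
    "Clostridiales": "Clostridia",
    "Lachnospiraceae": "Clostridiales",
    "Ruminococcaceae": "Clostridiales",
}
print(sorted(taxa_needing_pages(["Lachnospiraceae"], parent_of)))
# ['Bacteria', 'Clostridia', 'Clostridiales', 'Firmicutes', 'Lachnospiraceae']
```

Ruminococcaceae gets no page in this example because no signature contains it or any of its descendants.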

lwaldron commented 3 years ago

That seems correct to me. The entire NCBI taxonomy would have to be pruned of non-microbial taxa, and I don't see much point of having pages for taxa that are not contained or directly parent/child of one in a signature.

At some point in the future when we create phenotypic signatures, we could create signatures for things like "human gut microbiome", "human oral microbiome", "human microbiome", "environmental microbiomes", "all bacteria", "all archaea" etc that would bring in everything else if we wanted to.

tosfos commented 3 years ago

We created an initial version of the Taxon "fake pages". You can see an example here. We're going to remove the query lookup (input field and "run query" button) from the top of that page soon (maybe before you read this) so that it will look like a true "fake page" instead of a query results page.

tosfos commented 3 years ago

We're also going to modify Signature pages to link each NCBI item to a Taxon page. But it will be pretty busy then. I'd suggest that we remove the current NCBI link and maybe also the current hover action. We can hold off on that decision until we add the Taxon links to the Signature pages and see how it looks.

tosfos commented 3 years ago

I feel like we can make the Lineage display much more intuitive and also add a query for immediate descendants of this Taxon. For a page like "Clostridiales", I'd suggest something like:

superkingdom Bacteria phylum Firmicutes class Clostridia


Order: Clostridiales NCBI ID: 186802


Family taxa in this Order: Lachnospiraceae, Peptostreptococcus, Ruminococcaceae

Signatures containing the Clostridiales taxon (186802)

  1. 16S gut community of the Cameron County Hispanic Cohort/Experiment 40/Signature 67
  2. 16S rRNA amplicon sequencing identifies microbiota associated with oral cancer, human papilloma virus infection and surgical treatment/Experiment 291/Signature 485
  3. A prospective study to examine the association of the urinary and fecal microbiota with prostate cancer diagnosis after transrectal biopsy of the prostate using 16sRNA gene analysis/Experiment 373/Signature 633 ...

tosfos commented 3 years ago

Alternatively, we can move the Lineage to the right side of the page into an "infobox" and do something like

superkingdom Bacteria phylum Firmicutes class Clostridia Order Clostridiales

Family taxa in this Order: Lachnospiraceae, Peptostreptococcus, Ruminococcaceae

tosfos commented 3 years ago

And finally, we can do a breadcrumbs-style lineage like:

Bacteria > Firmicutes > Clostridia > Clostridiales

Order: Clostridiales NCBI ID: 186802

Family taxa in this Order: Lachnospiraceae, Peptostreptococcus, Ruminococcaceae
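
A plain-text sketch of how the breadcrumbs layout could be generated from a lineage, assuming the lineage arrives as (rank, name) pairs ordered from root to the current taxon. `render_taxon_header` is a hypothetical name, and a real page would emit links rather than bare names.

```python
from typing import Sequence, Tuple


def render_taxon_header(lineage: Sequence[Tuple[str, str]], ncbi_id: int,
                        child_rank: str, children: Sequence[str]) -> str:
    """Render the breadcrumb line, the taxon line, and the list of child taxa."""
    rank, name = lineage[-1]  # the last entry is the taxon being displayed
    lines = [
        " > ".join(taxon for _, taxon in lineage),
        f"{rank.capitalize()}: {name} NCBI ID: {ncbi_id}",
    ]
    if children:
        lines.append(
            f"{child_rank.capitalize()} taxa in this {rank.capitalize()}: "
            + ", ".join(sorted(children))
        )
    return "\n\n".join(lines)


print(render_taxon_header(
    lineage=[("superkingdom", "Bacteria"), ("phylum", "Firmicutes"),
             ("class", "Clostridia"), ("order", "Clostridiales")],
    ncbi_id=186802,
    child_rank="family",
    children=["Lachnospiraceae", "Peptostreptococcus", "Ruminococcaceae"],
))
```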

tosfos commented 3 years ago

Please let me know which of these 3 options you prefer. We can mock-up any or all of these if you'd like.

lwaldron commented 3 years ago

I kind of like the breadcrumbs-style lineage, since it seems the most compact without losing any essential information.

lwaldron commented 3 years ago

A couple more improvements to the taxon "fake pages" that would be nice if practical:

  1. group signatures by those with a direct relationship (ie the exact taxon is present in the signature), and those with a relationship only by inheritance. Would put direct relationships in a group first, followed by signatures by inheritance.
  2. list one row per study, with another column with just numbers and hyperlinks to signatures within that study? For example on Bacillaceae taxon (186817), somehow grouping rows 2-6 onto one row.

tosfos commented 3 years ago

We added the breadcrumbs-style lineage together with descendant queries. We now have a really nice drilldown-style display. Please see here.

tosfos commented 3 years ago

We'll see what's involved in adding the 2 requested grouping features.
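
The two requested grouping features could be sketched roughly as below, assuming "by inheritance" means the signature contains a descendant of the page's taxon but not the taxon itself. The `Signature` type, the study names, and the signature numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class Signature(NamedTuple):
    study: str
    signature_id: int
    taxa: Set[str]


def group_signatures(taxon: str, descendants: Set[str],
                     signatures: Iterable[Signature]) -> Dict[str, Dict[str, List[int]]]:
    """Map 'direct' / 'inherited' -> study -> sorted signature IDs."""
    groups: Dict[str, Dict[str, List[int]]] = {
        "direct": defaultdict(list),     # the exact taxon is in the signature
        "inherited": defaultdict(list),  # only a descendant is in the signature
    }
    for sig in signatures:
        if taxon in sig.taxa:
            groups["direct"][sig.study].append(sig.signature_id)
        elif sig.taxa & descendants:
            groups["inherited"][sig.study].append(sig.signature_id)
    # One entry per study, which corresponds to one row per study on the page.
    return {kind: {study: sorted(ids) for study, ids in by_study.items()}
            for kind, by_study in groups.items()}


signatures = [
    Signature("Study A", 2, {"Bacillaceae"}),
    Signature("Study A", 1, {"Bacillaceae", "Lachnospiraceae"}),
    Signature("Study B", 5, {"Bacillus"}),
    Signature("Study C", 9, {"Ruminococcaceae"}),
]
print(group_signatures("Bacillaceae", {"Bacillus"}, signatures))
# {'direct': {'Study A': [1, 2]}, 'inherited': {'Study B': [5]}}
```

Because "direct" is inserted first, iterating over the result lists direct relationships ahead of inherited ones.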