Closed lwaldron closed 3 years ago
So perhaps this simplifies what remains to be done @tosfos?
Yes.
I updated the data model to reflect the new ASR evidence.
Can we find a tax ID for the blank cells? Like Sulfurococcus or Bogoriella
For the "Gram stain" value field, right now many of them are "positive" or "negative", but many have notes. That will make it difficult to store this property semantically. The result will be that a query for "Gram stain" set to "negative" may not catch all of the taxa. For example:
76632 | Thermobacillus | Gram Stain | negative; phylogenetic position is Gram-positive |
---|
The best option might be to add an "Attribute value note" field so this would look like:
76632 | Thermobacillus | Gram Stain | negative | phylogenetic position is Gram-positive |
---|
Or should the "note" data be stored in the Taxon[Attribute context] field?
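As a rough illustration of the split proposed above, here is a minimal sketch that separates a raw cell into a value and an optional note. It assumes the note always follows the first semicolon, which may not hold for every row in the spreadsheet; the function name `split_value_and_note` is hypothetical.

```python
def split_value_and_note(raw):
    """Split 'value; note' into (value, note); note is None when absent.

    Assumption: the first semicolon separates the value from the note.
    """
    value, sep, note = raw.strip().partition(";")
    note = note.strip()
    return value.strip(), (note if sep and note else None)


# Example using the Thermobacillus row quoted above.
print(split_value_and_note("negative; phylogenetic position is Gram-positive"))
```

With the note stored separately, a query for "negative" would match this row on the value field alone.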
Also, some cells are set to "positive or negative" and some are set to "variable". Is this the same thing?
In the size field, some are set like:
0.3-0.4 × 3-7 µm
and some are set like:
0.6-0.8 µm wide; 1.6-3.0 µm in length
Is that 2 different representations of the same data type?
Again, if this data should be queryable, maybe the size field should be split into separate "width" and "length" attribute names.
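To make the width/length split concrete, here is a hedged sketch that parses the two size formats quoted above. It assumes that in the "a × b" form the first range is the width and the second is the length, which the spreadsheet does not state explicitly; `parse_size` and the pattern names are hypothetical.

```python
import re

# A single number or a range, e.g. "2" or "0.3-0.4".
RANGE = r"(\d+(?:\.\d+)?(?:\s*-\s*\d+(?:\.\d+)?)?)"
# Accept both the micro sign (U+00B5) and the Greek letter mu (U+03BC).
MICRON = r"[\u00b5\u03bc]m"

# Format 1: "0.3-0.4 × 3-7 µm"; a plain "x" is accepted in place of "×" (U+00D7).
CROSS = re.compile(RANGE + r"\s*[\u00d7x]\s*" + RANGE + r"\s*" + MICRON)
# Format 2: "0.6-0.8 µm wide; 1.6-3.0 µm in length"
WIDE_LONG = re.compile(
    RANGE + r"\s*" + MICRON + r" wide;\s*" + RANGE + r"\s*" + MICRON + r" in length"
)


def parse_size(raw):
    """Return {'width': ..., 'length': ...} (values in µm), or None if unrecognized.

    Assumption: in format 1 the first range is the width.
    """
    text = raw.strip()
    for pattern in (CROSS, WIDE_LONG):
        match = pattern.fullmatch(text)
        if match:
            return {"width": match.group(1), "length": match.group(2)}
    return None
```

Cells that match neither pattern return `None`, so they can be flagged for manual cleanup instead of being guessed at.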
Can you easily remove the excess whitespace? Like the leading whitespace in " round-ended rods; elongated rods; pointed ends"
Can this be more structured?
Host associated | Yes (both) |
---|
In other words, which values should this field allow? Is it a simple yes/no? Or should this be something like: None/partial/all ?
Overall, it's looking nice! I'll just note here that we can just move forward with the spreadsheet as-is, as long as it's OK for the field storage to be "dumb" in that it will just be stored as plain text, unqueryable.
@tosfos thank you for pointing out some of the messy spots of the database.
I will update the query fields that you mentioned: Gram stain (shouldn't have any notes), extra spaces (can be removed), and the measurements (lengths and widths, to be split into separate columns). I'll repost with these changes soon.
Host-associated is tricky because I was thinking about "how is it associated", like whether it is a pathogen or a commensal. But I think the type of relationship is not important here. I vote for a simple yes/no.
@tosfos I updated the Gram stain column, removed extra spaces, and generally tidied things up by curating a more consistent vocabulary throughout. I started fixing the size columns, but they are messy. This will take me until next week to fix. So, I wanted to share at least a nicer version with you until then.
Host-associated: I actually added a selection for either pathogen, commensal, both, or no. I imagine this will change as we continue to develop the data model. Bugphys_edit_20201001.xlsx
Thanks Kelly! Updating these taxon annotations will be an ongoing process that we'll need a good workflow for. So IMO the most important thing now is for this to serve as an example for creating the taxon data model and our workflow for updates and additions. @tosfos if you have to drop some nonconforming entries it shouldn't matter, assuming we'll be making ongoing updates and additions in some programmatic way. We will also end up with some of the same attributes annotated from different sources.
@tosfos Here is the data model with cleaned up sizes. I separated the widths and lengths into different columns. Bugphys_edit_20201006.xlsx
Updating these taxon annotations will be an ongoing process that we'll need a good workflow for
What exactly will need to be updated? The data structure or just the data entries? And do you anticipate importing data (or data structures) by using a spreadsheet like this? Or by using an in-wiki form?
Sorry if I was unclear. The column headings should remain constant across Attributes. So instead of:
Taxon[Attribute name] | Taxon[width] | Taxon[length] | Taxon[Attribute source] | Taxon[evidence] |
---|---|---|---|---|
size | 0.5-1.5 | 2-3 | Garrity, G.M., Winters, M. & Searles, D.B. 2001a. Taxonomic Outline of the Procaryotic Genera. Bergey's Manual of Systematic Bacteriology, Second Edition. Release 1.0, Apr 2001: 1-39. | EXP |
We should do:
Taxon[Attribute name] | Taxon[Attribute value] | Taxon[Attribute source] | Taxon[evidence] |
---|---|---|---|
width | 0.5-1.5 | Garrity, G.M., Winters, M. & Searles, D.B. 2001a. Taxonomic Outline of the Procaryotic Genera. Bergey's Manual of Systematic Bacteriology, Second Edition. Release 1.0, Apr 2001: 1-39. | EXP |
and as an additional set of columns:
Taxon[Attribute name] | Taxon[Attribute value] | Taxon[Attribute source] | Taxon[evidence] |
---|---|---|---|
length | 2-3 | Garrity, G.M., Winters, M. & Searles, D.B. 2001a. Taxonomic Outline of the Procaryotic Genera. Bergey's Manual of Systematic Bacteriology, Second Edition. Release 1.0, Apr 2001: 1-39. | EXP |
Does that make sense?
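For illustration, the reshaping described above could be sketched as follows. This is only a sketch under the assumption that each spreadsheet row is available as a dictionary keyed by column heading; `to_long_rows` is a hypothetical helper and the source citation is shortened here.

```python
def to_long_rows(row, value_columns=("width", "length")):
    """Turn one wide row into one row per attribute, with constant headings."""
    shared = {
        "Taxon[Attribute source]": row["Taxon[Attribute source]"],
        "Taxon[evidence]": row["Taxon[evidence]"],
    }
    long_rows = []
    for column in value_columns:
        value = row.get(f"Taxon[{column}]")
        if value:  # skip attributes that were left blank
            long_rows.append({
                "Taxon[Attribute name]": column,
                "Taxon[Attribute value]": value,
                **shared,
            })
    return long_rows


# The "size" row from the example above (citation abbreviated).
wide = {
    "Taxon[Attribute name]": "size",
    "Taxon[width]": "0.5-1.5",
    "Taxon[length]": "2-3",
    "Taxon[Attribute source]": "Garrity et al. 2001a",
    "Taxon[evidence]": "EXP",
}
for long_row in to_long_rows(wide):
    print(long_row)
```

Each output row carries the same four headings, so new attributes can be added without adding new columns.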
Gram stain (shouldn't have any notes)
I might be missing something. I'm still seeing entries like
76632 | Thermobacillus | Gram Stain | negative; phylogenetic position is Gram-positive |
---|
Hi @tosfos--thanks for the feedback. Sorry for the error in the messy entries. I corrected the Gram Stain ones you pointed out. I also fixed the length and width column attribute names. Thanks for the clarity.
We will be expanding this dataset in data entries, which we would ideally like to enter using the Wiki instead of this spreadsheet. As we expand, we will also want to add new attributes too. So, being able to add new entries and attributes is important.
Speaking of expanding, the dataset could easily approach 100,000s of entries. Will that be a problem? The updated sheet I am attaching here has many new entries (~10,000) for a new attribute (antimicrobial resistance).
@tosfos We have an update to the data model to include two new Attribute columns: Taxon[Attribute inferred probability] and Taxon[Attribute inheritance probability]. I am sharing the updated data model with the new columns here.
@lwaldron and @lgeistlinger discussed more about how to incorporate ASR functionality into the data model. We are looking for a more automated way (maybe through use of an API) to update the database, and I believe Levi is going to chime in to add more clarity.
But going forward, just to be clear, it is important to have the following features:
Thanks @kbeckenrode. I confirm those two columns, with Taxon[Attribute inferred probability] to be used when inferring attributes through Ancestral State Reconstruction (up the taxonomic hierarchy), and Taxon[Attribute inheritance probability] to record the probability of an attribute being inherited (down the taxonomic hierarchy). I think that how these will be calculated and used is still an open question, but this should provide enough flexibility. I would state those features like this:
Taxon[Attribute name] | Taxon[Attribute value] | Taxon[Attribute source] | Taxon[evidence] | Taxon[Attribute context] | Taxon[Attribute context source] | Taxon[Attribute inferred probability] | Taxon[Attribute inheritance probability] |
---|
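One way to picture a single attribute record with the columns listed above is the sketch below. The class name, field names, and the 0 to 1 range check are assumptions made for illustration; the thread leaves open how the two probabilities will be calculated and used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaxonAttribute:
    """One attribute annotation for a taxon (hypothetical record layout)."""
    name: str                                        # Taxon[Attribute name]
    value: str                                       # Taxon[Attribute value]
    source: str                                      # Taxon[Attribute source]
    evidence: str                                    # Taxon[evidence]
    context: Optional[str] = None                    # Taxon[Attribute context]
    context_source: Optional[str] = None             # Taxon[Attribute context source]
    inferred_probability: Optional[float] = None     # set when inferred via ASR
    inheritance_probability: Optional[float] = None  # probability of being inherited

    def __post_init__(self):
        # Assumption: both probabilities, when present, lie in [0, 1].
        for label in ("inferred_probability", "inheritance_probability"):
            p = getattr(self, label)
            if p is not None and not 0.0 <= p <= 1.0:
                raise ValueError(f"{label} must be between 0 and 1, got {p}")


# Illustrative values only; the probability is made up for the example.
example = TaxonAttribute(
    name="Gram stain",
    value="negative",
    source="hypothetical source",
    evidence="ASR",
    inferred_probability=0.8,
)
print(example)
```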
Speaking of expanding, the dataset could easily approach 100,000s of entries. Will that be a problem?
No.
Regarding:
Taxon [other IDs name] | Taxon [other ID] |
---|
I'm assuming we won't need to support many IDs. We should just combine them into a single column like:
Taxon [GenomeID] , similar to Taxon[NCBI ID]
it is important to have the following features: Bulk upload to the wiki
Will this be via spreadsheets? Will they be modifying existing records? Or only adding new ones?
Ability to add single/multiple entries on the wiki
What do you mean by "multiple entries"? How is that different from adding a single entry a bunch of times?
Add new attribute columns
OK.
programmatic editing for adding new attributes and updating existing attributes in bulk
I'm assuming this is the same as the "Bulk upload to the wiki" above. Will it be via spreadsheets or through an API?
Everything else sounds good.
I'm wondering a bit about what should be the "primary key" (page title) of the Taxon data structure. I think there was some discussion about this.
Normally I'd lean toward the NCBI ID, since it's numeric. But not every row has this filled in, and also it looks like we could be supporting multiple ID fields.
Right now each Taxon row has a name (Taxon[name]). Should that be the page title? Or is a taxon name not set in stone? Could it be given a different name by different people (in which case maybe we should add an Alias field)?
My note from March (when we were working with the Microbe Directory) says:
Column C is for the Page title, which will be important. It should come from either Column D (what NCBI calls this species) or Column E (what the Microbe Directory calls this species.)
Thoughts?
programmatic editing for adding new attributes and updating existing attributes in bulk
I'm assuming this is the same as the "Bulk upload to the wiki" above. Will it be via spreadsheets or through an API?
Yes, we're talking about the same thing. If this is feasible through the API that would seem ideal, if it could be done from anywhere and through an authenticated user account. A few example use cases:
So it seems like in general these do involve both adding and editing existing data. In both cases I would want to maintain versions / histories.
I'm wondering a bit about what should be the "primary key" (page title) of the Taxon data structure. I think there was some discussion about this.
Let me discuss with others this Weds/Thurs and get back to you, would like to get this right the first time.
I'm wondering a bit about what should be the "primary key" (page title) of the Taxon data structure. I think there was some discussion about this.
@tosfos following up on this, an idea that should cover all cases (from @seandavi) E.g. for row 37 in the above Bugphys_edit_20201013.xlsx:
NAMESPACE::ID.NCBI_TAXON::1082934
with alias:
NAMESPACE::ID.GenomeID::1082934.5
So wherever NCBI taxid is available we'll use that for the primary key with other IDs present as aliases, but we can also include taxa that do not have an NCBI taxid or multiple sub-species strains that all belong to the same NCBI taxid by using a different identifier as the primary key.
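A small sketch of this keying rule, under the assumption that each taxon comes with a mapping from ID type to identifier. The helper names are hypothetical, and "NAMESPACE" is kept as the literal placeholder used above.

```python
def make_key(id_type, identifier, namespace="NAMESPACE"):
    """Format an identifier in the NAMESPACE::ID.<type>::<id> style."""
    return f"{namespace}::ID.{id_type}::{identifier}"


def primary_key_and_aliases(ids, preferred="NCBI_TAXON"):
    """Pick the primary key and list the remaining IDs as aliases.

    ids maps ID type -> identifier. The preferred type is used as the
    primary key when present; otherwise the first listed type is used.
    """
    if not ids:
        raise ValueError("a taxon needs at least one identifier")
    primary_type = preferred if preferred in ids else next(iter(ids))
    primary = make_key(primary_type, ids[primary_type])
    aliases = [make_key(t, v) for t, v in ids.items() if t != primary_type]
    return primary, aliases


# The identifiers from the example above.
print(primary_key_and_aliases({"GenomeID": "1082934.5", "NCBI_TAXON": "1082934"}))
```

A taxon without an NCBI taxid would fall back to its first available identifier as the primary key.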
1. There is a new version of a source database that annotates new taxa, adds new information about taxa that were present in the previous version, and updates the information that was available in the previous version.
This is tricky. What if there's new information about a taxon, but that taxon was separately edited in the wiki and now contains correct content that is not present in the source database? If we overwrite the taxon's page, we will lose that info.
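One possible way to avoid silently overwriting wiki edits is a three-way comparison per attribute value. This is only a sketch of that general idea, not something already agreed on in this thread, and it assumes the previous version of the source database is kept so it can be compared against. `merge_attribute` is a hypothetical name.

```python
def merge_attribute(previous_source, new_source, wiki_value):
    """Return (value_to_store, needs_review) for one attribute value."""
    if wiki_value == previous_source:
        # The wiki still holds the old source value, so the update is safe.
        return new_source, False
    if new_source == previous_source or wiki_value == new_source:
        # The source did not change, or both sides agree: keep the wiki value.
        return wiki_value, False
    # The wiki and the source both changed and disagree: keep the wiki
    # value for now and flag the entry for a curator to review.
    return wiki_value, True


print(merge_attribute("negative", "positive", "negative"))
print(merge_attribute("negative", "positive", "variable"))
```

Only the last case needs human attention; the others can be applied automatically.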
2. There is an update to the NCBI taxonomy that changes the outputs of our ancestral state reconstruction and inheritance algorithms
We're "live" retrieving this information from NCBI and not storing it in the wiki itself. So this update already happens automatically in real-time.
3. After some manual curation of experimentally supported taxa attributes in the bugsigdb.org wiki, we want to download those and update our computationally predicted taxa attributes
I'm assuming this will be a CSV or similar. Shouldn't be an issue.
@tosfos following up on this, an idea that should cover all cases (from @seandavi) E.g. for row 37 in the above Bugphys_edit_20201013.xlsx:
NAMESPACE::ID.NCBI_TAXON::1082934
with alias:
NAMESPACE::ID.GenomeID::1082934.5
So wherever NCBI taxid is available we'll use that for the primary key with other IDs present as aliases, but we can also include taxa that do not have an NCBI taxid or multiple sub-species strains that all belong to the same NCBI taxid by using a different identifier as the primary key.
Sorry, I'm a bit confused about this. What happens if a taxon has neither an NCBI taxid nor a GenomeID? Or is that impossible because GenomeID is something we're making up as needed? And is the ".5" in the GenomeID representing the ID of a fifth sub-species? Would we then need sub-species of sub-species and add a decimal for each level?
@tosfos, it could be possible that we describe a taxon without either ID. NCBI taxon IDs are not the only identification standards available. GenomeID is a method of identifying strains of a species. For example, "Acinetobacter baumannii" is a genus and species name with an NCBI taxon ID of 470. In addition, many strains of the A. baumannii species have been identified with sequencing. For example, A. baumannii strain 182 has a GenomeID of 470.6501. I'm not sure how the decimal numbers are decided.
In addition, I'm adding this updated spreadsheet because the size attribute columns had an error. It is corrected here. Bugphys_edit_20201026.xlsx
GenomeID is made up as needed, and could be anything. This wasn't a good example because NAMESPACE::ID.GenomeID::1082934.5 is a child of NAMESPACE::ID.NCBI_TAXON::1082934, because it is a substrain of the species.
A better example of true aliases would be between alternative taxonomic systems, e.g. SILVA, RDP, Greengenes, NCBI and OTT (https://www.ncbi.nlm.nih.gov/pubmed/28361695 "SILVA, RDP, Greengenes, NCBI and OTT— how do these taxonomies compare?"). These other taxonomies provide alternative names and identifiers, but many are aliases of each other.
This makes me wonder if "parent" should be a part of the taxon data model, so that the hierarchy is built into the data model rather than existing somehow separately. Potentially then "rank" would also be a part of the data model. In this approach the wiki would not be specifically tied to any taxonomy, even if our curation is based primarily on the NCBI taxonomy. We would create an initial set of "stub" pages with IDs, names, aliases, and parents based on all those above taxonomies. This approach seems to contrast against the model in the above spreadsheet, which has only taxid and name, and relies on NCBI for the taxonomy. This merits some discussion with @tosfos and others. Ike, we have a group meeting with everyone (including Curtis Huttenhower from Harvard) on Thursday at 11:30am, would you be able to attend at least the start of that to try to finalize the data model?
Sorry I wasn't able to make this meeting. In general mornings are difficult for me. I would say that creating our own taxonomy definitely gives us more control but would require us to keep it up to date. And how would we know which taxonomy is correct? Wouldn't we just be checking the NCBI anyway?
No problem @tosfos. We've had a bunch of discussion in the meantime and are leaning towards keeping taxonomic information signature-centric, and having "phenotypic signatures" that are conceptually the same as the "experimental signatures" we currently have. Also I think you're right about just sticking to the NCBI taxonomy. Here are some editable slides summarizing this thinking, that would involve creating a "phenotypic signature" data model that is nearly the same as the current "experimental signature" model, and adding support for externally-calculated similarity scores to other signatures. Would be good to get your thoughts and discuss before implementing, but if you agree with the idea then I think we should be closer to finished than we thought at least in terms of data models. https://docs.google.com/presentation/d/1UJzfb5P505Swtd9jbs3oYNWFzpLsukqGBKHvERqxje8/edit?usp=sharing
Looks interesting. Can we schedule a call to discuss?
Yes - see if you can find a good time at calendly.com/lwaldron, otherwise I can free up more times.
Or, we have a regularly-scheduled group meeting Wednesdays at noon Eastern time if that's a good time for you.
We would like for users to be able to look up a specific Taxon and get links to the signatures where it exists in the database. I'm not sure whether this means just a search feature, or actual pages for taxa.
We would like for users to be able to look up a specific Taxon and get links to the signatures where it exists in the database. I'm not sure whether this means just a search feature, or actual pages for taxa.
It depends on if you're also storing data about the individual taxa. If we're storing data, they should get their own pages. If not, we can generate a query form with a result page.
Or, we have a regularly-scheduled group meeting Wednesdays at noon Eastern time if that's a good time for you.
How about Wednesday 11/25 at noon Eastern?
How about Wednesday 11/25 at noon Eastern?
That'd be great!
Our discussion of taxon information has changed significantly. The implementation we discussed yesterday will provide "fake" taxon pages, without a taxon data model, that display:
We estimate around 8 hours for this, including some styling. Should we go ahead?
Yes, you can go ahead.
After thinking about this some more, I should confirm something. We're not actually going to import the entire NCBI taxonomy, correct? All we're displaying is data that's already in the wiki because the wiki contains a signature with this NCBI entry (or a parent/grandparent of a signature's NCBI entry).
Is this correct?
That seems correct to me. The entire NCBI taxonomy would have to be pruned of non-microbial taxa, and I don't see much point in having pages for taxa that are neither contained in a signature nor a direct parent/child of one that is.
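To illustrate which taxa would get pages under this reading, here is a hedged sketch that collects every taxon appearing in a signature plus its ancestors. The `parent_of` mapping is a stand-in for whatever the live NCBI lookup returns, and the lineage shown is simplified.

```python
def taxa_with_pages(signature_taxids, parent_of):
    """Collect every taxon in a signature plus all of its ancestors.

    parent_of maps an NCBI taxid to its parent taxid (hypothetical input).
    """
    pages = set()
    for taxid in signature_taxids:
        current = taxid
        # Walk up the lineage, stopping at the root or at a known taxon.
        while current is not None and current not in pages:
            pages.add(current)
            current = parent_of.get(current)
    return pages


# Simplified lineage: Clostridiales -> Clostridia -> Firmicutes -> Bacteria.
parent_of = {186802: 186801, 186801: 1239, 1239: 2}
print(sorted(taxa_with_pages([186802], parent_of)))
```

Taxa that never appear in, or above, a signature's entries are simply never visited, so nothing from the wider NCBI taxonomy has to be imported.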
At some point in the future when we create phenotypic signatures, we could create signatures for things like "human gut microbiome", "human oral microbiome", "human microbiome", "environmental microbiomes", "all bacteria", "all archaea" etc that would bring in everything else if we wanted to.
We created an initial version of the Taxon "fake pages". You can see an example here. We're going to remove the query lookup (input field and "run query" button) from the top of that page soon (maybe before you read this) so that it will look like a true "fake page" instead of a query results page.
We're also going to modify Signature pages to link each NCBI item to a Taxon page. But it will be pretty busy then. I'd suggest that we remove the current NCBI link and maybe also the current hover action. We can hold off on that decision until we add the Taxon links to the Signature pages and see how it looks.
I feel like we can be much more intuitive with the Lineage display and also adding a query for immediate descendants of this Taxon. For a page like "Clostridiales", I'd suggest something like:
superkingdom Bacteria phylum Firmicutes class Clostridia
Order: Clostridiales NCBI ID: 186802
Family taxa in this Order: Lachnospiraceae, Peptostreptococcus, Ruminococcaceae
Signatures containing the Clostridiales taxon (186802)
Alternatively, we can move the Lineage to the right side of the page into an "infobox" and do something like
superkingdom Bacteria phylum Firmicutes class Clostridia Order Clostridiales
Family taxa in this Order: Lachnospiraceae, Peptostreptococcus, Ruminococcaceae
And finally, we can do a breadcrumbs-style lineage like:
Bacteria > Firmicutes > Clostridia > Clostridiales
Order: Clostridiales NCBI ID: 186802
Family taxa in this Order: Lachnospiraceae, Peptostreptococcus, Ruminococcaceae
Please let me know which of these 3 options you prefer. We can mock-up any or all of these if you'd like.
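For reference, the breadcrumbs-style option could be generated from an ordered lineage roughly like this. It is a sketch only: the function name and input shape are assumptions, and the real pages would be built from the wiki's query results.

```python
def render_taxon_header(lineage, ncbi_id, children, child_rank):
    """Render the breadcrumbs-style header as plain text.

    lineage: list of (rank, name) pairs from the root down to this taxon.
    children: names of the immediate descendant taxa.
    """
    rank, name = lineage[-1]
    lines = [
        " > ".join(taxon_name for _, taxon_name in lineage),
        f"{rank.capitalize()}: {name} NCBI ID: {ncbi_id}",
        f"{child_rank.capitalize()} taxa in this {rank.capitalize()}: "
        + ", ".join(sorted(children)),
    ]
    return "\n".join(lines)


lineage = [
    ("superkingdom", "Bacteria"),
    ("phylum", "Firmicutes"),
    ("class", "Clostridia"),
    ("order", "Clostridiales"),
]
children = ["Lachnospiraceae", "Peptostreptococcus", "Ruminococcaceae"]
print(render_taxon_header(lineage, 186802, children, "family"))
```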
I kind of like the breadcrumbs-style lineage, since it seems the most compact without losing any essential information.
A couple more improvements to the taxon "fake pages" that would be nice if practical:
We added the breadcrumbs-style lineage together with descendant queries. We now have a really nice drilldown-style display. Please see here.
We'll see what's involved in adding the 2 requested grouping features.
How do we begin annotating the microbial taxonomy with morphological and physiological properties? For example:
We would want to be able to annotate individual taxa or hierarchical clades of the taxonomy. This is the high-priority item.
Eventually we would want to be able to export and analyze these like signatures, at any level of the taxonomy. This is lower-priority.