waldronlab / BugSigDB

A microbial signatures database
https://bugsigdb.org

Quality control #156

Closed lgeistlinger closed 1 year ago

lgeistlinger commented 1 year ago

One reviewer raised the following point:

The one concern I had was with quality control. I noted that undergraduate students were the ones inputting the data, which is an excellent volunteer idea. However, I ran across one study that had very odd results. This paper on “The differential distribution of bacteria between cancerous and noncancerous material” included a lot of really bizarre organisms https://www.nature.com/articles/s41598-018-38031-2 I checked the entry on this paper https://bugsigdb.org/Study_199/Experiment_1 and it definitely matched the results in the paper, but the organisms cited in the study read like a list of contaminants not found in mammals. Groups like Aquificae and Thermaceae are normally found in hot springs or harsh environments. Also, groups like Crenarcheaota and Sphingobacteria – these are not human associated microbes.

A study like this should really be flagged as unusual in the system. Both the taxonomic level was highly unusual as was the taxonomic composition. I think the authors should consider developing a method for identifying studies like these that are so clearly different.

lgeistlinger commented 1 year ago

This is a tricky one and, after discussion with @lwaldron, it is not 100% clear how to best address this. The options include but are not limited to

  1. checking against published lists of contaminants, e.g., Eisenhofer et al., 2019,
  2. checking against body-site typical microbiome signatures compiled from curatedMetagenomicData, and
  3. a data-driven approach that checks against all signatures of the same species and body site in BugSigDB

I believe 3. would be the way to go. Maybe also @nsegata @chuttenh have thoughts on how to best address this?
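As a rough illustration of option 3, a data-driven check could compare each taxon against its prevalence among existing BugSigDB signatures for the same host species and body site. This is only a sketch; the function name, threshold, and example taxa below are hypothetical, not part of BugSigDB:

```python
from collections import Counter

def prevalence_flags(signature, reference_signatures, min_prevalence=0.05):
    """Flag taxa in `signature` that are rare across reference signatures.

    `reference_signatures` is a list of taxon sets drawn from BugSigDB
    signatures of the same host species and body site (assumed non-empty).
    A taxon seen in fewer than `min_prevalence` of them is flagged as
    unexpected for that context.
    """
    counts = Counter(t for sig in reference_signatures for t in set(sig))
    n = len(reference_signatures)
    return {t for t in signature if counts[t] / n < min_prevalence}
```

A taxon like "Aquificae" that never appears in the reference signatures for a body site would be flagged, while common members like "Bacteroides" would pass.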

lwaldron commented 1 year ago

Some of Wikipedia’s page “issues” tagging would be relevant for study-level issues: https://en.wikipedia.org/wiki/Wikipedia:Tagging_pages_for_problems

lwaldron commented 1 year ago

@tosfos could you advise on the effectiveness / feasibility of a couple possible complementary features?

  1. Wikipedia-style tags for studies (probably don't need to go down to signature level) to indicate that there are potential problems with the results. Maybe even just general, like "Quality issues raised"
  2. annotating individual taxa that appear in one or more lists that we maintain. For example, we could keep one list of common contaminants (although these would be normal to find on e.g. skin), and another list of definitely non human-associated taxa. These would show up as some kind of linked annotation to these taxa anywhere they appear in the wiki

My only concern with 3, @lgeistlinger, is that it would require regular updating, and would probably be ineffective for less-studied host species or body sites (such as the uterus tissues in the study identified by the reviewer). It still could be quite cool though, if it created a flag like "this is an unexpected signature" and linked to some more information about what is unexpected about it. This would likely be more informative for feces than e.g. for meconium.

tosfos commented 1 year ago

We'll review these options and make some recommendations. Is this on a tight timeline?

lwaldron commented 1 year ago

We'd like to submit our revisions in January - not too tight or too long.

tosfos commented 1 year ago

annotating individual taxa that appear in one or more lists that we maintain. For example, we could keep one list of common contaminants (although these would be normal to find on e.g. skin), and another list of definitely non human-associated taxa. These would show up as some kind of linked annotation to these taxa anywhere they appear in the wiki

I don't understand this. Can you please provide more information?

lwaldron commented 1 year ago

Sorry for the delay - I understand this idea to be to highlight some taxa any time they appear in a signature (with an icon, asterisk, or other symbol, or I could imagine even displaying their names differently to add annotations). For example we could make a list of taxa that do not live in human hosts, so that any time one of them appears in a signature, it is immediately visible as a probable error or contaminant.

The reviewer’s examples were good - “Groups like Aquificae and Thermaceae are normally found in hot springs or harsh environments. Also, groups like Crenarcheaota and Sphingobacteria – these are not human associated microbes.” Many curators, authors, and wiki users won’t recognize these out-of-place microbes, so if we can find a way to clearly highlight them, it will add value as a form of built-in quality control for results.

tosfos commented 1 year ago

For option 1, it makes sense to build on top of the existing Review system. We would create a dropdown with a few tags. When a user selects one of these tags, the Review state would be changed to "Incorrect" as of today's date. But the review would also store the tag, indicating the reason that a Review is now needed. The Study page would also show the tag when it is loaded. If that sounds good, please provide a list of tags.

For option 2, we can do that. The idea would be to impose tests on each taxon that appears within a Signature. If any of the tests fail, we would display some sort of icon indicating which test failed. Should that information also be included in the CSV export?

lwaldron commented 1 year ago

Sounds great @tosfos. I don't know a complete list of tags offhand, but here are a few potentially useful ones.

For your export question about 2, how about just providing separate CSV exports of the lists used for these tests? That way these are easy to check in their entirety and use in any way desired in downstream applications, and it doesn't create new exceptions for applications using the standard CSV exports. A couple of lists we could produce are:

lwaldron commented 1 year ago

@tosfos could you advise on the likely timeline for this? It's the last remaining item for review and we'd like to resubmit soon.

tosfos commented 1 year ago

Got it. We will put a rush on this. I added the priority label. I hope to have it done this week.

tosfos commented 1 year ago

Can you provide the CSVs for the tests we should be performing?

tosfos commented 1 year ago

We are already running lots of queries on each page, and we're concerned that running tests on every signature at page load could slow down performance. We can try it, but we also propose these alternatives:

  1. Allow users to manually trigger the tests by clicking a "Validate signatures" (or similar) button. This will bring up another page with the results of the tests.
  2. Create a script that will browse the wiki every so often. It will perform the tests required and will edit any page where tests fail so that it includes the warning.

Please let me know how to proceed.

lwaldron commented 1 year ago

Working on the lists - will they be administrator-editable so we can provide draft lists and update them? These two lists are quite different, in that:

  1. not a host-associated microbe - these are "red flags" anywhere they appear, and should be done by periodically browsing the wiki as in your option 2.
  2. common contaminant - these will appear everywhere, but it's context-dependent whether they are really present or a contaminant. E.g. skin-associated microbes in the stool are a contaminant, but on skin samples may be real or a contaminant. They could be periodically browsed in the same way as not host-associated, but we want to think about how to present them so as not to suggest that perfectly reasonable results are contamination. Maybe one option is just to annotate some common body site-associated microbes (e.g. feces and skin), and let not host-associated take care of the rest of contamination problems.

tosfos commented 1 year ago

Update on the Review system.

Here are some screenshots from our development server: [screenshots]

Some notes:

  1. Clicking on tags does not close the dropdown. That way, you can send several instructions to tag or untag. One tricky usability issue is that clicking the toggles doesn't actually apply the tags until clicking "Apply tags". This is much better on the technical side. But if users find this confusing, it may need some improvement.
  2. Adding a tag triggers an Incorrect status.
  3. The (Flag for review / Mark reviewed) button works as previously, but it was moved to the Quality control dropdown.
  4. Removal of a tag does not affect the status.
  5. The active tags are highlighted in the dropdown.
  6. The Review status is shown in the old place right on the Study page.

tosfos commented 1 year ago

will they be administrator-editable so we can provide draft lists and update them

Should be possible.

lwaldron commented 1 year ago

Screenshots of the updated review system look great, @tosfos! I'd say push into production and we'll see if usability issues arise, but it seems OK to me.

On the individual-taxon flags, we have been having a lot of offline discussion on the fact that what is a likely contaminant in one body site is a prevalent bacterium in another body site. Lists of common contaminants contain taxa that are present in most feces signatures but are probably not the result of contamination in that context. The simplest solution we've thought of is to provide several signatures, such as:

  1. prevalent in feces
  2. prevalent in oral cavity
  3. prevalent on skin
  4. common contaminant (excludes anything included in 1-3)
  5. not host-associated (made of exotic clades like thermophiles and things found in ocean water, plus a list of things that have so far been seen very rarely in BugSigDB)

This simpler system would flag taxa with at least one of these in almost every signature, and it would be up to the reader to interpret that in context.
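In code terms, this simpler system reduces to set membership. A sketch with hypothetical list contents, where list 4 excludes anything already covered by the body-site lists 1-3:

```python
def taxon_tags(taxon, feces, oral, skin, contaminants, not_host):
    """Return every flag that applies to `taxon` under the five-list scheme.

    Each argument after `taxon` is a set of taxon names (placeholder data;
    the real lists would be curated from sources like Study_562/Study_608).
    """
    tags = []
    if taxon in feces:
        tags.append("prevalent in feces")
    if taxon in oral:
        tags.append("prevalent in oral cavity")
    if taxon in skin:
        tags.append("prevalent on skin")
    # List 4 excludes anything already in the body-site lists 1-3.
    if taxon in contaminants and taxon not in (feces | oral | skin):
        tags.append("common contaminant")
    if taxon in not_host:
        tags.append("not host-associated")
    return tags
```

A taxon that is both a known contaminant and prevalent on skin would then only be tagged "prevalent on skin", leaving interpretation to the reader.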

A more complex alternative would be to have site-specific flags, so that common contaminants that are prevalent for a particular body site are not flagged in signatures of that body site. This would be tricky to define for rarely-studied or low microbial content body sites, and easier to define for the well-studied high microbial content sites. For this reason, and because wiki implementation would presumably be easier & quicker, I am leaning towards the simpler system. Thoughts, @lgeistlinger @tosfos ? I think @lgeistlinger or I can provide preliminary lists 1-5 shortly.

One question for @tosfos - for items 4-5 above, is it straightforward for you to include children of taxa we provide, or should we provide large lists including all child taxa?

lgeistlinger commented 1 year ago

I agree that the screenshots look great! I think denoting the status as "incorrect" is a little too harsh, and I think the suggested "Quality issues raised", or for brevity "Quality issues", would be better for denoting such a status.

Another thing is that we now start to depart from simply recording results, and start to get into the business of assessing study quality directly on bugsigdb.org. We are not necessarily making a lot of friends like that as nobody likes their study to be flagged in such a way. It's a little bit like we are applying our own review system after the study has already been peer-reviewed. I can see the reasons for doing so (and the reviewer wants us to do so), I just think we want to be careful with assigning these flags, and the difficulty of assigning clear-cut contaminant taxa speaks a little to how hairy this could get.

@lwaldron With regard to the prevalence signatures (1.-3.) - those are human-specific, right? Or would we want to apply them as is also to non-human studies in BugSigDB? I guess a contaminant in a human study is also a contaminant in a study on mice.

One question for @tosfos - for items 4-5 above, is it straightforward for you to include children of taxa we provide, or should we provide large lists including all child taxa?

And should we provide those as NCBI taxonomy IDs or simple taxonomic names or both?

tosfos commented 1 year ago

Let's use IDs please

tosfos commented 1 year ago

We decided to separate the "Review" status display and the "Quality issues" display. Conceptually, they note completely different status types. “Needs review” means that someone on the team needs to review the data entry to make sure it was done correctly. If a study has quality issues, it might have been entered completely accurately. The quality issues are with the Study itself, not in the way it was entered into the wiki. For example, “Results are suspect (various reasons)” means that the wiki’s editors looked at the original study and decided the results look suspicious. So nobody needs to review anything and the Study can stay as is.

This is still in progress, but here are updated screenshots. Is this a better approach? [screenshots]

lgeistlinger commented 1 year ago

Is this a better approach?

I think this makes a lot of sense.

lgeistlinger commented 1 year ago

@tosfos Here is a list of common contaminant genera for testing purposes.

This list has been derived from the OpenContaminant Blacklist using all genera with a score >= 1. We removed from this list all genera that were found with >= 50% prevalence in signatures of typical body site microbiomes of healthy children and adults as recorded in BugSigDB's Study_562 and Study_608.

Note also that all children of these genera in the NCBI taxonomy (i.e., species and strains) should be considered contaminants.

contaminants.csv
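For the record, the derivation just described could be sketched roughly as follows. The column names, thresholds as parameters, and example data are assumptions for illustration, not the actual pipeline:

```python
def derive_contaminants(blacklist_rows, bodysite_signatures, min_score=1, max_prev=0.5):
    """Derive a contaminant genus list.

    Keep blacklist genera with score >= `min_score`, then drop any genus
    found with >= `max_prev` prevalence across `bodysite_signatures`
    (sets of genera, e.g. as recorded in BugSigDB Study_562/Study_608;
    assumed non-empty). `blacklist_rows` are dicts with hypothetical
    "genus" and "score" keys.
    """
    candidates = {row["genus"] for row in blacklist_rows
                  if float(row["score"]) >= min_score}
    n = len(bodysite_signatures)
    prevalent = {g for g in candidates
                 if sum(g in sig for sig in bodysite_signatures) / n >= max_prev}
    return candidates - prevalent
```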

lwaldron commented 1 year ago

Is this a better approach?

I agree too.

@tosfos Here is a list of common contaminant genera for testing purposes.

Great - just to chime in that I hope this can be stored in a way that administrators can update it in the future - "common contaminants" can be a subjective thing to define and will probably need updating in the future.

tosfos commented 1 year ago

for items 4-5 above, is it straightforward for you to include children of taxa we provide, or should we provide large lists including all child taxa?

I see I missed this question. We'll try using just the parent IDs and see how it goes.

tosfos commented 1 year ago

There's been a lot of discussion and I'm a bit lost. Can you please clarify what we should do with this CSV? Will we be putting a warning everywhere one of the Genera (or its children) appears on BugSigDB? Where should the warning go? What should the text be?

Thanks.

lgeistlinger commented 1 year ago

Sorry for the confusion. In brief, the idea is to use this CSV to tag individual taxa contained in BugSigDB's signature pages (including their display on the study page).

[Screenshot: 2023-02-14, 2:00:49 PM]

Now, if any of the above 7 taxa is contained in the CSV (either directly, or as a descendant of a taxon in the CSV), we would like to flag this taxon as "potential contaminant". This could be done via a color code or a little icon on the side that displays "potential contaminant" on mouse-over.
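A minimal sketch of the descendant check (the parent map below stands in for an NCBI taxonomy lookup, and all IDs are hypothetical):

```python
def is_potential_contaminant(ncbi_id, contaminant_ids, parent):
    """Walk up the taxonomy: a taxon is flagged if it, or any ancestor,
    appears in the contaminant list.

    `parent` maps child NCBI ID -> parent NCBI ID (a stand-in for a real
    NCBI taxonomy lookup); `contaminant_ids` is the set of listed IDs.
    """
    current = ncbi_id
    while current is not None:
        if current in contaminant_ids:
            return True
        current = parent.get(current)
    return False
```

With a hypothetical species 1301 under genus 1300, listing 1300 would flag both the genus and the species.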

+++

That is the start. Going forward we are going to provide additional CSV files to indicate different taxon flags (out of "potential contaminant", "not host-associated", and "body-site typical"). I am wondering whether adding another, multi-valued column to the signature view, which currently contains the NCBI IDs and the Links, would be a good solution.

[Screenshot: 2023-02-14, 2:00:33 PM]

The third column here would then be named "Quality Control" and would have for each taxon either zero, one, or more of the tags "potential contaminant", "not host-associated", and "body-site typical" (with each of them having a corresponding icon and displaying the tag upon mouse-over).

tosfos commented 1 year ago

Thanks! That is very helpful. We'll see what we can come up with.

The Study-level quality control is undergoing technical review. We expect to apply it to production this week.

lwaldron commented 1 year ago

I was thinking about two options that you gave before:

  1. Allow users to manually trigger the tests by clicking a "Validate signatures" (or similar) button. This will bring up another page with the results of the tests.
  2. Create a script that will browse the wiki every so often. It will perform the tests required and will edit any page where tests fail so that it includes the warning.

Honestly both of these would be great - option 1 allows a curator entering a signature to do a sanity check while they wait, and option 2 makes sure that the tests get applied site-wide and all signatures get the required annotations. But of the two, option 2 is the more important use case. Note that tests only need to be applied when a signature is added or changed, or in what should be the very occasional instance when the taxa in the tests are changed. I'd also note that this will be something of a "killer app" - there's no other way to easily get this kind of quality check on signatures, and annotating so much of the literature with it will be really interesting.

tosfos commented 1 year ago

Due to the recent performance improvements, we're actually going to try the previous plan of performing these tests in real-time at every page load. We believe these can be done without a significant performance decrease. But we'll fall back to option 2 if needed.

tosfos commented 1 year ago

The Study-level quality control is undergoing technical review. We expect to apply it to production this week.

This was applied to production but we still need to tweak things and run some scripts to get this feature working. You may see degraded performance today.

tosfos commented 1 year ago

We have a draft version of the taxon-level tests in our development environment. Here are some notes:

  1. We're doing run-time tests on the taxa. Performance seems OK.
  2. Instead of importing the CSV, we just used column B (the NCBI ID) from the file, which is all we really need. Since it was only one column, we decided to simply store the values in a wiki page. This will allow your administrators to modify the list easily. We can create additional pages for each list. A screenshot is below.
  3. We're showing the NCBI ID in the tooltip. We can remove it if that is not helpful.

[screenshots]

tosfos commented 1 year ago

As you can see, for now, we have a simple icon instead of a separate column. If we switch to the shorter "potential contaminant" text, we might be able to just show a single icon with (where needed) multiple issues. If we switch to an additional column, we can show different types of icons for each issue (I assume we won't have dozens of lists), or a check for taxa that pass all tests. We don't need to make a decision on this until additional lists are ready.

Please let me know once this is ready for production.

lgeistlinger commented 1 year ago

It looks great! Does this already include all descendants of these NCBI IDs in the NCBI Taxonomy? And would updates to the ContaminationList automatically also add descendants? That is, if an admin adds the NCBI ID of a genus to the contamination list, would the NCBI IDs of all species annotated to that genus automatically be added to the ContaminationList upon clicking "Save Changes"?

tosfos commented 1 year ago

Yes, that should all work, though I can't say it's fully tested yet.

Note that we're not actually querying for and storing all descendants of a genus. We're only using the information contained in the site. But by definition, if a species exists in the site, we already know which genus it belongs to. So if you add the genus ID, all species in the site will be tagged.
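As a sketch of that idea (with hypothetical data), the tagging only needs the child-to-genus relationships the site already stores, with no external taxonomy query:

```python
def tagged_taxa(site_taxa, list_ids):
    """Tag in-site taxa using only information the site already records.

    `site_taxa` maps each taxon ID known to the site to the genus ID the
    site records for it (a genus maps to itself). Any taxon whose own ID
    or recorded genus ID appears in `list_ids` is tagged.
    """
    return {tid for tid, genus_id in site_taxa.items()
            if tid in list_ids or genus_id in list_ids}
```

So adding a genus ID to the list tags every species the site already associates with that genus, without storing the full descendant set.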

Should we apply it to production?

lwaldron commented 1 year ago

Should we apply it to production?

I'd say go for it! Stakes are low for the next little while before we resubmit the manuscript, and it would be great to see it in practice, and implement all our lists of taxa to be tagged.

tosfos commented 1 year ago

This was applied to production. You can see an example here.

The list can be edited here.

tosfos commented 1 year ago

We're going to add a help page here for documentation. Are there additional lists we should use for the taxon-level alerts? If not, this can probably be closed for now.

lgeistlinger commented 1 year ago

Cool, thanks! Yes there are two additional lists:

  1. the list of not host-associated taxa (the comment lines starting with a hashtag can be ignored): not-host.csv

  2. the list of body-site typical taxa: would it be possible to directly take them from https://bugsigdb.org/Study_562 and https://bugsigdb.org/Study_608?

In both cases, we would like to also include all descendants of these taxa in the NCBI taxonomy. For the list of body-site typical taxa, these are given by body site in the above Study pages. So as @lwaldron suggests, we would not throw them all in one pot, but rather display a message "prevalent in [body site]" depending on whether this taxon (or a descendant of it) is included in the signatures for that body site.

(This would then also mean that we would like updates to https://bugsigdb.org/Study_562 and https://bugsigdb.org/Study_608 to propagate to the QC flags).

tosfos commented 1 year ago

List 1 seems very similar to what we did for the first list.

List 2 is definitely complicated but I think we can do that.

tosfos commented 1 year ago

To confirm that I understand this, for list 1, any ID listed there should trigger an alert any time it shows up in the db. Is that correct? What should be the text of the alert?

I should mention that since this is based on taxa rarely seen in BugSigDB, we should be able to dynamically calculate this list instead of using manual curation. Would that be preferred?

tosfos commented 1 year ago

For list 2, we'll check each Experiment in the database and check the associated body site. We'll then check the 2 Study pages you mentioned for a matching body site. If any of this Experiment's taxa are listed in the matching body site of these 2 pages, that would trigger an alert for that taxon.

For example, in Study 637, Experiment 1 has a body site of "Oral cavity". Signature 1 in that experiment includes "Streptococcus". Since "Streptococcus" is listed in Study 608 Experiment 1 (which also has a body site of "Oral cavity"), Signature 1, we should show an alert next to "Streptococcus" in Study 637, Experiment 1, Signature 1 that says "prevalent in Oral cavity".

Is that correct?

lgeistlinger commented 1 year ago

List 1 seems very similar to what we did for the first list.

Yes basically the same functionality.

To confirm that I understand this, for list 1, any ID listed there should trigger an alert any time it shows up in the db. Is that correct?

Yes. (And also for all descendants of that ID in the NCBI Taxonomy)

What should be the text of the alert?

"not host-associated"

I should mention that since this is based on taxa rarely seen in BugSigDB, we should be able to dynamically calculate this list instead of using manual curation. Would that be preferred?

That has been discussed and we actually came to the conclusion that "rarely in BugSigDB" would be yet another category. The semantics are indeed somewhat different: (i) "not host-associated" - these are taxa from exotic clades like thermophiles and things found in ocean water, whereas (ii) "rarely in BugSigDB" - these could be taxa like certain fungi which are certainly host-associated, but are rarely in BugSigDB, simply because we haven't curated many studies investigating fungal constituents of the microbiome (= the mycobiome).

lgeistlinger commented 1 year ago

For list 2, we'll check each Experiment in the database and check the associated body site. We'll then check the 2 Study pages you mentioned for a matching body site. If any of this Experiment's taxa are listed in the matching body site of these 2 pages, that would trigger an alert for that taxon.

I think we were thinking of a simpler solution:

For list 2, we'll check each Experiment in the database irrespective of the associated body site. We'll then check the 2 Study pages. If any of this Experiment's taxa are listed in one or more body sites of these 2 pages, that would trigger one or more alerts for that taxon.

That means an experiment investigating e.g. "feces" would still trigger an alert if it contains a taxon that is prevalent in e.g. "skin" (according to https://bugsigdb.org/Study_562 and https://bugsigdb.org/Study_608). Note that a taxon might trigger more than one prevalence alert if it is contained in more than one body site of these two studies. Maybe these alerts can be collected into a single alert of the form "prevalent in skin, oral cavity, and feces" for a taxon that is listed as prevalent in all three body sites.
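A minimal sketch of this body-site-independent check, collecting multiple matches into a single message (the body-site signature contents below are placeholders, not the actual Study_562/Study_608 data):

```python
def prevalence_alert(taxon, site_signatures):
    """Return one combined prevalence alert for `taxon`, or None.

    `site_signatures` maps a body site name to its set of prevalent taxa
    (e.g. as recorded in Study_562/Study_608). The check ignores the
    experiment's own body site: all sites are tested for every taxon.
    """
    sites = [site for site, taxa in site_signatures.items() if taxon in taxa]
    if not sites:
        return None
    if len(sites) == 1:
        return f"prevalent in {sites[0]}"
    return "prevalent in " + ", ".join(sites[:-1]) + " and " + sites[-1]
```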

@lwaldron can weigh in if I am mistaken.

lwaldron commented 1 year ago

Agreed - all tests can be applied regardless of body site or host species.

I am thinking that we should be consistent about how the lists are stored - not keeping some as special lists like we have used for autocomplete values while others are BugSigDB Signatures. Could we use orphaned signatures, or a somewhat ad-hoc Study with Experiments and Signatures for the other lists, so that all of them can be edited through the signature form? This would have the advantages of uniform display and editing methods, and automatic linking to wherever else these taxa exist in the database. I can imagine that at some point in the future, as the wiki is used for more host species, we may want to add more lists of taxa to be flagged, and adding new experiments to an existing study would be a clean way to organize those on our side.

Also, small point but I would say https://bugsigdb.org/Study_562 (adults) is sufficient, and that we don't need to flag signatures from https://bugsigdb.org/Study_608 (children).

tosfos commented 1 year ago

Could we use orphaned signatures

I wouldn't want to introduce this concept, since that will show up as needing cleanup.

a somewhat ad-hoc Study with Experiments and Signatures for the other lists so that all can be edited through the signature form

This seems very hacky. There is no actual experiment being performed, and all of the experiment fields will need to be blank. In some ways it will be a similar editing experience to what we have for Signatures, but in other ways it seems confusing. I'd recommend that we create a simple template & form for lists of signatures that can also leverage autocomplete. Would that work?

lwaldron commented 1 year ago

Yes, that makes sense and would work. The data entry and display features for signatures are great, and it would be nice to take advantage of them.

tosfos commented 1 year ago

We're having trouble figuring out a good solution for a form for the Signature lists that won't degrade performance. I'll update when I have more information.

lgeistlinger commented 1 year ago

Hi @tosfos - I know there has been a flood of requests due to the Outreachy initiative and many new curators, but I just wanted to check whether there is any update on this one. I think @lwaldron agrees that this is still our highest-priority request, given that it is the last remaining item for resubmitting a revised version of the manuscript and it is getting about time for us to resubmit.

tosfos commented 1 year ago

We discussed a few options for how we can create the form for the list entry. We concluded that the best option is to query the NCBI in real time. I don't think that can be completed on a tight deadline. Maybe we should push off this feature in favor of getting this done quickly.

Which portion of this is required as highest priority? Adding the 2 additional lists? If so, should we implement it the quickest way and keep in mind that we may want to redo it later?

lgeistlinger commented 1 year ago

Maybe we should push off this feature in favor of getting this done quickly.

This sounds like a good idea to me.

Which portion of this is required as highest priority? Adding the 2 additional lists?

Yes. In agreement with the reviewer's comment, we would like, at a minimum, to showcase the feature on the particular study/experiment/signatures that the reviewer points out. This is Study 199 / Experiment 1.

If we can arrive at a prototype that highlights the taxa in here that are, according to our lists, contaminants or not host-associated, we can show the reviewer that we have a plan and a prototype implementation - and we can then explain why the full implementation will require more time than the revision period allows.

If so, should we implement it the quickest way and keep in mind that we may want to redo it later?

This sounds like a good idea to me.