dbBact's wordclouds for Qiita

sjanssen2 commented 8 months ago

In my collaborations, I often encounter situations where PhD candidates are volunteered by their PIs to also handle amplicon analysis, but are total microbiome newbies. Situations might be complicated, because others might have done the sample collection, sequencing was outsourced, ... Without any experience, they now have to sanity check whether the sequencing was successful ... and PIs all too often love to take shortcuts, like expired flowcells, excessive multiplexing, ... Sanity checking is extremely hard, if not impossible, without having seen many OKish datasets.

I was recently contacted by Amnon and colleagues, who finally published dbBact. In a nutshell: they collect expert knowledge for individual ASV sequences. I found their wordclouds (i.e. enriched terms of ASVs in a feature-table, or more precisely: rep set) extremely helpful to characterize a sequencing run / prep without relying on the metadata, which are sadly too often wrong or incomplete. It is quite easy to see if a prep holds samples from e.g. mouse or soil or ....

Therefore, I'd like to integrate these wordclouds into Qiita and wonder what the best strategy is. Here are my thoughts:

- In principle, I assume minimal knowledge / experience of the user and therefore intend to present these images prominently, without much action required by the user.
- dbBact is a database and will change over time; how can we ensure reproducibility?
- Amnon already created an API endpoint to which we can post a set of ASV sequences and receive F-scores for the terms that make up the word cloud. This is handled by their server. Is it OK to rely on this external resource, or would it be better to have a DB dump or flatfile in a plugin?

https://dbbact.org/sequences_fscores

To use it, just supply the JSON parameter 'sequences', which is a list of the sequences (ACGT strings that start from one of the supported dbBact regions).

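Example:
import requests
# ASV sequences, each starting at one of the regions supported by dbBact
seqs = ['TACGGAGGGTGCAAGCGTTGTCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCTTTTTAAGTCTGGGGTGAAAGCCCGTTGCTCAACAACGGAACTGCCCTGGAAACTGGAGAGCTTGAGTACAGACGAGGGTGGCGGAATGGACGG']
# send the sequences as the JSON body of a GET request
res = requests.get('https://dbbact.org/sequences_fscores', json={'sequences': seqs})
print(res.json())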
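
The F-scores returned by the endpoint still have to be turned into word sizes before a cloud can be drawn. The following is only a minimal sketch of that step. It assumes the response can be reduced to a plain mapping of term to F-score (the actual JSON layout is not shown in this thread), and the function name, the size range and the example values are all hypothetical.

```python
def fscores_to_font_sizes(fscores, min_size=10, max_size=48, top_n=20):
    """Linearly map term F-scores to font sizes (hypothetical helper).

    `fscores` is assumed to be a dict of term -> F-score; the real
    dbBact response may need to be unpacked into this shape first.
    """
    # keep only the top_n highest-scoring terms
    top = sorted(fscores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    if not top:
        return {}
    highest = top[0][1]
    lowest = top[-1][1]
    span = highest - lowest
    sizes = {}
    for term, score in top:
        # if all kept terms share one score, draw them all at max_size
        frac = 1.0 if span == 0 else (score - lowest) / span
        sizes[term] = round(min_size + frac * (max_size - min_size))
    return sizes


# made-up values for illustration only, not real dbBact output
example = {"feces": 0.8, "mouse": 0.6, "soil": 0.1}
sizes = fscores_to_font_sizes(example)
print(sizes)
```

Linear scaling is just one choice; a logarithmic or rank-based scaling might read better when a few terms dominate.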

- We could extend the qp-deblur plugin to also produce wordclouds, but I am rather hesitant, as they are not really a processing result.
- If we implement the word-clouds as a new plugin, how can we ensure that users will actually perform this action? Maybe as part of a default workflow?
- Is it worth caching the generated images, or would on-demand API calls suffice?
- Where in Qiita should the results be presented? Here are three suggestions:
  1. Show a word cloud prominently on the study summary page for every 16S/18S prep that has been processed with deblur.
  2. Show one word cloud below every 16S/18S prep that has been processed with deblur.
  3. Stick strictly to the plugin architecture and show it as the visualization summary for the corresponding artifact.
- How would we handle database updates?
  - Deprecate the existing word clouds and automatically generate new ones?
  - Present a sync option to the user to manually trigger updates?
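
The questions about caching, reproducibility and database updates could be handled together by storing each result alongside the database version it came from. The sketch below is an assumption-laden illustration, not an existing Qiita or dbBact feature: `FscoreCache`, `cache_key`, `fetch` and `current_version` are hypothetical names, and it is not known here whether dbBact exposes a version identifier (a retrieval date could stand in if it does not).

```python
import hashlib
import time


def cache_key(sequences, db_version):
    """Build an order-independent key for a set of ASV sequences."""
    # deduplicate, normalise case and sort so the same set of
    # sequences always yields the same key
    canonical = "\n".join(sorted({s.upper() for s in sequences}))
    digest = hashlib.sha256(canonical.encode("ascii")).hexdigest()
    return f"{db_version}:{digest}"


class FscoreCache:
    """In-memory cache of F-score results (hypothetical sketch)."""

    def __init__(self, fetch, current_version):
        # fetch: callable(sequences) -> result, e.g. a wrapper around the API
        # current_version: callable() -> str, a hypothetical DB version label
        self._fetch = fetch
        self._current_version = current_version
        self._store = {}

    def get(self, sequences, refresh=False):
        version = self._current_version()
        key = cache_key(sequences, version)
        # a new database version gives a new key, so results are
        # regenerated while older entries stay available
        if refresh or key not in self._store:
            self._store[key] = {
                "db_version": version,
                "fetched_at": time.time(),
                "result": self._fetch(sequences),
            }
        return self._store[key]
```

With this layout, the "sync" option would correspond to calling `get(..., refresh=True)`, and automatic regeneration falls out of the version being part of the key.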

I'd be happy to know your opinion @antgonza before I start implementing. Thank you!

antgonza commented 8 months ago

Hi @sjanssen2, thank you for your question - this is an interesting one!

First, would you consider this a processing or an analytical tool? By processing I mean transforming (raw) sequences into feature tables; by analytical I mean something that you apply to the feature table downstream in your analytical pipeline.

Based on the description provided here, I think it is more analytical, as it gets applied to the deblur results (mainly a feature table); what do you think? If you agree, then this plugin should be part of the analysis and not of the processing, which is where the 3 options you present place it.

All analyses are done via QIIME 2, so if you would like to add it, you would need to have a q2 plugin and then add it to the Qiita QIIME 2 plugin, like this. In case it helps, here are a couple of examples of how to add a QIIME 2 plugin to the Qiita QIIME 2 plugin:

- umap: this plugin's parameters followed all basic naming conventions for its inputs/parameters/outputs, so nothing had to change (we just added the plugin); also, due to the input type, it was possible to add an actual test.
- mislabeled: this added a new type requirement, so the changes are a bit more extensive; however, we couldn't add an actual test for the method.

If this is the route you prefer, your plugin could simply "summarize" a feature table via dbBact (i.e. create a qzv).
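
For orientation, a QIIME 2 visualizer is a Python function that receives an `output_dir` and writes an `index.html` into it, which then ends up inside the qzv. The sketch below shows only that core step, using the standard library. It leaves out the plugin registration and the feature-table input type, and `summarize_terms` and `term_sizes` are hypothetical stand-ins for whatever the real plugin would compute from the feature table via dbBact.

```python
import html
import os


def summarize_terms(output_dir: str, term_sizes: dict) -> None:
    """Write a very plain HTML word cloud to output_dir/index.html.

    term_sizes is a hypothetical dict of term -> font size in pixels;
    a real q2 visualizer would take a registered artifact type instead.
    """
    spans = []
    # largest terms first so the most enriched ones lead the page
    for term, size in sorted(term_sizes.items(),
                             key=lambda kv: kv[1], reverse=True):
        spans.append(
            f'<span style="font-size:{int(size)}px">'
            f'{html.escape(term)}</span>'
        )
    page = (
        "<html><body><h1>dbBact terms</h1><p>"
        + " ".join(spans)
        + "</p></body></html>"
    )
    with open(os.path.join(output_dir, "index.html"), "w",
              encoding="utf-8") as fh:
        fh.write(page)
```

A real implementation would presumably render an image rather than sized text, but the contract of writing into `output_dir` stays the same.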

Hope this helps.

antgonza commented 5 months ago

Closing for now, please reopen if you have further questions.

qiita-spots / qiita

dbBact's wordclouds for Qiita #3380