Closed antagomir closed 2 months ago
Sure. Can you prepare a PR?
What would be the output type? From looking at it right now, a small data.frame would probably be best.
A data.frame would be best I think, too.
We can prepare PR later. We could start prioritizing issues because there are too many PRs to do at the same time.
I think this should be added to `getTopTaxa`, since this is a summarizing function as well. Maybe the man page needs to be reworked to reflect the intention of the shown functions better, but the foundation for summarizing data is already available.
@microsud
@antagomir @microsud any new insights on this? I think a set of summarization functions should be most applicable. `summarizeTaxa` would describe the idea behind the three linked issues the best.
Do you agree that `getDominantTaxa` can be reworked to `summarizeDominantTaxa`? `summarizeTaxa` could reside on the same man page as well.
I am OK with this at least. It is just useful to have 1) an easy way to pick dominant or "top" taxa (various criteria are possible, considering abundance or prevalence features); and 2) concise summaries, as implemented earlier by @microsud in the phyloseq context.
I will work on this in the coming days, was avoiding updating the R version on my system :P
Some comments:
The `getDominantTaxa` function, parts of which are inspired by `microbiomeutilities::dominant_taxa`, can be renamed to `summarizeDominantTaxa`, as it returns a tibble with an overview. On the other hand, `dominantTaxa` adds a column to `colData`, and therefore I think this should be renamed to `getDominantTaxa` / `addDominantTaxa`. What are your thoughts? @tvborm because you wrote this function, do you have time to make these changes?
`getTopTaxa` technically returns the most dominant/abundant taxa based on all samples using the option `sum`. The `dominantTaxa` function works per sample, which may not be a useful option to add to `getTopTaxa`.
Regarding summarizing the TSE: `print(tse)` already returns a good amount of information. I have some basic ideas for adding relevant information to `summarizeTSE`, which I will test and let you know.
I am thinking of how best to make summaries. Only raw `counts` should be allowed because that is what is meaningful.
2a. Per sample information
2b. Per feature information
IMO this is information that one would by default need before starting an analysis. Also, returning a tibble means one can directly save these tables during the analysis using basic `write` functions, or make plots like below:
Let me know your thoughts.
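To make the 2a/2b idea concrete, here is a minimal base R sketch on a toy count matrix. The column names (`total_counts`, `observed_features`, `n_samples_detected`) and the choice of metrics are assumptions for illustration only, not an existing mia interface, and a plain matrix stands in for the assay of a TSE object.

```r
# Toy raw count matrix: rows are features (taxa), columns are samples.
counts <- matrix(
  c(10, 0, 5,
     0, 3, 2,
     7, 7, 0,
     1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("taxon", 1:4), paste0("sample", 1:3))
)

# 2a. Per-sample information: library size and number of observed features.
per_sample <- data.frame(
  sample = colnames(counts),
  total_counts = colSums(counts),
  observed_features = colSums(counts > 0),
  row.names = NULL
)

# 2b. Per-feature information: total counts and number of samples
# in which the feature was detected.
per_feature <- data.frame(
  feature = rownames(counts),
  total_counts = rowSums(counts),
  n_samples_detected = rowSums(counts > 0),
  row.names = NULL
)

# The tables can be saved with basic write functions ...
write.csv(per_sample, file.path(tempdir(), "per_sample_summary.csv"),
          row.names = FALSE)

# ... or plotted directly.
barplot(per_sample$total_counts, names.arg = per_sample$sample,
        ylab = "Total counts")
```

Because both results are plain tables, they can be passed to any plotting or export function without further conversion.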
Seems good to me at least for now. We can later add more if this turns out to be necessary.
In principle, the summary could also summarize taxa & metadata features in the same style as `glimpse()` does, but perhaps `glimpse(colData())` or `glimpse(rowData())` is already enough.
> The `getDominantTaxa` function, parts of which are inspired by `microbiomeutilities::dominant_taxa`, can be renamed to `summarizeDominantTaxa`, as it returns a tibble with an overview. On the other hand, `dominantTaxa` adds a column to `colData`, and therefore I think this should be renamed to `getDominantTaxa` / `addDominantTaxa`. What are your thoughts? @tvborm because you wrote this function, do you have time to make these changes?
Yep, sounds good, I can make those changes
Regarding the sample-wise/feature-wise summarization: please have a look at `addPerCellQC` / `addPerFeatureQC` and check which values are missing from that output. If it is just a few, one could use the output of those functions as a basis and add additional ones.
It might also be a good idea to follow the nomenclature. `add*` returns a changed input with values added to the appropriate metadata dimension (`colData` / `rowData`). However, normally those functions are just wrappers around the actual workhorse (in the example above this would be `perFeatureQCMetrics`; for our purposes this would be a function named `dominantTaxa` for the wrapper `addDominantTaxa`).
With the data added to the `rowData` / `colData`, plotting is then straightforward using `plotRowData` / `plotColData`.
Just as a thought: maybe this could all be tied together via a single man page, maybe under the name `qc-functions`.
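The workhorse/wrapper split described above could be sketched roughly as follows. Everything here is hypothetical: the function names `dominant_taxa_sketch` and `add_dominant_taxa_sketch`, the `name` argument, and the plain list with `counts` and `col_data` elements standing in for a (Tree)SE object and its `colData`. It only illustrates the pattern, not how mia or scuttle implement it.

```r
# Hypothetical workhorse: name of the most abundant feature in each sample.
# Ties go to the first feature in row order (behaviour of which.max()).
dominant_taxa_sketch <- function(counts) {
  idx <- apply(counts, 2, which.max)
  setNames(rownames(counts)[idx], colnames(counts))
}

# Hypothetical add* wrapper: returns the input with the result stored in the
# sample metadata. A plain list stands in for a (Tree)SE object here.
add_dominant_taxa_sketch <- function(x, name = "dominant_taxa") {
  dominant <- dominant_taxa_sketch(x$counts)
  x$col_data[[name]] <- unname(dominant[rownames(x$col_data)])
  x
}

counts <- matrix(
  c(10, 0, 5,
     0, 3, 2,
     7, 7, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("taxonA", "taxonB", "taxonC"), c("s1", "s2", "s3"))
)
x <- list(
  counts = counts,
  col_data = data.frame(group = c("a", "b", "a"), row.names = colnames(counts))
)

# The wrapper returns the changed input; the workhorse stays usable on its own.
x <- add_dominant_taxa_sketch(x)
x$col_data
```

Keeping the computation in the workhorse means the same result can be returned as a vector, added to the metadata, or summarized further without duplicating logic.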
So, if I understood correctly:
1. There should be additional R/Rmd files (`qc-functions`) for summarization functions. `summarizeDominantTaxa` (currently `getDominantTaxa`) should be moved there.
2. `getDominantTaxa.R` should include
   - `dominantTaxa`
   - `addDominantTaxa`, which works as a wrapper for `dominantTaxa`

I'm just wondering if I can modify that `getDominantTaxa` and move this a little bit forward.
Yes I think it is ok to do this way.
I am working on general-purpose summarization functions. You can start by making the `summarizeDominantTaxa` function in `summaries.R`; after that I will add `summarizeSE` to that same file. These are basic summaries.
Then there are QC-related functions such as `perFeature` and `perSample` metrics, which I will add to `qc-functions.R`.
@FelixErnst and @antagomir okay with this plan?
Ok to me for now. Good to move forward.
Sounds good. Feel free to make adjustments when you encounter better options. We can discuss it in the subsequent PR.
@microsud do you want to continue with microbiome specific metric functions?
Yes, as soon as I find time. Can we keep this open?
Any news on this?
@microsud any chance to have a look before the October release?
I think this is available as `summary()` in mia.
Right, there has been good progress.
Several points that I picked from the above discussion are not there yet. After all this discussion, I think it would be good to make an informed decision on whether these are ignored or should still be included.
The list from above contains at least the following points:
- Moving the summaries into `qc_functions.R`
- Summary functions perhaps separately for the 1) full object (SE + TreeSE); 2) colData; 3) rowData; 4) other components, or is a full data summary enough? -> Also compare to the scater QC functions `addPerCellQC` / `addPerFeatureQC`. Use the `add*` nomenclature where relevant (e.g. `dominantTaxa` vs. `addDominantTaxa`)
- Binding the man pages of the summary functions together
- Ways to pick "dominant" taxa (this seems to be now implemented in `dominantTaxa.R`) -> OK?
I think these are good points, if they have been addressed then let us make a note or decision here.
If you notice something else from above or otherwise kindly add.
I had a look at `perFeatureQCMetrics` and `perCellQCMetrics` again. These are now part of the `scuttle` package as utilities for `SingleCellExperiment` and are used by `mia`. We now provide a method for prevalence as a separate function. Initially, I thought adding a coefficient of variation per feature would be useful. However, it is hardly used in the analysis, so I am hesitant to add a new QC function on top of `perFeatureQCMetrics`. The CV calculation is a straightforward one-liner and can be shown in OMA if we find a use case for it. IMO the summaries in the summary file make sense at the moment because none of them are QC functions per se. The rationale behind two separate files for dominant taxa is their usage (I think). @TuomasBorman may have better insight here.
Cheers, Sudarshan
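For reference, the CV one-liner mentioned above could look roughly like this in base R. This is only a sketch on a toy matrix; whether it should be computed on raw counts or on transformed abundances is an open assumption here, and features with a mean of zero would give `NaN`.

```r
# Toy matrix: rows are features (taxa), columns are samples.
counts <- matrix(
  c(2, 4, 6,
    5, 5, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("taxonA", "taxonB"), paste0("s", 1:3))
)

# Coefficient of variation per feature: standard deviation divided by the mean.
cv <- apply(counts, 1, sd) / rowMeans(counts)
cv
```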
Two separate files:
Thanks a lot! Update:
-> Suggestion: No changes needed, case closed.
`scuttle::per*QCMetrics`). For clarity I think it would be justified to use distinct summary functions for microbiome data. These can be easily extended when needed, naming can be more intuitive, and these should accept SE but also support TreeSE so that we can later include summaries for tree information.
a) full data object (Tree)SE
We have now: `summary`
-> OK as such I guess, no action points.
b) colData (similar to `perCellQCMetrics`)
-> We could have `perSampleSummary` / `perColumnSummary` / `colSummary` or similar.
-> But perhaps the current `summary` is sufficient as is?
c) rowData (similar to `perFeatureQCMetrics`)
We have now: `getTopTaxa` / `getTopFeatures` & `getUniqueTaxa` / `getUniqueFeatures` & `countDominantTaxa` / `countDominantFeatures`
-> We could additionally have `perFeatureSummary` / `rowSummary` or similar.
-> But perhaps the current `summary` is sufficient as is?
-> However, picking up on Sudarshan's point, having one summary per rows and one per cols might be a feasible option. Instead of having separate functions for prevalence, CoV, dominance, mean abundance etc., one could have a single summary table with all this information, and the user could pick what they need from there. Or we could have both: such a summary function could call the more specific functions to provide a full overview of the data. However, this would be additional work, and I am not sure how necessary it is.
-> Suggestion: Open to discussion, curious to hear your thoughts. If this seems useful we could implement row/col-wise summaries. Otherwise, we keep things as is and keep this in mind.
-> Suggestion: check so that we stay consistent.
-> Suggestion: Could be useful, to check if it makes sense.
-> Suggestion: to check that this is sufficiently well documented as functions / examples and naming is consistent with other functions (prevalence etc)
Follow-up on Tuomas's comments: would it then be justified to have separate summaries for row, col, and full data?
The full data summary could be a more compact overview of the full data; for more details one could refer to the row/col-wise summaries. The main purpose of these is that the user can get a quick understanding of the data contents, but it does not hurt if one can also pick some useful material from these summaries for downstream analyses.
In addition there are specific "summaries" like `getPrevalentTaxa` etc. that can or cannot be added to row/col data, and used for downstream analyses from there. I think we need these and they are not a replacement for generic row/col/fulldata summary even if there is some overlap.
We have some summary functions nowadays. I am closing this until more concrete suggestions arise.
We could add a summary function that could generate / print out basic summaries of a `TSE` object for microbiome data. This would be roughly equivalent to `microbiome::summarize_phyloseq`.