microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0

Summarize TSE object #69

Closed by antagomir 2 months ago

antagomir commented 3 years ago

We could add a summary function that could generate / print out basic summaries of a TSE object for microbiome data.

This would be roughly equivalent to microbiome::summarize_phyloseq.

FelixErnst commented 3 years ago

Sure. Can you prepare a PR?

What would be the output type? Looking at it right now, a small data.frame would probably be best.

antagomir commented 3 years ago

A data.frame would be best I think, too.

We can prepare PR later. We could start prioritizing issues because there are too many PRs to do at the same time.

FelixErnst commented 3 years ago

I think this should be added to getTopTaxa since this is a summarizing function as well. Maybe the man page needs to be reworked to reflect the intention of the shown functions better, but the foundation for summarizing data is already available.

@microsud

FelixErnst commented 3 years ago

@antagomir @microsud any new insights on this? I think a set of summarization functions would be most applicable. summarizeTaxa would best describe the idea behind the three linked issues.

Do you agree that getDominantTaxa can be reworked to summarizeDominantTaxa? On the same man page summarizeTaxa could reside as well.

antagomir commented 3 years ago

I am OK with this, at least. It is just useful to have 1) an easy way to pick dominant or "top" taxa (various criteria are possible, based on abundance or prevalence), and 2) concise summaries, as implemented earlier by @microsud in the phyloseq context.

microsud commented 3 years ago

I will work on this in the coming days; I was avoiding updating the R version on my system :P Some comments: getDominantTaxa, parts of which are inspired by microbiomeutilities::dominant_taxa, can be renamed to summarizeDominantTaxa, as it returns a tibble with an overview. On the other hand, dominantTaxa adds a column to colData, so I think it should be renamed to getDominantTaxa / addDominantTaxa. What are your thoughts? @tvborm, because you wrote this function, do you have time to make these changes? getTopTaxa technically returns the most dominant/abundant taxa across all samples when using the option sum, whereas dominantTaxa works per sample, which may not be a useful option to add to getTopTaxa.
Regarding summarizing the TSE, print(tse) already returns a good amount of information. I have some basic ideas for adding relevant information to summarizeTSE that I will test and let you know.
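To illustrate the "top taxa over all samples" idea mentioned above, here is a minimal, language-neutral sketch in plain Python. It is not mia's R implementation: the function name `top_taxa`, its `top` parameter, and the toy count table are assumptions made purely for illustration.

```python
# Hypothetical raw count table: keys are taxa, each list holds one count per sample.
counts = {
    "TaxonA": [10, 0, 5],
    "TaxonB": [0, 0, 20],
    "TaxonC": [3, 7, 0],
}


def top_taxa(counts, top=2):
    """Rank taxa by their summed counts over all samples and keep the first `top`."""
    totals = {taxon: sum(values) for taxon, values in counts.items()}
    # sorted() is stable, so tied taxa keep their original order.
    return sorted(totals, key=totals.get, reverse=True)[:top]


print(top_taxa(counts, top=2))
```

Note that the ranking is driven by pooled counts, so a taxon can rank first overall (here TaxonB, because of a single sample) without being the most abundant taxon in most samples. That is the difference from a per-sample "dominant taxon" calculation.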

microsud commented 3 years ago

I am thinking of how best to make summaries. Only raw counts should be allowed because that is what is meaningful.

  1. Overall, based on all samples and all feature counts: [image] In any publication one would like to mention these basic stats.

2a. Per-sample information: [image]

2b. Per-feature information: [image] [image]

These IMO are information that one needs by default before starting an analysis. Also, returning a tibble means one can directly save these tables during the analysis using basic write functions, or make plots like the one below: [image]

Let me know your thoughts.
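The three levels of summary described above (overall, per sample, per feature) can be sketched as follows. This is a plain-Python illustration on a made-up raw count table, not the code behind the screenshots and not mia's API; all function names and the chosen statistics are assumptions.

```python
from statistics import median

# Hypothetical raw counts: rows are features (taxa), columns are samples.
samples = ["S1", "S2", "S3"]
counts = {
    "TaxonA": [10, 0, 5],
    "TaxonB": [0, 0, 20],
    "TaxonC": [3, 7, 0],
}


def per_sample_summary(counts, samples):
    """Total counts and number of observed features for each sample."""
    rows = []
    for j, sample in enumerate(samples):
        column = [values[j] for values in counts.values()]
        rows.append({
            "sample": sample,
            "total_counts": sum(column),
            "observed_features": sum(1 for v in column if v > 0),
        })
    return rows


def per_feature_summary(counts):
    """Total counts and prevalence (fraction of samples with a nonzero count)."""
    rows = []
    for feature, values in counts.items():
        rows.append({
            "feature": feature,
            "total_counts": sum(values),
            "prevalence": sum(1 for v in values if v > 0) / len(values),
        })
    return rows


def overall_summary(counts, samples):
    """Basic statistics of the per-sample totals, as one might report in a paper."""
    totals = [row["total_counts"] for row in per_sample_summary(counts, samples)]
    return {
        "n_samples": len(samples),
        "n_features": len(counts),
        "total_counts": sum(totals),
        "min_counts": min(totals),
        "max_counts": max(totals),
        "median_counts": median(totals),
    }


print(overall_summary(counts, samples))
```

Each function returns plain rows (a list of dicts or a single dict), which mirrors the point about returning a table: the result can be written to disk or plotted without further unpacking.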

antagomir commented 3 years ago

Seems good to me at least for now. We can later add more if this turns out to be necessary.

In principle, the summary could also summarize taxa & metadata features in the same style as glimpse() does but perhaps glimpse(colData()) or glimpse(rowData()) is already enough.

TuomasBorman commented 3 years ago

The getDominantTaxa parts of which are inspired from microbiomeutilities::dominant_taxa can be renamed as summarizeDominantTaxa as it returns a tibble with overview. On the other hand, dominantTaxa adds a column to colData and therefore I think this should be renamed to getDominantTaxa / addDominantTaxa. What are your thoughts? @tvborm because you wrote this function, do you have time to make these changes?

Yep, sounds good, I can make those changes

FelixErnst commented 3 years ago

Regarding the sample-wise/feature-wise summarization: please have a look at addPerCellQC/addPerFeatureQC and check which values are missing from that output. If it is just a few, one could use the output of those functions as a basis and add the additional ones.

It might also be a good idea to follow the nomenclature. add* returns a changed input with values added to the appropriate metadata dimension (colData/rowData). However, normally those functions are just wrappers around the actual workhorse (in the example above this would be perFeatureQCMetrics; for our purposes this would be a function named dominantTaxa for the wrapper addDominantTaxa).

With the data added to the rowData/colData, plotting is then straightforward using plotRowData/plotColData.
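The workhorse/wrapper split described above can be sketched in a few lines. This is a plain-Python illustration of the naming pattern only, not scuttle's or mia's code: the dict-based `experiment` layout and both function names are hypothetical stand-ins for a SummarizedExperiment and its colData.

```python
def dominant_taxa(counts, samples):
    """Workhorse: compute the most abundant taxon for each sample.

    Ties are resolved in favour of the taxon listed first.
    """
    return {
        sample: max(counts, key=lambda taxon: counts[taxon][j])
        for j, sample in enumerate(samples)
    }


def add_dominant_taxa(experiment, name="dominant_taxa"):
    """Wrapper: return a copy of the input with the result stored in the sample metadata."""
    result = dominant_taxa(experiment["counts"], experiment["samples"])
    # Copy the metadata so that the caller's object is left untouched.
    col_data = {s: dict(meta) for s, meta in experiment["col_data"].items()}
    for sample, taxon in result.items():
        col_data.setdefault(sample, {})[name] = taxon
    return {**experiment, "col_data": col_data}


# Hypothetical toy object with counts, sample names and sample metadata.
experiment = {
    "samples": ["S1", "S2", "S3"],
    "counts": {
        "TaxonA": [10, 0, 5],
        "TaxonB": [0, 0, 20],
        "TaxonC": [3, 7, 0],
    },
    "col_data": {
        "S1": {"group": "case"},
        "S2": {"group": "control"},
        "S3": {"group": "case"},
    },
}

updated = add_dominant_taxa(experiment)
print(updated["col_data"])
```

Keeping the computation in the workhorse means it can be called on its own when only the values are needed, while the add* wrapper only handles where the result is stored.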

FelixErnst commented 3 years ago

Just a thought: maybe this could all be tied together via a single man page, perhaps under the name qc-functions.

TuomasBorman commented 3 years ago

So, if I understood correctly

1 There should be additional R/Rmd files (qc-functions) for summarization functions.

2 getDominantTaxa.R should include

I'm just wondering if I can modify getDominantTaxa and move this a little bit forward.

antagomir commented 3 years ago

Yes I think it is ok to do this way.

microsud commented 3 years ago

I am working on general-purpose summarization functions. You can start by making the summarizeDominantTaxa function in summaries.R; after that I will add summarizeSE to the same file. These are basic summaries. Then there are QC-related functions, such as perFeature and perSample metrics, which I will add to qc-functions.R. @FelixErnst and @antagomir, are you okay with this plan?

antagomir commented 3 years ago

Ok to me for now. Good to move forward.

FelixErnst commented 3 years ago

Sounds good. Feel free to make adjustments when you encounter better options. We can discuss it in the subsequent PR.

FelixErnst commented 3 years ago

@microsud do you want to continue with microbiome specific metric functions?

microsud commented 3 years ago

Yes, as soon as I find time. Can we keep this open?

FelixErnst commented 2 years ago

Any news on this?

antagomir commented 2 years ago

@microsud any chance to have a look before the October release?

microsud commented 2 years ago

I think this is available as summary() in mia.

antagomir commented 2 years ago

Right, there has been good progress.

There are several points that I picked from the above discussion that are not there yet. After all this discussion, I think it would be good to make an informed decision on whether some of these can be ignored or whether we should still include them.

The list from above contains at least the following points:

  1. Moving the summaries into qc_functions.R

  2. Summary functions perhaps separately for the 1) full object (SE + TreeSE); 2) colData, 3) rowData; 4) other components or is full data summary enough? -> Also compare to scater QC functions addPerCellQC / addPerFeatureQC. Use the add* nomenclature where relevant (e.g. dominantTaxa vs. addDominantTaxa)

  3. Binding the manpages of the summary functions together

  4. Ways to pick "dominant" taxa (this seems to be now implemented in dominantTaxa.R) -> OK?

I think these are good points, if they have been addressed then let us make a note or decision here.

If you notice something else from above or otherwise kindly add.

microsud commented 2 years ago

I had a look at perFeatureQCMetrics and perCellQCMetrics again. These are now part of the scuttle package as utilities for SingleCellExperiment and are used by mia. We now provide a method for prevalence as a separate function. Initially, I thought adding a coefficient of variation per feature would be useful. However, it is hardly used in analyses, so I am hesitant to add a new QC function on top of perFeatureQCMetrics. The CV calculation is a straightforward one-liner and can be shown in OMA if we find a use case for it. IMO the summaries in the summary file make sense at the moment because none of them are QC functions per se. The rationale behind two separate files for dominant taxa is their usage (I think). @TuomasBorman may have better insight here.

Cheers, Sudarshan
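For reference, the per-feature coefficient of variation mentioned above is the standard deviation divided by the mean of each feature's counts. A minimal plain-Python sketch, assuming the sample standard deviation and a made-up count table (this is not code from mia or OMA):

```python
from statistics import mean, stdev

# Hypothetical raw counts: keys are features, each list holds one count per sample.
counts = {
    "TaxonA": [2, 4, 6],
    "TaxonB": [0, 0, 0],
    "TaxonC": [5, 5, 5],
}


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean; NaN when the mean is zero."""
    m = mean(values)
    if m == 0:
        return float("nan")
    return stdev(values) / m


# The calculation itself is a one-liner once the helper exists.
cv = {feature: coefficient_of_variation(values) for feature, values in counts.items()}
print(cv)
```

Features that are absent from every sample have an undefined CV, which is why the sketch returns NaN for them instead of raising a division error.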

TuomasBorman commented 2 years ago

Two separate files:

antagomir commented 2 years ago

Thanks a lot! Update:

  1. Moving the summaries into qc_functions.R -> Indeed we have more like summaries rather than quality control here.

-> Suggestion: No changes needed, case closed.

  2. Summary functions (vs. scuttle::per*QCMetrics). For clarity, I think it would be justified to use distinct summary functions for microbiome data. These can be easily extended when needed, naming can be more intuitive, and these should accept SE but also support TreeSE so that we can later include summaries for tree information.

a) full data object (Tree)SE We have now: summary -> OK as such I guess, no action points.

b) colData (similar to perCellQCMetrics) -> We could have perSampleSummary / perColumnSummary / colSummary or similar. -> But perhaps the current summary is sufficient as is?

c) rowData (similar to perFeatureQCMetrics) We have now: getTopTaxa / getTopFeatures & getUniqueTaxa / getUniqueFeatures & countDominantTaxa / countDominantFeatures -> We could additionally have perFeatureSummary / rowSummary or similar. -> But perhaps the current summary is sufficient as is?

-> However, picking up on Sudarshan's point, a summary per rows and per cols might be a feasible option: instead of having separate functions for prevalence, CoV, dominance, mean abundance etc., one could have a single summary table with all this information, and users could pick what they need from there. Or we could have both: such a summary function could call the more specific functions to provide a full overview of the data. However, this would be additional work, and I am not sure how necessary it is.

-> Suggestion: Open to discussion, curious to hear your thoughts. If this seems useful, we could implement row/col-wise summaries. Otherwise, we keep things as is and keep this in mind.

  3. Use the add* nomenclature where relevant (e.g. dominantTaxa vs. addDominantTaxa)

-> Suggestion: check so that we stay consistent.

  4. Binding the manpages of the summary functions together

-> Suggestion: Could be useful, to check if it makes sense.

  5. Ways to pick "dominant" taxa (this seems to be now implemented in dominantTaxa.R)

-> Suggestion: to check that this is sufficiently well documented as functions / examples and naming is consistent with other functions (prevalence etc)

antagomir commented 2 years ago

Follow-up on Tuomas' comments: would it then be justified to have separate summaries for row, col, and full data?

The full data summary could be more compact overview of the full data, for more details one could refer to row/col-wise summaries. The main purpose of these is that the user can have a quick understanding of the data contents but it does not hurt if one can also pick some useful material from these summaries for downstream analyses.

In addition there are specific "summaries" like getPrevalentTaxa etc. that can or cannot be added to row/col data, and used for downstream analyses from there. I think we need these and they are not a replacement for generic row/col/fulldata summary even if there is some overlap.

antagomir commented 2 months ago

We nowadays have some summary functions. I am closing this until more concrete suggestions arise again.