microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0

addPerSampleDominantFeatures / perSampleDominantFeatures new argument? #420

Closed antagomir closed 4 months ago

antagomir commented 1 year ago

We frequently need to retrieve top features for analysis and visualization purposes.

library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns

## Add dominant feature per sample to colData
tse <- addPerSampleDominantFeatures(tse, name="topfeats")

## Check the identified top features per sample
## note that this is a list if there are multiple dominant features per sample
tse$topfeats

Often we want to focus on the most dominant features and group the remaining, rarer features into an "Other" category.

This can now be done e.g. as follows:

## If there are multiple dominant features by chance,
## pick one of those per sample at random
tops <- unname(vapply(tse$topfeats, function(x) sample(unlist(x), 1), character(1)))

# Identify the top features
# (the microbiome::top function could be moved to mia
# as an internal function, or even exported)
top <- microbiome::top(tops, n=5)

# Group the rest into the "Other" category
tops[!tops %in% names(top)] <- "Other"

# Also store in the main colData 
tse$topfeats2 <- factor(tops)

However, it would be very handy for practical data wrangling if one could simply write:

tse <- addPerSampleDominantFeatures(tse, n=5, name="topfeats2", other_name="Other")

and this would achieve the same result. The same applies to the perSampleDominantFeatures function.

This could throw a warning if the dominant feature per sample is not unique.
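
To make the request concrete, here is a minimal base R sketch of the behaviour that the proposed `n` and `other_name` arguments could have. It works on a plain count matrix (features in rows, samples in columns) rather than on a TreeSummarizedExperiment, and the helper name `dominant_with_other` is hypothetical; this is not how mia implements anything, only an illustration under those assumptions.

```r
# Hypothetical helper, not part of mia: find the dominant feature per sample
# and lump everything outside the n most common dominant features into
# other_name.
dominant_with_other <- function(mat, n = 5, other_name = "Other") {
    # All features reaching the column maximum, one list element per sample
    dominant <- lapply(seq_len(ncol(mat)), function(j) {
        rownames(mat)[mat[, j] == max(mat[, j])]
    })
    names(dominant) <- colnames(mat)
    # Warn when the dominant feature is not unique in some sample
    if (any(lengths(dominant) > 1L)) {
        warning("Dominant feature is not unique in some samples; ",
                "picking one at random.")
    }
    # One feature per sample; ties are resolved at random
    tops <- vapply(dominant, function(x) x[sample.int(length(x), 1L)],
                   character(1))
    # Rank dominant features by how many samples they dominate, keep the top n
    ranked <- names(sort(table(tops), decreasing = TRUE))
    keep <- head(ranked, n)
    tops[!tops %in% keep] <- other_name
    factor(tops)
}

# Toy count matrix: features in rows, samples in columns
counts <- matrix(c(9, 1, 0,
                   8, 2, 1,
                   7, 0, 3,
                   0, 7, 1,
                   2, 5, 1,
                   1, 1, 6),
                 nrow = 3,
                 dimnames = list(c("A", "B", "C"), paste0("s", 1:6)))
res <- dominant_with_other(counts, n = 2)
res
```

In the toy data, feature C dominates only one sample, so with `n = 2` it ends up in the "Other" level. The result is a factor that could be stored in colData directly.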

antagomir commented 1 year ago

While at it, these functions could have an argument for picking one element per sample at random when there are multiple dominant features per sample. A warning should always be thrown.

I noticed that addPerSampleDominantFeatures(tse) returns a list with one element per sample, i.e. of length ncol(tse), whereas perSampleDominantFeatures(tse) returns a vector whose length is the total number of dominant features. This can be larger than ncol(tse) when some samples have multiple dominant features. It could be useful to have the same output in both cases (e.g. the full list), plus an optional argument to shrink this to a vector with a single randomly chosen instance per sample if multiple ones exist.
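
The unified output suggested here could look roughly like the following base R sketch. It again assumes a plain count matrix with features in rows and samples in columns; the function name `dominant_per_sample` and the argument `pick_one` are hypothetical and only illustrate the idea of returning the full list by default and a one-per-sample vector on request.

```r
# Hypothetical sketch, not the mia API: return the full per-sample list by
# default, or a vector with one randomly chosen feature per sample.
dominant_per_sample <- function(mat, pick_one = FALSE) {
    # All features reaching the column maximum, one list element per sample
    dominant <- lapply(seq_len(ncol(mat)), function(j) {
        rownames(mat)[mat[, j] == max(mat[, j])]
    })
    names(dominant) <- colnames(mat)
    if (!pick_one) {
        return(dominant)
    }
    if (any(lengths(dominant) > 1L)) {
        warning("Multiple dominant features in some samples; ",
                "one is chosen at random.")
    }
    # Shrink to a character vector with exactly one entry per sample
    vapply(dominant, function(x) x[sample.int(length(x), 1L)], character(1))
}

# Sample s1 has a tie between A and B, sample s2 is dominated by C
counts <- matrix(c(5, 5, 0,
                   1, 2, 9),
                 nrow = 3,
                 dimnames = list(c("A", "B", "C"), c("s1", "s2")))

full <- dominant_per_sample(counts)   # list, one element per sample
lengths(full)                         # number of dominant features per sample

one <- suppressWarnings(dominant_per_sample(counts, pick_one = TRUE))
length(one) == ncol(counts)           # always one entry per sample
```

With this design both outputs always have one entry per sample, so either can be assigned to colData without further reshaping.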