Naming of the cluster function

antagomir commented 1 year ago

The cluster function returns a \code{SummarizedExperiment} with clustering information in its colData or rowData:

cluster(... , MARGIN="features")
cluster(... , MARGIN="samples")

This could then be used to visualize the clusters on ordination plots, or to merge rows/cols e.g. with mergeFeatures / mergeSamples.

Usual connotation of "clustering" would be that it groups elements together; this could also include merging but the current cluster function does not merge rows/cols, it only adds information in rowData / colData.

In this sense it is similar to functions like:

addPerSampleDominantTaxa (adds information on dominant taxon per sample to colData)
estimateDiversity (estimates alpha diversity and adds it to colData)

We also have:

perSampleDominantTaxa (checks dominant taxa per sample, and returns a named character vector with this info) -> this only calculates dominant taxon but does not add to colData (as in addPerSampleDominantTaxa which extends this function)

There is clearly some divergence in the naming schemes here.

I suggest the following naming scheme (via deprecation of the old names):

cluster -> addCluster
addPerSampleDominantTaxa -> addPerSampleDominantFeature (taxa -> feature, as agreed elsewhere)
estimateDiversity -> addDiversity (same for other estimate* functions: diversity, richness, evenness, dominance)

However, if these functions always add information to row/col data, then we can ask if the add* is unnecessary term. Could also be, for instance, just:

cluster -> cluster
addPerSampleDominantTaxa -> dominantFeature
estimateDiversity -> diversity

In addition, we can consider whether functions like perSampleDominantTaxa need to be exported. This returns the information before it is being added to colData. If we export these functions, then it would be good to do systematically and also have the same available for diversities and clusters.

Not sure what works best but the naming conventions may vary a bit too much at the moment.

BananaCancer commented 1 year ago

I would have a preference for the latter option: removing the add in front, but I think it's also better to have an action/verb in the function name. cluster works but dominantFeature and diversity might lack clarity as to what their purpose is. So if the latter two could be renamed it would be nice. Otherwise, the option of adding add in front of the functions could work.

As for the export of functions like perSampleDominantTaxa, I'm not sure it's necessary. It could be confusing to have plenty of functions with almost the same purpose from a user point of view. If the user wants to access the data, he can just access it from the TSE. If need be, maybe a parameter could be added to handle this. For example, a storeData parameter, that stores data if set to true and doesn't if not.

antagomir commented 1 year ago

I agree with the latter point, i.e. to have just one function for the task.

I sort of like the idea of adding a handle that could store the data, or not store. However, this could be added later if appears to become relevant.

If there will be no separate functions for just calculating and returning stuff vs. adding them to colData, then there is also less need to indicate this in the function name. Then "addX" would be redundant, and even if misleading if the suggested handle will be added later.

Then suitable, hopefully final, function names could be:

cluster -> cluster or getCluster (I agree that just "cluster" is quite nice; unless we like to have similar prefix for all functions that get some per-sample information)
addPerSampleDominantTaxa -> getDominant or getDominantElement (for a margin-agnostic name)
estimateDiversity -> getDiversity (no need to change other than to harmonize the naming scheme)

I could even be in favor of extending the dominantFeature & diversity functions to cover both row and col margins.. no fundamental reasons why not; the default margins would matter, though (should be samples).

TuomasBorman commented 1 year ago

I agree that names could be better. This thing is related to overall unifying of functions that we have discussed earlier. Might be best to first list all the functions and then start to think the naming convention across the ecosystem (mia vs other mia packages vs SCE)

I think one function is enough for each task. On the other hand, scater has one function for storing and one for returning just the result (just like we have currently). I think that is why we have them, but we can make our own decision.

antagomir commented 1 year ago

Ok, no renaming for now but I am opening a separate issue on that and it would be relevant to deal with relatively soon if updates are done. Preferably before the next Bioc release in Oct.

microbiome / mia

Naming of the cluster function #402