Add wrapper for agglomerateByRank/mergeRows

Daenarys8 commented 1 year ago

There are 2 different methods doing similar (merging/grouping rows/features);

mergeRows and agglomerateByRank This wrapper method that combines these two methods.
It would be used internally by other functions so it would not be available directly for user.

antagomir commented 1 year ago

Status of this?

Daenarys8 commented 1 year ago

Status of this?

Complete for now.

antagomir commented 1 year ago

Great!

We had elsewhere discussion about function naming.

I would suggest to update as follows:

mergeRows -> mergeFeatures (see issue #392)
agglomerateByRank -> mergeFeaturesByRank (logical to use similar naming in both functions)

Downside of "merge" term is that it is also used in another meaning, to combine full SE objects, data.frames, matrices etc (instead of within object features as here). Another option would be: agglomerateFeatures and agglomerateFeaturesByRank but that is slower to write.. or combineFeature / combineFeaturesByRank (but combine is also used in the same way than merge, between distinct DataFrames)

Shall we deal with these issues here in this closely related PR, or should we open a new separate issue & PR?

TuomasBorman commented 1 year ago

Great!

We had elsewhere discussion about function naming.

I would suggest to update as follows:
* mergeRows -> mergeFeatures (see issue [Deprecate mergeRows/mergeCols #392](https://github.com/microbiome/mia/issues/392))

* agglomerateByRank -> mergeFeaturesByRank (logical to use similar naming in both functions)
Downside of "merge" term is that it is also used in another meaning, to combine full SE objects, data.frames, matrices etc (instead of within object features as here). Another option would be: agglomerateFeatures and agglomerateFeaturesByRank but that is slower to write.. or combineFeature / combineFeaturesByRank (but combine is also used in the same way than merge, between distinct DataFrames)

Shall we deal with these issues here in this closely related PR, or should we open a new separate issue & PR?

I also think mergeFeatures and mergeFeaturesByRank are better than agglomerate*, and it is good to have same naming convention.

We discussed also about enabling rowData variables in mergeRows. Currently, the groups must be fed as a vector (f parameter) but I think it would be nice to be able to specify the grouping variable directly from rowData as a column name. (Same for mergeCols)

There must be some reason why we didn't implement this before??? Do you remember @antagomir?

I think this PR can be merged, and these new modifications can be done in different PR. @Daenarys8 can you implement these?

antagomir commented 1 year ago

I agree. I don't think there is a specific reason, the rowData variables could be called by name as well.

If @Daenarys8 can open a new issue and PR about that it would be great.

You can close this issue when you have checked that it is ready.

Daenarys8 commented 1 year ago

Great! We had elsewhere discussion about function naming. I would suggest to update as follows:
* mergeRows -> mergeFeatures (see issue [Deprecate mergeRows/mergeCols #392](https://github.com/microbiome/mia/issues/392))

* agglomerateByRank -> mergeFeaturesByRank (logical to use similar naming in both functions)
Downside of "merge" term is that it is also used in another meaning, to combine full SE objects, data.frames, matrices etc (instead of within object features as here). Another option would be: agglomerateFeatures and agglomerateFeaturesByRank but that is slower to write.. or combineFeature / combineFeaturesByRank (but combine is also used in the same way than merge, between distinct DataFrames) Shall we deal with these issues here in this closely related PR, or should we open a new separate issue & PR?
I also think mergeFeatures and mergeFeaturesByRank are better than agglomerate*, and it is good to have same naming convention.

We discussed also about enabling rowData variables in mergeRows. Currently, the groups must be fed as a vector (f parameter) but I think it would be nice to be able to specify the grouping variable directly from rowData as a column name. (Same for mergeCols)

There must be some reason why we didn't implement this before??? Do you remember @antagomir?

I think this PR can be merged, and these new modifications can be done in different PR. @Daenarys8 can you implement these?

Sure, will give it a shot

antagomir commented 1 year ago

Can someone close this PR when it is confirmed to be ready.

antagomir commented 1 year ago

Was this ever merged? If I check correctly from above, it was just closed without merging ? The idea was to merge I guess?

antagomir commented 1 year ago

Hmm - - ok now I noticed that the renaming scheme discussed above, and agreed, has not made it to this PR yet.

Can we add it (@Daenarys8) ?

More specifically, the idea was to rename "Rows" to "Features" and "Cols" to "Samples".

In addition, to harmonize terminology (use "merge" instead of "agglomerate").

However, one last thing to discuss first: the meaning of "to agglomerate" is somewhat better fit with our case (in terms of language & meaning), in particular when the phylogenetic tree is involved in the process. The phyloseq equivalent to TreeSE agglomerateByRank is tax_glom (e.g. https://mikemc.github.io/speedyseq/reference/tax_glom.html).

We could use "glom" instead of "merge". That would be as fast to write, and the meaning would be more specific (merge is easier to confuse with merging of matrices). However, "glom" might be a bit weird term to introduce right now, and it is possible to re-evaluate that later as well.

Hence, in summary, I suggest to rename as:

mergeCols -> mergeSamples (or glomSamples... perhaps not?)
mergeRows -> mergeFeatures
agglomerateByRank -> mergeFeaturesByRank
agglomerateByPrevalence -> mergeFeaturesByPrevalence

@Daenarys8 could you add that to this PR as discussed above, OR open a new issue proposing this change, then merging and closing the current PR. Also confirm that the other points discussed / agreed above have now been addessred.

antagomir commented 1 year ago

Ok it is more clear to do the other discussed changes in a separate PR. I am merging and closing this one.

antagomir commented 1 year ago

Ok - now as this is merged:

the original motivation was ANCOMBC issue #174 where we would like to let users specify taxonomic rank or also other rowData variable to merge rows before running ANCOMBC: https://github.com/FrederickHuangLin/ANCOMBC/issues/174

Now, ideally this new wrapper will help there; it would group by taxonomic rank if this is available in rowData, otherwise it uses the more general grouping.

microbiome / mia

Add wrapper for agglomerateByRank/mergeRows #389