agglomerate* functions: behavior when NA

Leo;

1.agglomerateByPrevalence: Indeed, it seems that your analysis is correct; the "Other" group only contains 0's and 1's. Then the system is not sure if these are actual counts that are sensible to sum up, and throws a warning. In principle this works as expected. In practice, we already know from the context that it is actual count data because we could test that the original mae[[1]] is count data, and hence any subset of it (those that will be merged under the "Other" category) will also be. This could deserve a small fix that would check the "count" status in such cases for the original input only. This will require thinking a bit about the logic of the method.

I noticed also another point; agglomerating by rank or prevalence will give different total read counts per sample, although they would be expected to give identical count (just different grouping of the rows).

colSums(assay(agglomerateByRank(mae[[1]], rank = "Phylum")))
colSums(assay(agglomerateByPrevalence(mae[[1]], rank = "Phylum")))

This is because the Phylum rank includes NAs for some rows: sum(is.na(rowData(mae[[1]])$Phylum)) yields 93. These are omitted with agglomerateByPrevalence but not with agglomerateByRank (they will be included as NA row in the latter). It would be most logical that the NA row would be included also in the data that is agglomerated by prevalence. The user can choose whether they want to merge such NA row further. One problem with the NA row is that these may come from different phyla, and hence grouping them together in the phylum level agglomeration is potentially misleading. I would solve this by providing a binary argument that excludes the NA phyla by default in all agglomerations (rank, prevalence, or other grouping variable) but user could choose to keep these by switching the argument (then they are aware of this and can maintain the original read count, which might be relevant in some cases).

microbiome / mia

agglomerate* functions: behavior when NA #411