microbiome / mia

Microbiome analysis
https://microbiome.github.io/mia/
Artistic License 2.0
45 stars 25 forks source link

Fixes to agglomeration #556

Closed TuomasBorman closed 3 weeks ago

TuomasBorman commented 1 month ago

Related to this PR https://github.com/microbiome/mia/issues/419.

While polishing the code, I realized that our modifications related to na.rm and onRanksOnly were not good.

onRanksOnly in agglomerateByRank controls whether to check higher ranks or not. Basically, when onRanksOnly=TRUE, the agglomeration is done similarly to agglomerateByVariable.

Also, I think added na.rm to agglomerateByVariable was is too complicated, and does not really bring benefits.

So I suggest that we remove:

  1. Switch back to onRankOnly=TRUE by default
  2. Remove na.rm from agglomerateByVariable

This is to show behavior in this PR

library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns

# Works similarly to agglomerateByVarible
tse1 <- agglomerateByRank(tse, rank = "Family", onRankOnly = TRUE)
tse2 <- agglomerateByVariable(tse, MARGIN = 1, f = "Family")
dim(tse1) # 334  26
dim(tse2) # 334  26

# Now it works differently than agglomerateByVariable since it takes into accounts also higher ranks
# Now it might be that there are duplicated rownames. Same family names might have different phyla rank.
tse1 <- agglomerateByRank(tse, rank = "Family")
dim(tse1) # 341  26

# If we set, na.rm == FALSE, we get also taxa that are in higher rank (rows might include phyla for instance)
tse1 <- agglomerateByRank(tse, rank = "Family", na.rm = FALSE)
dim(tse1) # 603  26

# This is how to get similar result by using agglomerateByPrevalence
tse1 <- agglomerateByRank(tse, rank = "Family")
tse1 <- agglomerateByPrevalence(tse1, assay.type="counts", detection  = 0.5/100, prevalence  = 20/100)
tse2 <- agglomerateByPrevalence(tse, assay.type="counts", detection  = 0.5/100, prevalence  = 20/100, rank = "Family")
dim(tse1) # 269  26
dim(tse2) # 269  26

# Or if we want to also include higher ranks
tse1 <- agglomerateByRank(tse, rank = "Family", na.rm = FALSE)
tse1 <- agglomerateByPrevalence(tse1, assay.type="counts", detection  = 0.5/100, prevalence  = 20/100)
tse2 <- agglomerateByPrevalence(tse, assay.type="counts", detection  = 0.5/100, prevalence  = 20/100, rank = "Family", agg.na.rm = FALSE)
dim(tse1) # 446  26
dim(tse2) # 446  26

@antagomir

@Daenarys8 can you solve errors in unit tests? Also, can you check that the documentation reflects these new modifications?

antagomir commented 1 month ago
  1. onRankOnly default - sounds reasonable
  2. "Remove na.rm from agglomerateByVariable": might make sense but I didnt find the reasoning why it is not good there but it would be however needed in agglomerateByRank; ideally the treatment would be as unified as possible. But deviations are ok when it is justified, of course.
  3. "agglomerateByVariable(tse, MARGIN = 1, f = "Family")" in fact the use of MARGIN is in some sense nonconventional; usually margin=1 in R means that operations are done per row but in this case the operations are done per column even if the idea is that the function operates "on the rows".. I might be in favor of changing this so that the MARGIN meaning is switched. In the end, we are here aggregating variables by sample , correct? It would be crucial to be very clear about this in the documentation and examples.
TuomasBorman commented 1 month ago

2.

I think it complicates things. The merging is done with factor and factor works like that; NA is handles as "no group" so these NA values are not grouped. Instead, now we are creating artificial "NA" group.

In the previous PR, we only added na.rm to agglomerateByVariable, agglomerateByRank already had that. However, I'm ok with that if we decide to keep na.rm also for Variable, but there were bugs in the code. When Rank is calling *Variable internally, na.rm must be hard-coded to TRUE (which was the behavior earlier before these modifications).

3.

I am not sure of the formal definition of MARGIN, but for me it is clear that MARGIN = 1 means that a) the agglomeration is applied to rows b) the variable can be found from rowData. Of course I am biased. The documentation should be modified so that it states this clearly enough. Also if that does not follow the formal definition, I guess we could stretch it little bit.

antagomir commented 1 month ago
  1. It would be useful, in principle, if user can choose whether to include "NA" group or just exclude NA cases. Being blissfully ignorant of some technicalities, I would be by default in favor of keeping na.rm as a standard argument in all those methods (perhaps with default na.rm=FALSE, which would then generate the "NA" group? But if na.rm option is too complicated then one could just include "NA" group always, and show how to remove it as a separate post processing step. However that is not equally neat solution.

  2. This seems like an important point. Consider for instance: apply(x, 1, sum) == rowSums(x). Clearly, this does the operation per each row. In our case, agglomerateByVariable(tse, MARGIN = 1, f = "Family") would operate per column. The apply family of methods is very central to the R ecosystem, I am not sure if we want to convert the margin meaning here. The other option is that we systematically align with the same conventions than apply methods have, and try to be very clear about this in all documentation.

TuomasBorman commented 1 month ago
  1. This is ok. There was a bug that I fixed.

  2. I understand your point. I think MARGIN = 2 when we want to agglomerate based on rowData is also problematic since it might be hard to understand. I don't have any clear ideas on how to solve this.

btw, @Daenarys8 can you make those tests simpler. Those that you made in last PR. They are quite hard to follow. You could check that when you agglomerateByRank, it matches with rowData(tse)$family for instance. Also same with agglomerateByVariable. Then you could check that you get same result with agglomerateByPrevalence + agglomeration and agglomerateByRank

TuomasBorman commented 1 month ago

3. I think we could change the name. MARGIN is used in other functions also so we could switch them too to harmonize. For instance, direction might be clear

antagomir commented 1 month ago

I am still not certain whether 1 vs. 2 is confusing in the longer run. I think it is also a matter of convention & getting used to it. When it aligns with other R functions (apply family of functions), then it should be relatively easy to memorize. Examples are also available for those who need to check. Message could be thrown, something like "Agglomerating rows" so users will also see what is happening.

The advantage of margin 1/2 is a more direct compatibility with other methods/packages.

I agree that at least we should not use MARGIN in a different meaning than other common functions do, therefore direction could be an option if we must change (which I am not sure how critical it is). Such argument could then potentially have values "rows"/"cols" instead of 1/2 if that is more clear. But I am not sure if that is any more clear because the same argumentation applies there too..

TuomasBorman commented 1 month ago

Points 1 and 2 are now fixed. The third point (MARGIN) is still open, but we have PR open about renaming. We could move the discussion there. https://github.com/microbiome/mia/pull/562

Then we could proceed with this. --> fix unit tests

antagomir commented 1 month ago

Good to me - but if this is now in conflict with the established use of MARGIN then this is urgent imo.

TuomasBorman commented 1 month ago

I think MARGIN conflicts, but 1 and 2 not. 1 and 2 are established way to specify rows and columns, but they do not conflict with our way to use them since they just mean "row" and "columns" Moreover, "rows" and "cols" are currenrtly supported by MARGIN (or the parameter name that we decide), although they are not used in examples

antagomir commented 1 month ago

If we allow something like direction then could we implicitly also allow MARGIN for users who prefer that compatibility? In the same way that 1 vs. "rows" is now allowed?

We could potentially cross-check also in Bioc Slack what the broader consensus seems to be (if any).

TuomasBorman commented 1 month ago

It is much simpler to allow different values for direction/MARGIN; it all happens under the hood. But the extra parameter is visible for user so I would not support multiple parameters for same thing.

--> I would go with direction="row" (but 1 and "rows", would still be supported but not officially)

Yes, we could ask that

TuomasBorman commented 1 month ago

So to summarize:

We have used MARGIN parameter in multiple functions.

  1. getCrossAssociation(): determine if associations are calculated for rows (1) or columns (2).
  2. addCluster(): Determines whether rows (1) or columns (2) are clustered.
  3. agglomerateByRank(): Determines whether rows (1) or columns (2) are calculated.
  4. transformAssay(): Determines whether transformation is applied row-wise (1) or column-wise (2).
  5. splitOn(): Determines whether to split based on row (1) or column (2) variable.

The thing that connects them is that all parameters specify whether the operation is done on rows or columns. So I would have common name for these parameters. But as they have little bit meaning, there is no consensus of this in Bioc packages, that is what I found out.

MARGIN has specific meaning in R as you noted (if MARGIN = 1, then the number of rows will be unchanged), so we cannot use that. That is why I would go with direction = "row" or direction = "col".

I would still keep 1, 2, "rows", and cols", as supported values as it does not cause any extra work https://github.com/microbiome/mia/blob/9198263a0f88f3776578629495b3c22fd5dfd722/R/utils.R#L139

TuomasBorman commented 1 month ago

@antagomir

TuomasBorman commented 1 month ago

Create this kind of checks


# Check that na.rm works
# Get all phyla
all_phyla <- unique( rowData(tse)$Phylum )
# When na.rm=TRUE, there should be as many rows as there are non-NA phyla
test <- agglomerateByVariable(tse, MARGIN = 1, f="Phylum", na.rm = TRUE)
nrow(test) == length( all_phyla[!is.na(all_phyla)] )
# When na.rm = FALSE, then phyla should also include NA --> one extra row
test <- agglomerateByVariable(tse, MARGIN = 1, f="Phylum", na.rm = FALSE)
nrow(test) == length( all_phyla )
Daenarys8 commented 1 month ago

Create this kind of checks


# Check that na.rm works
# Get all phyla
all_phyla <- unique( rowData(tse)$Phylum )
# When na.rm=TRUE, there should be as many rows as there are non-NA phyla
test <- agglomerateByVariable(tse, MARGIN = 1, f="Phylum", na.rm = TRUE)
nrow(test) == length( all_phyla[!is.na(all_phyla)] )
# When na.rm = FALSE, then phyla should also include NA --> one extra row
test <- agglomerateByVariable(tse, MARGIN = 1, f="Phylum", na.rm = FALSE)
nrow(test) == length( all_phyla )

na.rm works

> nrow(test) == length( all_phyla[!is.na(all_phyla)] )
[1] TRUE
> # When na.rm = FALSE, then phyla should also include NA --> one extra row
> test <- agglomerateByVariable(tse, MARGIN = 1, f="Phylum", na.rm = FALSE)
> nrow(test) == length( all_phyla )
[1] TRUE
TuomasBorman commented 1 month ago

I think we are almost there, thanks!

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 83.33333% with 3 lines in your changes missing coverage. Please review.

Please upload report for BASE (devel@ba42574). Learn more about missing BASE report.

:exclamation: Current head 8479b6f differs from pull request most recent head 3f4485a

Please upload reports for the commit 3f4485a to get more accurate results.

Files Patch % Lines
R/merge.R 57.14% 3 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## devel #556 +/- ## ======================================== Coverage ? 67.40% ======================================== Files ? 41 Lines ? 4988 Branches ? 0 ======================================== Hits ? 3362 Misses ? 1626 Partials ? 0 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

TuomasBorman commented 3 weeks ago

SHould be ok after resolving conflicts