Closed TuomasBorman closed 3 weeks ago
2.
I think it complicates things. The merging is done with factor and factor works like that; NA is handles as "no group" so these NA values are not grouped. Instead, now we are creating artificial "NA" group.
In the previous PR, we only added na.rm to agglomerateByVariable, agglomerateByRank already had that. However, I'm ok with that if we decide to keep na.rm also for Variable, but there were bugs in the code. When Rank is calling *Variable internally, na.rm must be hard-coded to TRUE (which was the behavior earlier before these modifications).
3.
I am not sure of the formal definition of MARGIN, but for me it is clear that MARGIN = 1 means that a) the agglomeration is applied to rows b) the variable can be found from rowData. Of course I am biased. The documentation should be modified so that it states this clearly enough. Also if that does not follow the formal definition, I guess we could stretch it little bit.
It would be useful, in principle, if user can choose whether to include "NA" group or just exclude NA cases. Being blissfully ignorant of some technicalities, I would be by default in favor of keeping na.rm as a standard argument in all those methods (perhaps with default na.rm=FALSE, which would then generate the "NA" group? But if na.rm option is too complicated then one could just include "NA" group always, and show how to remove it as a separate post processing step. However that is not equally neat solution.
This seems like an important point. Consider for instance: apply(x, 1, sum) == rowSums(x)
. Clearly, this does the operation per each row. In our case, agglomerateByVariable(tse, MARGIN = 1, f = "Family")
would operate per column. The apply family of methods is very central to the R ecosystem, I am not sure if we want to convert the margin meaning here. The other option is that we systematically align with the same conventions than apply methods have, and try to be very clear about this in all documentation.
This is ok. There was a bug that I fixed.
I understand your point. I think MARGIN = 2 when we want to agglomerate based on rowData is also problematic since it might be hard to understand. I don't have any clear ideas on how to solve this.
btw, @Daenarys8 can you make those tests simpler. Those that you made in last PR. They are quite hard to follow. You could check that when you agglomerateByRank, it matches with rowData(tse)$family for instance. Also same with agglomerateByVariable. Then you could check that you get same result with agglomerateByPrevalence + agglomeration and agglomerateByRank
3.
I think we could change the name. MARGIN is used in other functions also so we could switch them too to harmonize. For instance, direction
might be clear
I am still not certain whether 1 vs. 2 is confusing in the longer run. I think it is also a matter of convention & getting used to it. When it aligns with other R functions (apply family of functions), then it should be relatively easy to memorize. Examples are also available for those who need to check. Message could be thrown, something like "Agglomerating rows" so users will also see what is happening.
The advantage of margin 1/2 is a more direct compatibility with other methods/packages.
I agree that at least we should not use MARGIN in a different meaning than other common functions do, therefore direction
could be an option if we must change (which I am not sure how critical it is). Such argument could then potentially have values "rows"/"cols" instead of 1/2 if that is more clear. But I am not sure if that is any more clear because the same argumentation applies there too..
Points 1 and 2 are now fixed. The third point (MARGIN) is still open, but we have PR open about renaming. We could move the discussion there. https://github.com/microbiome/mia/pull/562
Then we could proceed with this. --> fix unit tests
Good to me - but if this is now in conflict with the established use of MARGIN then this is urgent imo.
I think MARGIN conflicts, but 1 and 2 not. 1 and 2 are established way to specify rows and columns, but they do not conflict with our way to use them since they just mean "row" and "columns" Moreover, "rows" and "cols" are currenrtly supported by MARGIN (or the parameter name that we decide), although they are not used in examples
If we allow something like direction
then could we implicitly also allow MARGIN for users who prefer that compatibility? In the same way that 1 vs. "rows" is now allowed?
We could potentially cross-check also in Bioc Slack what the broader consensus seems to be (if any).
It is much simpler to allow different values for direction/MARGIN; it all happens under the hood. But the extra parameter is visible for user so I would not support multiple parameters for same thing.
--> I would go with direction="row" (but 1 and "rows", would still be supported but not officially)
Yes, we could ask that
So to summarize:
We have used MARGIN parameter in multiple functions.
The thing that connects them is that all parameters specify whether the operation is done on rows or columns. So I would have common name for these parameters. But as they have little bit meaning, there is no consensus of this in Bioc packages, that is what I found out.
MARGIN has specific meaning in R as you noted (if MARGIN = 1, then the number of rows will be unchanged), so we cannot use that. That is why I would go with direction = "row"
or direction = "col"
.
I would still keep 1, 2, "rows", and cols", as supported values as it does not cause any extra work https://github.com/microbiome/mia/blob/9198263a0f88f3776578629495b3c22fd5dfd722/R/utils.R#L139
@antagomir
Create this kind of checks
# Check that na.rm works
# Get all phyla
all_phyla <- unique( rowData(tse)$Phylum )
# When na.rm=TRUE, there should be as many rows as there are non-NA phyla
test <- agglomerateByVariable(tse, MARGIN = 1, f="Phylum", na.rm = TRUE)
nrow(test) == length( all_phyla[!is.na(all_phyla)] )
# When na.rm = FALSE, then phyla should also include NA --> one extra row
test <- agglomerateByVariable(tse, MARGIN = 1, f="Phylum", na.rm = FALSE)
nrow(test) == length( all_phyla )
Create this kind of checks
# Check that na.rm works # Get all phyla all_phyla <- unique( rowData(tse)$Phylum ) # When na.rm=TRUE, there should be as many rows as there are non-NA phyla test <- agglomerateByVariable(tse, MARGIN = 1, f="Phylum", na.rm = TRUE) nrow(test) == length( all_phyla[!is.na(all_phyla)] ) # When na.rm = FALSE, then phyla should also include NA --> one extra row test <- agglomerateByVariable(tse, MARGIN = 1, f="Phylum", na.rm = FALSE) nrow(test) == length( all_phyla )
na.rm works
> nrow(test) == length( all_phyla[!is.na(all_phyla)] )
[1] TRUE
> # When na.rm = FALSE, then phyla should also include NA --> one extra row
> test <- agglomerateByVariable(tse, MARGIN = 1, f="Phylum", na.rm = FALSE)
> nrow(test) == length( all_phyla )
[1] TRUE
I think we are almost there, thanks!
Attention: Patch coverage is 83.33333%
with 3 lines
in your changes missing coverage. Please review.
Please upload report for BASE (
devel@ba42574
). Learn more about missing BASE report.:exclamation: Current head 8479b6f differs from pull request most recent head 3f4485a
Please upload reports for the commit 3f4485a to get more accurate results.
Files | Patch % | Lines |
---|---|---|
R/merge.R | 57.14% | 3 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
SHould be ok after resolving conflicts
Related to this PR https://github.com/microbiome/mia/issues/419.
While polishing the code, I realized that our modifications related to na.rm and onRanksOnly were not good.
onRanksOnly in agglomerateByRank controls whether to check higher ranks or not. Basically, when onRanksOnly=TRUE, the agglomeration is done similarly to agglomerateByVariable.
Also, I think added na.rm to agglomerateByVariable was is too complicated, and does not really bring benefits.
So I suggest that we remove:
This is to show behavior in this PR
@antagomir
@Daenarys8 can you solve errors in unit tests? Also, can you check that the documentation reflects these new modifications?