Closed gogonzo closed 2 years ago
Hi Dawid, @gogonzo
We've increased the efficiency of the subsetting operation by cleaning up some of the code and resolving an issue with multiple subsequent subsets. We directly pass along the subset vector to the colData
and let DataFrame
handle it.
It is possible to use subset
in our code but that may change depending on whether subset is meant to replicate the bracket operation [
(which is what we use now). That question would have to be answered by the DataFrame
maintainers.
In the meantime, you can definitely use a more complete subset vector, e.g.,
miniACC$race == "white" & !is.na(miniACC$race)
or which(miniACC$race == "white")
.
NA
s should be handled explicitly either by removing them beforehand or using an argument like na.rm
.
Best regards, Marcel
@LiNk-NY
Yup,
I understand that changes made were needed but I don't see why DataFrame
should be responsible for handling NAs. If we refer to the base R, good example is iris[c(1, NA), ]
which will return column with NA
and that is why I don't expect [.DataFrame
to manage missing values.
object[c(1, NA), ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# NA NA NA NA NA <NA>
I believe problem lays in MultiAssayExperiment
where subsetByColData
does not handle NAs as subset.default
does by x[subset & !is.na(subset)]
var <- c(T, NA, rep(F, 148))
subset(iris, var)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
Do you agree or disagree?
DataFrame
is modeled after data.frame
but it does not necessarily follow the same behavior in
all cases. If subset
is meant to replicate the behavior of [
, then the subset code that you mention
would no longer work after a corresponding update in the DataFrame
interface.
I believe problem lays in MultiAssayExperiment where subsetByColData does not handle NAs as subset.default does by x[subset & !is.na(subset)]
We are not following the subset.default
contract because subsetByColData
is not an S3
subset method.
In order to make the handling of NA
s more explicit, one possible enhancement could be to
include a na.rm
argument in the function. Otherwise, the user has to make sure that the
input does not include NA
s as when subsetting a DataFrame
using the bracket method.
Best, Marcel
Hi @LiNk-NY thank you for the answer. I still think that problem lays in MAE, I'm even more sure now about this. There are still problems with the S4Vector which I have reported but it's not case for the MAE::subsetByColData
, but it's a smaller problem as far as subset.DataFrame
works as expected.
After the (change)[https://github.com/Bioconductor/S4Vectors/issues/89]
subsetting by empty vector is possible, and change in the MAE adjusted to this
case correctly.
library(MultiAssayExperiment)
# S4Vectors and MultiAssayExperiment
DF <- miniACC@colData
subset(DF, race %in% c())
DF[DF$race %in% c(), ]
subsetByColData(miniACC, DF$race %in% c())
# base R
df <- data.frame(a = c(1, NA, 3), b = 1:3)
subset(df, a %in% c())
df[df$a %in% c(), ]
However, MAE failed to cover the case didn't fail in subsetByColData
(bioc 3.13). Users see this as a change of the API of the public method. Function also has a name which suggest it's a subset
(applied on a colData
) and not [
which was correct until the last update.
# S4Vectors and MultiAssayExperiment
subset(DF, race == "white") #
DF[DF$race == "white", ] # error - reported in S4Vectors
subsetByColData(miniACC, DF$race == "white") # error
# base R
df[df$a > 0, ] # return includes NAs
subset(df, a > 0) # omits NA
I fully understand the reason - it's because subsetByColData
uses [
directly to subset which is good, but not entirely.
https://github.com/waldronlab/MultiAssayExperiment/blob/f256a29b46dea4e202f9cc2983210aa4acb6fb03/R/subsetBy-methods.R#L264
This line is the reason of the failure, and I wouldn't expect S4Vectors to fix this issue, because it should rather be like I've proposed below or in the first comment.
newcoldata <- coldata[y & !is.na(y), , drop = FALSE]
I can even help you to make a PR if you agree.
Regards, DK
There is a general misconception that subsetByColData
should act like subset
and that is not the case.
Any subsetting done in MultiAssayExperiment
follows the established DataFrame
and related S4 data structures.
If DataFrame
changes how they subset, we will follow.
The removal of NA
's should be explicit and preferably done by the user. I consider this to be usual and customary practice in R.
As above, we can consider adding an na.rm
argument but that issue would have to be opened up as a feature request / pull request.
Best regards, Marcel
@LiNk-NY
As above, we can consider adding an na.rm argument but that issue would have to be opened up as a feature request / pull request.
Ok, forget the whole discussion. +1 to na.rm
Regards, DK
Hi there, I spotted changes in version for Bioc 3.14 when using
subsetByColData
. Code below used to work but now it throws an errorPreviously it went fine because function use to handle
NA
in the condition. I think previous state was what user expects. See the definition ofsubset.data.frame
where you can noticer & !is.na(r)
which excludesNA
from the subsetting operation.I think that the problem lays in this line https://github.com/waldronlab/MultiAssayExperiment/blob/f256a29b46dea4e202f9cc2983210aa4acb6fb03/R/subsetBy-methods.R#L264 which I believe can be modified to
What do you think? Regards, Dawid Kaledkowski