waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/
69 stars 32 forks source link

Changes in subsetByColData between in 3.14 #303

Closed gogonzo closed 2 years ago

gogonzo commented 2 years ago

Hi there, I spotted changes in version for Bioc 3.14 when using subsetByColData. Code below used to work but now it throws an error

subset(
  colData(miniACC),
  subset = race == "white"
)

#> Error: logical subscript contains NAs

Previously it went fine because function use to handle NA in the condition. I think previous state was what user expects. See the definition of subset.data.frame where you can notice r & !is.na(r) which excludes NA from the subsetting operation.

function (x, subset, select, drop = FALSE, ...) 
{
    r <- if (missing(subset)) 
        rep_len(TRUE, nrow(x))
    else {
        e <- substitute(subset)
        r <- eval(e, x, parent.frame())
        if (!is.logical(r)) 
            stop("'subset' must be logical")
        r & !is.na(r)
    }
   ...
    x[r, vars, drop = drop]
}

I think that the problem lays in this line https://github.com/waldronlab/MultiAssayExperiment/blob/f256a29b46dea4e202f9cc2983210aa4acb6fb03/R/subsetBy-methods.R#L264 which I believe can be modified to

    newcoldata <- if (is.logical(y)) {
      subset(coldata, subset = y, drop = FALSE)
    } else {
      # subset would normally throws an error here
      # but MAE can be filtered also by indices (integer or rowname)
      coldata[y, , drop = FALSE]
    }

What do you think? Regards, Dawid Kaledkowski

LiNk-NY commented 2 years ago

Hi Dawid, @gogonzo

We've increased the efficiency of the subsetting operation by cleaning up some of the code and resolving an issue with multiple subsequent subsets. We directly pass along the subset vector to the colData and let DataFrame handle it. It is possible to use subset in our code but that may change depending on whether subset is meant to replicate the bracket operation [ (which is what we use now). That question would have to be answered by the DataFrame maintainers.

In the meantime, you can definitely use a more complete subset vector, e.g., miniACC$race == "white" & !is.na(miniACC$race) or which(miniACC$race == "white").

NAs should be handled explicitly either by removing them beforehand or using an argument like na.rm.

Best regards, Marcel

gogonzo commented 2 years ago

@LiNk-NY Yup, I understand that changes made were needed but I don't see why DataFrame should be responsible for handling NAs. If we refer to the base R, good example is iris[c(1, NA), ] which will return column with NA and that is why I don't expect [.DataFrame to manage missing values.

object[c(1, NA), ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1           5.1         3.5          1.4         0.2  setosa
# NA           NA          NA           NA          NA    <NA>

I believe problem lays in MultiAssayExperiment where subsetByColData does not handle NAs as subset.default does by x[subset & !is.na(subset)]

var <- c(T, NA, rep(F, 148))
subset(iris, var)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa

Do you agree or disagree?

LiNk-NY commented 2 years ago

DataFrame is modeled after data.frame but it does not necessarily follow the same behavior in all cases. If subset is meant to replicate the behavior of [, then the subset code that you mention would no longer work after a corresponding update in the DataFrame interface.

I believe problem lays in MultiAssayExperiment where subsetByColData does not handle NAs as subset.default does by x[subset & !is.na(subset)]

We are not following the subset.default contract because subsetByColData is not an S3 subset method.

In order to make the handling of NAs more explicit, one possible enhancement could be to include a na.rm argument in the function. Otherwise, the user has to make sure that the input does not include NAs as when subsetting a DataFrame using the bracket method.

Best, Marcel

gogonzo commented 2 years ago

Hi @LiNk-NY thank you for the answer. I still think that problem lays in MAE, I'm even more sure now about this. There are still problems with the S4Vector which I have reported but it's not case for the MAE::subsetByColData, but it's a smaller problem as far as subset.DataFrame works as expected. After the (change)[https://github.com/Bioconductor/S4Vectors/issues/89] subsetting by empty vector is possible, and change in the MAE adjusted to this case correctly.

library(MultiAssayExperiment)

# S4Vectors and MultiAssayExperiment
DF <- miniACC@colData

subset(DF, race %in% c())
DF[DF$race %in% c(), ]
subsetByColData(miniACC, DF$race %in% c())

# base R
df <- data.frame(a = c(1, NA, 3), b = 1:3)

subset(df, a %in% c())
df[df$a %in% c(), ]

However, MAE failed to cover the case didn't fail in subsetByColData (bioc 3.13). Users see this as a change of the API of the public method. Function also has a name which suggest it's a subset (applied on a colData) and not [ which was correct until the last update.

# S4Vectors and MultiAssayExperiment
subset(DF, race == "white") # 
DF[DF$race == "white", ] # error - reported in S4Vectors
subsetByColData(miniACC, DF$race == "white") # error

# base R
df[df$a > 0, ] # return includes NAs
subset(df, a > 0) # omits NA 

I fully understand the reason - it's because subsetByColData uses [ directly to subset which is good, but not entirely. https://github.com/waldronlab/MultiAssayExperiment/blob/f256a29b46dea4e202f9cc2983210aa4acb6fb03/R/subsetBy-methods.R#L264

This line is the reason of the failure, and I wouldn't expect S4Vectors to fix this issue, because it should rather be like I've proposed below or in the first comment.

newcoldata <- coldata[y & !is.na(y), , drop = FALSE]

I can even help you to make a PR if you agree.

Regards, DK

LiNk-NY commented 2 years ago

There is a general misconception that subsetByColData should act like subset and that is not the case. Any subsetting done in MultiAssayExperiment follows the established DataFrame and related S4 data structures. If DataFrame changes how they subset, we will follow.

The removal of NA's should be explicit and preferably done by the user. I consider this to be usual and customary practice in R.

As above, we can consider adding an na.rm argument but that issue would have to be opened up as a feature request / pull request.

Best regards, Marcel

gogonzo commented 2 years ago

@LiNk-NY

As above, we can consider adding an na.rm argument but that issue would have to be opened up as a feature request / pull request.

Ok, forget the whole discussion. +1 to na.rm

Regards, DK