waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/
69 stars 32 forks source link

Things I don't like about harmonization during subsetting #290

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

Problem 1

When I'm explicitly dropping data from the object, the class shouldn't bother me with unnecessary messages.

library(MultiAssayExperiment)
example(MultiAssayExperiment, echo=FALSE)
mae[,,1]
## harmonizing input:
##   removing 7 sampleMap rows not in names(experiments)

Well, yes, that's what I asked for. If this message is to be treated as a diagnostic for potential problems, its presence here is a false positive; one that encourages people to ignore the actual problems as discussed in #289.

Problem 2

Consider the following:

names(experiments(mae))
## [1] "Affy"       "Methyl450k" "RNASeqGene" "GISTIC"
names(experiments(mae[,3,]))
## [1] "Affy"       "Methyl450k" "RNASeqGene" "GISTIC"
names(experiments(mae[,4,]))
## harmonizing input:
##   removing 1 sampleMap rows with 'colname' not in colnames of experiments
## [1] "Affy"       "Methyl450k" "RNASeqGene"

First two are good. But what happened to the last one? Why did my experiment disappear without an explicit instruction?

If this is a feature, it's too smart for its own good. I would expect a zero-column SE, which avoids loss of the row annotation and changes to the numeric indices on the experiments. Say GISTIC was actually the first experiment:

experiments(mae) <- experiments(mae)[4:1]
names(experiments(mae))
## [1] "GISTIC"     "RNASeqGene" "Methyl450k" "Affy"
sub <- mae[,4,]
names(experiments(sub))
## harmonizing input:
##   removing 1 sampleMap rows with 'colname' not in colnames of experiments
## [1] "RNASeqGene" "Methyl450k" "Affy"

If I expected to get the RNA-seq experiment with sub[[2]] (which is entirely reasonable), I'd be in trouble.

LiNk-NY commented 3 years ago

Hi Aaron, @LTLA

For problem 1, we will consider having a package-wide option for removing these messages for the experienced user.

Problem 2: Perhaps this should be more prominent throughout the documentation but if you look at ?subsetBy, you will see that drop = TRUE by default. This was added to keep things in line with the generic. If you wish to keep experiments in the MultiAssayExperiment object you could do mae[, 4, , drop = FALSE]. We recommend against using numeric indices for subsetting in general because of the issues that you point out. We do recommend the use of experiment names for subsetting: sub[["RNASeqGene"]]. Best, Marcel

lwaldron commented 3 years ago

It would be worth polling MAE users & importers about some things like the drop=TRUE default, and harmonization and messages, to bring more people into the discussion and get a better idea of prevailing opinions. It's been a while since we convened a MultiAssayExperiment call but maybe the time has come. Thanks for raising these questions @LTLA.

LTLA commented 3 years ago

For problem 1, we will consider having a package-wide option for removing these messages for the experienced user.

Even as an experienced user, I have no idea which "Harmonizing input" messages are just noise and which ones are truly problematic. The package would benefit from a more judicious approach to emitting messages.

? Perhaps this should be more prominent throughout the documentation but if you look at ?subsetBy, you will see that drop = TRUE by default.

Firstly, this is not a correct interpretation of drop=TRUE. In an MAE, you have three conceptual dimensions - features, samples and experiments, based on the implementation of your [ method. Now, drop=TRUE should only have an effect when one of those dimensions is of length 1. For example, mae[,,1] could hypothetically drop the length-1 "experiment" dimension and return the underlying SE. However, the current implementation involves dropping elements of the dimension, rather than the entire dimension altogether. This has no relation to the expected behavior of drop= or drop().

Secondly, there is no obligation to actually respond to a drop=TRUE setting. The SE class doesn't, and any subsetting operation will always return another SE. Indeed, your mae[,,1] example follows this logic and returns another MAE, even though a correct treatment of drop=TRUE would require that an SE be returned.

We recommend against using numeric indices for subsetting in general because of the issues that you point out.

Well, lots of use cases involve numeric indices, so it would be troublesome if that were not supported. Moreover, the use of a name still doesn't solve my problem, because I might be doing sub[["GISTIC"]], which would fail because it disappeared.

LiNk-NY commented 3 years ago

For problem 1, we will consider having a package-wide option for removing these messages for the experienced user.

Even as an experienced user, I have no idea which "Harmonizing input" messages are just noise and which ones are truly problematic. The package would benefit from a more judicious approach to emitting messages.

I agree though it would take some sort of guessing to differentiate the problematic vs "just noise" operations. I would say it's better to be explicit and use renamePrimary for renaming rownames in the colData. I can add an error when none of the rownames in value match with those in rownames(colData).

Perhaps this should be more prominent throughout the documentation but if you look at ?subsetBy, you will see that drop = TRUE by default.

Firstly, this is not a correct interpretation of drop=TRUE. In an MAE, you have three conceptual dimensions - features, samples and experiments, based on the implementation of your [ method. Now, drop=TRUE should only have an effect when one of those dimensions is of length 1. For example, mae[,,1] could hypothetically drop the length-1 "experiment" dimension and return the underlying SE. However, the current implementation involves dropping elements of the dimension, rather than the entire dimension altogether. This has no relation to the expected behavior of drop= or drop(). Secondly, there is no obligation to actually respond to a drop=TRUE setting. The SE class doesn't, and any subsetting operation will always return another SE. Indeed, your mae[,,1] example follows this logic and returns another MAE, even though a correct treatment of drop=TRUE would require that an SE be returned.

Given that SummarizedExperiment ignores the drop= setting and works mostly in two dimensions, we took some liberty and implemented the drop argument to be of use for the third dimension. It doesn't exactly line up with the original [.data.frame drop argument obligation but we took it to provide some value when non-informative assays are present.

We recommend against using numeric indices for subsetting in general because of the issues that you point out.

Well, lots of use cases involve numeric indices, so it would be troublesome if that were not supported. Moreover, the use of a name still doesn't solve my problem, because I might be doing sub[["GISTIC"]], which would fail because it disappeared.

This would be preferred and more robust.

PS. Sorry for the edits I thought I had clicked on Quote reply

LTLA commented 3 years ago

we took some liberty and implemented the drop argument to be of use for the third dimension.

That would seem to be inappropriate. The drop= argument has a very specific meaning, and if you don't intend to follow it (which is clearly the case here), then you should use a different argument name altogether.

we took it to provide some value when non-informative assays are present.

The current defaults provide negative value - it is removing data that it has no right to remove. There is a big difference between "this experiment is here with no columns" and "there is no experiment of this name". With the former, I know the experiment was actually done, but with the latter... who knows? Maybe you removed it, maybe it was never done at all, I wouldn't be able to tell.

In practical terms, this default throws out the rowData() and metadata() that I might need later. And of course, it screws up the positional indexing of all other experiments in the subsetted object. I mean, just look at this:

library(MultiAssayExperiment)
example(MultiAssayExperiment, ask=FALSE)
sub <- mae[,0,]
experiments(sub)
## ExperimentList class object of length 0:

No experiments are retained, it's gone, it's all just gone. It is highly concerning that the defaults cannot guarantee that names(experiments(sub)) is the same as names(experiments(mae)) when I am not subsetting on the experiments.

If you want to have this non-standard behavior, I would suggest making a separate function rather than having it baked into the default of a core function like [, which is expected to behave in a very standard way. The best analogy would be droplevels().

This would be preferred and more robust.

As I said, names don't solve my problems with the current default behavior of [.

lwaldron commented 3 years ago

I appreciate hearing your opinion Aaron, but given the mature stage of MAE, we will pay attention to whether there is broader user demand for making such changes to the default subsetting behavior. In the meantime, the behavior you want is available by setting drop=FALSE.

I would add that the current default behavior facilitates common operations of subsetting followed by reshaping (wideFormat / longFormat), which would be broken under both options of the drop argument as you suggest implementing it. They would require additional steps to remove zero-row and zero-column experiments, but without unintentionally extracting a single element of the MAE list like you said should happen with drop=TRUE. I doubt anyone (even you?) would argue that MAE[, , 1, drop=TRUE] returning whatever rectangular object is in position 1 would be useful, only that it might be more consistent (if you squint hard enough) with default behavior of data.frame. If we rule that out as helpful to no one, then it becomes a question of a) changing the default of drop, and/or b) changing the name of the argument, both questions that would require a broader mandate to act on.

LiNk-NY commented 3 years ago

I agree with Levi and Aaron. drop could've used a better name so that it doesn't confuse users with expected behavior from data.frame. The current default works seamlessly with other operations as Levi mentioned and was meant as a convenience for users. Perhaps we can add a warning / message when drop code is invoked to notify users that some assays are being removed. We can also add the assay names to the drops slot to better document what is being removed.

cvanderaa commented 3 years ago

Hi all,

I would like to upvote Problem1 that was raised by @LTLA. It seems odd to me as well that a message (and now also a warning) is thrown when we explicitly want to subset some assays (and hence remove some SampleMap rows). It sounds as if we are doing something wrong.

Regarding problem 2, I think you solved it with a recent commit, right? I personally like that drop=TRUE, but I guess that's because I'm used to it. I agree that naming it drop is a little confusing.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

LTLA commented 3 years ago

Yet another real-world bug caused by the current default harmonization policy:

library(MultiAssayExperiment)

se <- SummarizedExperiment(matrix(rnorm(1000), 10, 10))
rownames(se) <- 1:10
colnames(se) <- letters[1:10]
rowData(se)$IMPORTANT_STUFF_HERE <- c("Very", "important", "stuff", "was",
     "put", "here", "that", "must", "be", "retained")

sm <- DataFrame(colname=character(0), primary=character(0), assay=character(0))
mae <- MultiAssayExperiment(list(counts=se), sampleMap=sm)

experiments(mae)
## ExperimentList class object of length 0:

To force MultiAssayExperiment to behave, I needed to RTFC to get this hack:

sm$assay <- factor(sm$assay, "counts")
mae <- MultiAssayExperiment(list(counts=se), sampleMap=sm)
experiments(mae)
## ExperimentList class object of length 1:
## [1] counts: SummarizedExperiment with 10 rows and 0 columns

This should not be necessary. I should get out exactly what I put in.

LiNk-NY commented 3 years ago

I'm not sure what is the usecase for an empty experiment but the behavior looks right here.

I've added a warning for when the provided sampleMap is empty: https://github.com/waldronlab/MultiAssayExperiment/compare/8393f23..master#diff-79355c3dc57b03f2f9bd9abf97c7a33fe8b3cd956a9fe78e4b429a6e7a394f8fR306-R308 And updated the documentation to provide more info.

The more common alternative is to not provide a sampleMap and have the data preserved. Feel free to open another issue if this does not solve it. Thanks!

LTLA commented 3 years ago

Well, you can't just remove the experiment. There's no rationale for doing so. If I put in an empty experiment, I very well should get an MAE with an empty experiment back out. No other BioC container is so presumptuous as to remove empty entries for me.

LiNk-NY commented 3 years ago

The rationale is provided by the sampleMap. If the sampleMap is empty, the MultiAssayExperiment will follow that. A sampleMap such as:

sm <- DataFrame(
    assay = factor(rep("counts", length(colnames(se)))),
    primary = colnames(se),
    colname = colnames(se)
)
mae <- MultiAssayExperiment(list(counts=se), sampleMap=sm)
sampleMap(mae)
# DataFrame with 10 rows and 3 columns
#       assay     primary     colname
#    <factor> <character> <character>
# 1    counts           a           a
# 2    counts           b           b
# 3    counts           c           c
# 4    counts           d           d
# 5    counts           e           e
...

would provide the same result as the auto-generated sampleMap (via mae <- MultiAssayExperiment(list(counts=se))) and preserve the experiments.

LTLA commented 3 years ago

No, the expected behavior is to obtain an MAE with a single SE that has zero columns. You can't be throwing out an entire SE like that. Even if there's no columns, there's information in the rowData and the metadata that you're losing.

As I said before and I'll say again: harmonization should never automatically remove an experiment. That's not your decision to make, it discards data, and it breaks any script that depends on numeric indices for the experiments. And don't tell me that people shouldn't be using numeric indices; they can use it for every other BioC structure so they better be able to use it here.

LiNk-NY commented 3 years ago

The expected behavior is that you keep all the columns in the experiment. Now, this is an edge case and we're glad that you were able to find a solution.

Keeping track of 0 column experiments in the sampleMap was a challenge and that's why we used a factor for the assay column. Again, the sampleMap dictates what gets thrown out if it is not carefully created and passed into the constructor function.

As I said before and I'll say again: harmonization should never automatically remove an experiment.

This is not an issue anymore. We've changed the drop default to FALSE and added a warning when subsetting.

And don't tell me that people shouldn't be using numeric indices;

As you know, users do all sorts of things. That's the price to pay for using numeric indices, there is always a risk if the coding is not robust.

LTLA commented 3 years ago

If you're converting to a factor anyway, then you should just set the levels to contain the names of all the supplied experiments so that this dropping does not occur. And if a factor is supplied and the levels != names of the experiments, an outright error.

That's the price to pay for using numeric indices, there is always a risk if the coding is not robust.

Numeric indices are safer in many contexts where names are not guaranteed to be unique or even present. For example, assays(), or the column names of a DataFrame, or even the row names of a DataFrame. I would in fact say that numeric indices are the lower risk option in many BioC-related applications, and I would expect the same level of safety with MAEs.