Harmonization should be more noisy and complaining (or better yet, an error)

waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data

https://waldronlab.io/MultiAssayExperiment/

69 stars 32 forks source link

Harmonization should be more noisy and complaining (or better yet, an error) #289

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

The dreaded "Harmonizing input" message is too glib for what it does, which is effectively to discard data from the object. Usually this happens in response to a relatively innocuous oversight, e.g., #287 or below:

library(MultiAssayExperiment)

# Making up some data.
mat <- matrix(rpois(10000, 5), nrow=10)
rownames(mat) <- paste0("CD", seq_len(nrow(mat))) # made-up markers
colnames(mat) <- paste0("BARCODE_", seq_len(ncol(mat))) # made-up barcodes

# First building the SE. Assuming that barcodes come from a range of donors.
donor <- rep(LETTERS[1:10], each=100)
se <- SummarizedExperiment(list(counts=mat))
se$Donor <- donor

# Now building the MAE. Each 'column' of the MAE corresponds to a donor;
# this is specified by the sample mapping from donors (primary) to barcodes (colname).
mae <- MultiAssayExperiment(list(markers=se),
    sampleMap=data.frame(assay="markers", primary=se$Donor, colname=colnames(se)),
    colData=DataFrame(row.names=LETTERS[1:10])
)

# Mocking up a replacement DF with different row names.
modified <- colData(mae)
rownames(modified) <- tolower(rownames(modified))

mae2 <- mae
colData(mae2) <- modified
## harmonizing input:
##   removing 1000 sampleMap rows with 'primary' not in colData
##   removing 10 colData rownames not in sampleMap 'primary'

# or even just this!
mae2 <- mae
rownames(colData(mae2)) <- letters[1:10]
## harmonizing input:
##   removing 1000 sampleMap rows with 'primary' not in colData
##   removing 10 colData rownames not in sampleMap 'primary'

And suddenly I have an object with no data. This is not intentional and, in fact, I would say that I have never encountered an instance where the automatic harmonization produced an object that I actually wanted. Why not emit an outright error? Users can just subset the mae explicitly to get rid of things we don't want, which is a more explicit approach to dropping data.

If there is a valid application of the auto-harmonization, it should require an explicit specification, e.g., harmonize=TRUE. One can imagine this being the case for sampleMap<-() or experiments<-(). You might even be able to provide sensible error messages customized for each setter, e.g., the above call might raise an errror with the message "mismatch in rownames of colData; if you're sure you want this, set harmonize=TRUE."

These comments come from answering a lot of questions about the MAE from internal users who just see their data... disappear. It's hard enough to get them to pay attention to warnings, no one's going to pay any attention to a message.

lwaldron commented 3 years ago

I would be hesitant to create a new default like harmonize=FALSE for the amount of existing code it would break, although a warning with a pointer to ?prepMultiAssay would be less painful. Just curious, have you tried or do you find useful prepMultiAssay() for troubleshooting construction?

LTLA commented 3 years ago

The problem is not (only) construction; it's in the innocuous assignment operations that make data disappear. Any attempt to rename columns (or the MAE or SE) is particularly dangerous. For example, if I decided that I'd like to rename my SE's columns:

colnames(mae[[1]]) <- paste0("X", colnames(mae[[1]]))
## harmonizing input:
##   removing 1000 sampleMap rows with 'colname' not in colnames of experiments
##   removing 10 colData rownames not in sampleMap 'primary'

I can't imagine a situation where the above behavior is a desirable outcome. I would prefer an error along the lines of "names in experiment 1 are different from those in sampleMap". I would go so far as to say that - constructors aside - any user-level code that breaks with a hypothetical harmonize=FALSE setting was probably not doing the right thing in the first place.

An analogous change in SummarizedExperiment is the recent withDimnames=FALSE requirement in the assay<-() setter, if you want to put in a matrix with different dimnames from the SE. In the last release, failing to do so would cause a warning; now it causes an error, because it probably wasn't the right thing to do unless you knew what you were doing.

LiNk-NY commented 3 years ago

Hi Aaron, @LTLA Thanks for expressing your concerns.

Our original intent in providing a harmonization function was for convenience to the user with dangling data not listed in the sampleMap or colData. We also encourage the use of prepMultiAssay for new users who are not as familiar with MultiAssayExperiment as you are.

Of course, we can't prevent the user from trying all sorts of things with the internals of MultiAssayExperiment. We can however provide helper functions that make the renaming a bit easier. These would be named something like renamePrimary and renameColnames (in reference to the columns in the sampleMap).

Other than that I would really recommend that the data be pre-processed before MultiAssayExperiment construction so to avoid renaming post-creation. Best, Marcel

kasperdanielhansen commented 3 years ago

Renaming samples in a SummarizedExperiment is a common operation that I at least use all the time. I don't think it is some esoteric use case where the user dig into the internals.

To me, this is an example of something where the API needs to get fixed. I understand that MAE have tons of situations which are not well covered by the standard semantics we are used to, but I find Aaron's example pretty compelling. I am not a heavy user of MAE, but I am surprised to see this.

In my opinion, this might be worth breaking backwards compatibility for.

Best, Kasper

On Wed, Jan 6, 2021 at 7:11 PM Marcel Ramos notifications@github.com wrote:

Hi Aaron, @LTLA https://github.com/LTLA Thanks for expressing your concerns.

Our original intent in providing a harmonization function was for convenience to the user with dangling data not listed in the sampleMap or colData. We also encourage the use of prepMultiAssay for new users who are not as familiar with MultiAssayExperiment as you are.

Of course, we can't prevent the user from trying all sorts of things with the internals of MultiAssayExperiment. We can however provide helper functions that make the renaming a bit easier. These would be named something like renamePrimary and renameColnames (in reference to the columns in the sampleMap).

Other than that I would really recommend that the data be pre-processed before MultiAssayExperiment construction so to avoid renaming post-creation. Best, Marcel

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/waldronlab/MultiAssayExperiment/issues/289#issuecomment-755467132, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DHZUFU4NJLFNAIEAKC3SYSRV7ANCNFSM4U3TZYNQ .

-- Best, Kasper

lwaldron commented 3 years ago

Looking at SummarizedExperiment as a template, it throws an error when attempting to change dimnames in the assay but automatically propagates changes to the rownames of colData:

suppressPackageStartupMessages(library(SummarizedExperiment))
example("SummarizedExperiment", echo = FALSE)

# attempting to rename assay columns causes error
colnames(assay(rse2)) <- paste0("X", colnames(assay(rse2)))
#> Error in `assays<-`(`*tmp*`, withDimnames = withDimnames, ..., value = `*vtmp*`): please use 'assay(x, withDimnames=FALSE)) <- value' or 'assays(x,
#>   withDimnames=FALSE)) <- value' when the dimnames on the supplied
#>   assay(s) are not identical to the dimnames on
#>   RangedSummarizedExperiment object 'x'

# `withDimnames = FALSE` causes this renaming attempt to be ignored:
colnames(assay(rse2, withDimnames = FALSE)) <- paste0("ignored", colnames(assay(rse2)))
colnames(assay(rse2))
#> [1] "A" "B" "C" "D" "E" "F"

# Note that the `withDimnames = FALSE` allows one to break the assay-colData relationship:
assay(rse2, withDimnames = FALSE) <- assay(rse2)[, ncol(assay(rse2)):1]

# But renaming rows of colData is automatically propagated to the assay:
rownames(colData(rse2)) <- paste0("X", rownames(colData(rse2)))
colnames(assay(rse2))
#> [1] "XA" "XB" "XC" "XD" "XE" "XF"
rownames(colData(rse2))
#> [1] "XA" "XB" "XC" "XD" "XE" "XF"

# This behavior is the same in LTLA's example:
modified <- colData(rse2)
rownames(modified) <- tolower(rownames(modified))
colData(rse2) <- modified
colnames(assay(rse2))
#> [1] "xa" "xb" "xc" "xd" "xe" "xf"
rownames(colData(rse2))
#> [1] "xa" "xb" "xc" "xd" "xe" "xf"

We've usually tried to follow SummarizedExperiment as a template, and it does seem reasonable here too, except that 1) modifying the rownames(colData(mae)) would then modify the sampleMap rather than the assay colnames, and 2) I'm not sure there's a rationale for the withDimnames=FALSE option in this context. It's probably true that there are few if any use cases for wanting to drop data by renaming dims, and emulating SummarizedExperiment behavior here would avoid that and may be closer to what @kasperdanielhansen and the users @LTLA refers to are expecting.

kasperdanielhansen commented 3 years ago

To me, Aaron's example is about colnames(mae) <- which is directly doing something to the MAE, whereas Levi's SE example is about colnames(assay(se)) <- where it goes through the accessor function.

Perhaps I am misunderstanding something. I have not run the code myself, so it is entirely possible I am confused.

On Thu, Jan 7, 2021 at 12:20 PM Levi Waldron notifications@github.com wrote:

Looking at SummarizedExperiment as a template, it throws an error when attempting to change dimnames in the assay but automatically propagates changes to the rownames of colData:

suppressPackageStartupMessages(library(SummarizedExperiment)) example("SummarizedExperiment", echo = FALSE)

attempting to rename assay columns causes error

colnames(assay(rse2)) <- paste0("X", colnames(assay(rse2)))#> Error in assays<-(*tmp*, withDimnames = withDimnames, ..., value = *vtmp*): please use 'assay(x, withDimnames=FALSE)) <- value' or 'assays(x,#> withDimnames=FALSE)) <- value' when the dimnames on the supplied#> assay(s) are not identical to the dimnames on#> RangedSummarizedExperiment object 'x'

withDimnames = FALSE causes this renaming attempt to be ignored:

colnames(assay(rse2, withDimnames = FALSE)) <- paste0("ignored", colnames(assay(rse2))) colnames(assay(rse2))#> [1] "A" "B" "C" "D" "E" "F"

Note that the withDimnames = FALSE allows one to break the assay-colData relationship:

assay(rse2, withDimnames = FALSE) <- assay(rse2)[, ncol(assay(rse2)):1]

But renaming rows of colData is automatically propagated to the assay:

rownames(colData(rse2)) <- paste0("X", rownames(colData(rse2))) colnames(assay(rse2))#> [1] "XA" "XB" "XC" "XD" "XE" "XF" rownames(colData(rse2))#> [1] "XA" "XB" "XC" "XD" "XE" "XF"

This behavior is the same in LTLA's example:modified <- colData(rse2)

rownames(modified) <- tolower(rownames(modified)) colData(rse2) <- modified colnames(assay(rse2))#> [1] "xa" "xb" "xc" "xd" "xe" "xf" rownames(colData(rse2))#> [1] "xa" "xb" "xc" "xd" "xe" "xf"

We've usually tried to follow SummarizedExperiment as a template, and it does seem reasonable here too, except that 1) modifying the rownames(colData(mae)) would then modify the sampleMap rather than the assay colnames, and 2) I'm not sure there's a rationale for the withDimnames=FALSE option in this context. It's probably true that there are few if any use cases for wanting to drop data by renaming dims, and emulating SummarizedExperiment behavior here would avoid that and may be closer to what @kasperdanielhansen https://github.com/kasperdanielhansen and the users @LTLA https://github.com/LTLA refers to are expecting.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/waldronlab/MultiAssayExperiment/issues/289#issuecomment-756055897, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH2DZA4I5KTS2M2MF2DSYWKJVANCNFSM4U3TZYNQ .

-- Best, Kasper

lwaldron commented 3 years ago

To me, Aaron's example is about colnames(mae) <- which is directly doing something to the MAE, whereas Levi's SE example is about colnames(assay(se)) <- where it goes through the accessor function.

Both were using accessor/setter functions, the first post using colData and the second using [[. There's no setter colnames(mae) <- implemented.

LiNk-NY commented 3 years ago

Hi all, @lwaldron @kasperdanielhansen @LTLA

Thanks for chiming in. I see the point that Kasper is making. We don't have a formal colnames<- replacement method. I've worked on one to take either a list or CharacterList of renamed sample names. The caveat is that the lengths() have to match in both the MultiAssayExperiment and the provided renaming value.

cc <- colnames(mae)
cc[[1]] <- toupper(colnames(mae)[[1]])
colnames(mae) <- cc
colnames(mae)
#' CharacterList of length 4
#' [["Affy"]] ARRAY1 ARRAY2 ARRAY3 ARRAY4
#' [["Methyl450k"]] methyl1 methyl2 methyl3 methyl4 methyl5
#' [["RNASeqGene"]] samparray1 samparray2 samparray3 samparray4
#' [["GISTIC"]] samp0 samp1 samp2

I have also included helper functions e.g., for doing the same on one experiment. This should not introduce any breaking changes unless there are users who are not following our API (colnames(x)<-value was not implemented before). See the changes here: https://github.com/waldronlab/MultiAssayExperiment/compare/master..renames

Best, Marcel

LTLA commented 3 years ago

Our original intent in providing a harmonization function was for convenience to the user with dangling data not listed in the sampleMap or colData.

I would suggest that you limit that to the constructor. Currently, the harmonization procedure seems to be called everywhere and it gives the class complete license to discard data at any time. There are only a few operations where loss of data is expected, i.e., [; any other time, it is unlikely to be a desirable effect, as described above for renaming.

Of course, we can't prevent the user from trying all sorts of things with the internals of MultiAssayExperiment.

None of the operations above involved the class internals. All of them used publicly exported setters, e.g., [[<-, colnames<-. These should behave in a coherent and unsurprising manner. Loss of data is not expected from a simple renaming.

modifying the rownames(colData(mae)) would then modify the sampleMap rather than the assay colnames

This should be the correct behavior, as discussed in #288.

I'm not sure there's a rationale for the withDimnames=FALSE option in this context.

The withDimnames=FALSE is just an analogy. In your case, most functions would have a harmonize= option that defaults to FALSE but people could set to TRUE to do "unsafe" (i..e., data discarding) operations if they know what they're doing. If they don't know what they're doing, then it is probably not correct to automatically harmonize.

LiNk-NY commented 3 years ago

Hi Aaron, @LTLA

I would suggest that you limit that to the constructor. Currently, the harmonization procedure seems to be called everywhere and it gives the class complete license to discard data at any time. There are only a few operations where loss of data is expected, i.e., [; any other time, it is unlikely to be a desirable effect, as described above for renaming.

We harmonize during construction and when the sampleMap, ExperimentList, or colData is replaced. This is to keep the data in the class tidy and traceable. If you add or remove rows in the colData that should be reflected in the sampleMap, without .harmonize this wouldn't be the case. Running harmonize post object-creation maintains consistency, avoids invalid objects, and prevents further problems down the road when doing manipulations. Users can still shoot themselves in the foot if these operations are not properly used. I understand that you may need to rename your samples during an analysis and I have provided colnames<-,MultiAssayExperiment-method along with renamePrimary and renameColname to help.

None of the operations above involved the class internals. All of them used publicly exported setters, e.g., [[<-, colnames<-. These should behave in a coherent and unsurprising manner. Loss of data is not expected from a simple renaming.

I mean to say that these renaming operations are outside of the provided API. The only valid exported setter is [[<-, colnames<-,MultiAssayExperiment-method does not exist (yet) and it's using the internal class's method.

modifying the rownames(colData(mae)) would then modify the sampleMap rather than the assay colnames

This should be the correct behavior, as discussed in #288.

We would like to keep renaming and replacement operations separate to avoid trampling on the sampleMap hence the new renamePrimary helper. It would be hard to separate these operations if you were to use colData<- with quite a different colData. As a reminder, a clean and ready-to-go colData would avoid these issues.

cc: @kasperdanielhansen @lwaldron

LTLA commented 3 years ago

Running harmonize post object-creation maintains consistency, avoids invalid objects, and prevents further problems down the road when doing manipulations.

It only "avoids invalid objects" by removing data, which causes another set of problems down the road. You can't really claim that your object is meaningfully consistent when all the data is gone!

If you really wanted to avoid problems, you would throw an error right away to force the user to clarify any ambiguities or discrepancies, rather than harmonizing in the background and undermining downstream analyses by stripping out all the data.

I understand that you may need to rename your samples during an analysis and I have provided colnames<-,MultiAssayExperiment-method along with renamePrimary and renameColname to help.

The renaming is just a symptom. The cause is the fact that your data structure does not behave coherently with combinations of manipulations, e.g., renaming an individual experiment, renaming the colData, replacing the colData.

How many of these combinations can you expect to plug with dedicated functions? Wouldn't it be a lot better to make sure that each manipulation produces a coherent result, such that combinations of those operations are also coherent?

I mean to say that these renaming operations are outside of the provided API.

The "provided API" includes all combinations of exported operations. You supplied a [[<- method for experiment-level replacement, and so I am fully entitled to use it. In fact, colnames(mae[[1]]) <- new_names just decomposes to:

tmp <- mae[[1]]
colnames(tmp) <- new_names
mae[[1]] <- tmp

We would like to keep renaming and replacement operations separate to avoid trampling on the sampleMap hence the new renamePrimary helper. It would be hard to separate these operations if you were to use colData<- with quite a different colData. As a reminder, a clean and ready-to-go colData would avoid these issues.

If you want to preserve the validity of the sampleMap, then it's very easy to do so.

For colData<-:

value must be a DF with the same number of rows as whatever it's replacing (and non-duplicated row names).
Create a mapping between the old names and the new names.
Replace all values in sampleMap$primary with their new names.

For [[<-.

[[<- must be an object with the same number of columns as whatever it's replacing, and non-duplicated column names.
Create a mapping between the old names and the new names.
Replace all values in sampleMap$colname with their new names.

This strategy maintains consistency without the need for harmonization. If you must do a replacement with [[<- with a different number of columns, it would be best to have a completely different function. This is a data-discarding operation that either removes information from the sampleMap or from value, and so it deserves more explicit acknowledgement than a few easily-ignored messages when harmonization is run in the background.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

waldronlab / MultiAssayExperiment

Harmonization should be more noisy and complaining (or better yet, an error) #289

attempting to rename assay columns causes error

withDimnames = FALSE causes this renaming attempt to be ignored:

Note that the withDimnames = FALSE allows one to break the assay-colData relationship:

But renaming rows of colData is automatically propagated to the assay:

This behavior is the same in LTLA's example:modified <- colData(rse2)

`withDimnames = FALSE` causes this renaming attempt to be ignored:

Note that the `withDimnames = FALSE` allows one to break the assay-colData relationship: