Closed LTLA closed 3 years ago
I would be hesitant to create a new default like harmonize=FALSE
for the amount of existing code it would break, although a warning with a pointer to ?prepMultiAssay
would be less painful. Just curious, have you tried or do you find useful prepMultiAssay()
for troubleshooting construction?
The problem is not (only) construction; it's in the innocuous assignment operations that make data disappear. Any attempt to rename columns (or the MAE or SE) is particularly dangerous. For example, if I decided that I'd like to rename my SE's columns:
colnames(mae[[1]]) <- paste0("X", colnames(mae[[1]]))
## harmonizing input:
## removing 1000 sampleMap rows with 'colname' not in colnames of experiments
## removing 10 colData rownames not in sampleMap 'primary'
I can't imagine a situation where the above behavior is a desirable outcome. I would prefer an error along the lines of "names in experiment 1 are different from those in sampleMap". I would go so far as to say that - constructors aside - any user-level code that breaks with a hypothetical harmonize=FALSE
setting was probably not doing the right thing in the first place.
An analogous change in SummarizedExperiment is the recent withDimnames=FALSE
requirement in the assay<-()
setter, if you want to put in a matrix with different dimnames from the SE. In the last release, failing to do so would cause a warning; now it causes an error, because it probably wasn't the right thing to do unless you knew what you were doing.
Hi Aaron, @LTLA Thanks for expressing your concerns.
Our original intent in providing a harmonization function was for convenience to the user with dangling data not listed in the sampleMap
or colData
. We also encourage the use of prepMultiAssay
for new users who are not as familiar with MultiAssayExperiment
as you are.
Of course, we can't prevent the user from trying all sorts of things with the internals of MultiAssayExperiment
.
We can however provide helper functions that make the renaming a bit easier. These would be named something like renamePrimary
and renameColnames
(in reference to the columns in the sampleMap
).
Other than that I would really recommend that the data be pre-processed before MultiAssayExperiment
construction so to avoid renaming post-creation.
Best,
Marcel
Renaming samples in a SummarizedExperiment is a common operation that I at least use all the time. I don't think it is some esoteric use case where the user dig into the internals.
To me, this is an example of something where the API needs to get fixed. I understand that MAE have tons of situations which are not well covered by the standard semantics we are used to, but I find Aaron's example pretty compelling. I am not a heavy user of MAE, but I am surprised to see this.
In my opinion, this might be worth breaking backwards compatibility for.
Best, Kasper
On Wed, Jan 6, 2021 at 7:11 PM Marcel Ramos notifications@github.com wrote:
Hi Aaron, @LTLA https://github.com/LTLA Thanks for expressing your concerns.
Our original intent in providing a harmonization function was for convenience to the user with dangling data not listed in the sampleMap or colData. We also encourage the use of prepMultiAssay for new users who are not as familiar with MultiAssayExperiment as you are.
Of course, we can't prevent the user from trying all sorts of things with the internals of MultiAssayExperiment. We can however provide helper functions that make the renaming a bit easier. These would be named something like renamePrimary and renameColnames (in reference to the columns in the sampleMap).
Other than that I would really recommend that the data be pre-processed before MultiAssayExperiment construction so to avoid renaming post-creation. Best, Marcel
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/waldronlab/MultiAssayExperiment/issues/289#issuecomment-755467132, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DHZUFU4NJLFNAIEAKC3SYSRV7ANCNFSM4U3TZYNQ .
-- Best, Kasper
Looking at SummarizedExperiment as a template, it throws an error when attempting to change dimnames in the assay but automatically propagates changes to the rownames of colData:
suppressPackageStartupMessages(library(SummarizedExperiment))
example("SummarizedExperiment", echo = FALSE)
# attempting to rename assay columns causes error
colnames(assay(rse2)) <- paste0("X", colnames(assay(rse2)))
#> Error in `assays<-`(`*tmp*`, withDimnames = withDimnames, ..., value = `*vtmp*`): please use 'assay(x, withDimnames=FALSE)) <- value' or 'assays(x,
#> withDimnames=FALSE)) <- value' when the dimnames on the supplied
#> assay(s) are not identical to the dimnames on
#> RangedSummarizedExperiment object 'x'
# `withDimnames = FALSE` causes this renaming attempt to be ignored:
colnames(assay(rse2, withDimnames = FALSE)) <- paste0("ignored", colnames(assay(rse2)))
colnames(assay(rse2))
#> [1] "A" "B" "C" "D" "E" "F"
# Note that the `withDimnames = FALSE` allows one to break the assay-colData relationship:
assay(rse2, withDimnames = FALSE) <- assay(rse2)[, ncol(assay(rse2)):1]
# But renaming rows of colData is automatically propagated to the assay:
rownames(colData(rse2)) <- paste0("X", rownames(colData(rse2)))
colnames(assay(rse2))
#> [1] "XA" "XB" "XC" "XD" "XE" "XF"
rownames(colData(rse2))
#> [1] "XA" "XB" "XC" "XD" "XE" "XF"
# This behavior is the same in LTLA's example:
modified <- colData(rse2)
rownames(modified) <- tolower(rownames(modified))
colData(rse2) <- modified
colnames(assay(rse2))
#> [1] "xa" "xb" "xc" "xd" "xe" "xf"
rownames(colData(rse2))
#> [1] "xa" "xb" "xc" "xd" "xe" "xf"
We've usually tried to follow SummarizedExperiment as a template, and it does seem reasonable here too, except that 1) modifying the rownames(colData(mae))
would then modify the sampleMap rather than the assay colnames, and 2) I'm not sure there's a rationale for the withDimnames=FALSE
option in this context. It's probably true that there are few if any use cases for wanting to drop data by renaming dims, and emulating SummarizedExperiment behavior here would avoid that and may be closer to what @kasperdanielhansen and the users @LTLA refers to are expecting.
To me, Aaron's example is about colnames(mae) <- which is directly doing something to the MAE, whereas Levi's SE example is about colnames(assay(se)) <- where it goes through the accessor function.
Perhaps I am misunderstanding something. I have not run the code myself, so it is entirely possible I am confused.
On Thu, Jan 7, 2021 at 12:20 PM Levi Waldron notifications@github.com wrote:
Looking at SummarizedExperiment as a template, it throws an error when attempting to change dimnames in the assay but automatically propagates changes to the rownames of colData:
suppressPackageStartupMessages(library(SummarizedExperiment)) example("SummarizedExperiment", echo = FALSE)
attempting to rename assay columns causes error
colnames(assay(rse2)) <- paste0("X", colnames(assay(rse2)))#> Error in
assays<-
(*tmp*
, withDimnames = withDimnames, ..., value =*vtmp*
): please use 'assay(x, withDimnames=FALSE)) <- value' or 'assays(x,#> withDimnames=FALSE)) <- value' when the dimnames on the supplied#> assay(s) are not identical to the dimnames on#> RangedSummarizedExperiment object 'x'
withDimnames = FALSE
causes this renaming attempt to be ignored:colnames(assay(rse2, withDimnames = FALSE)) <- paste0("ignored", colnames(assay(rse2))) colnames(assay(rse2))#> [1] "A" "B" "C" "D" "E" "F"
Note that the
withDimnames = FALSE
allows one to break the assay-colData relationship:assay(rse2, withDimnames = FALSE) <- assay(rse2)[, ncol(assay(rse2)):1]
But renaming rows of colData is automatically propagated to the assay:
rownames(colData(rse2)) <- paste0("X", rownames(colData(rse2))) colnames(assay(rse2))#> [1] "XA" "XB" "XC" "XD" "XE" "XF" rownames(colData(rse2))#> [1] "XA" "XB" "XC" "XD" "XE" "XF"
This behavior is the same in LTLA's example:modified <- colData(rse2)
rownames(modified) <- tolower(rownames(modified)) colData(rse2) <- modified colnames(assay(rse2))#> [1] "xa" "xb" "xc" "xd" "xe" "xf" rownames(colData(rse2))#> [1] "xa" "xb" "xc" "xd" "xe" "xf"
We've usually tried to follow SummarizedExperiment as a template, and it does seem reasonable here too, except that 1) modifying the rownames(colData(mae)) would then modify the sampleMap rather than the assay colnames, and 2) I'm not sure there's a rationale for the withDimnames=FALSE option in this context. It's probably true that there are few if any use cases for wanting to drop data by renaming dims, and emulating SummarizedExperiment behavior here would avoid that and may be closer to what @kasperdanielhansen https://github.com/kasperdanielhansen and the users @LTLA https://github.com/LTLA refers to are expecting.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/waldronlab/MultiAssayExperiment/issues/289#issuecomment-756055897, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH2DZA4I5KTS2M2MF2DSYWKJVANCNFSM4U3TZYNQ .
-- Best, Kasper
To me, Aaron's example is about colnames(mae) <- which is directly doing something to the MAE, whereas Levi's SE example is about colnames(assay(se)) <- where it goes through the accessor function.
Both were using accessor/setter functions, the first post using colData
and the second using [[
. There's no setter colnames(mae) <-
implemented.
Hi all, @lwaldron @kasperdanielhansen @LTLA
Thanks for chiming in. I see the point that Kasper is making. We don't have a formal colnames<-
replacement method.
I've worked on one to take either a list
or CharacterList
of renamed sample names. The caveat is that the lengths()
have to match in both the MultiAssayExperiment
and the provided renaming value
.
cc <- colnames(mae)
cc[[1]] <- toupper(colnames(mae)[[1]])
colnames(mae) <- cc
colnames(mae)
#' CharacterList of length 4
#' [["Affy"]] ARRAY1 ARRAY2 ARRAY3 ARRAY4
#' [["Methyl450k"]] methyl1 methyl2 methyl3 methyl4 methyl5
#' [["RNASeqGene"]] samparray1 samparray2 samparray3 samparray4
#' [["GISTIC"]] samp0 samp1 samp2
I have also included helper functions e.g., for doing the same on one experiment.
This should not introduce any breaking changes unless there are users who are not following
our API (colnames(x)<-value
was not implemented before).
See the changes here: https://github.com/waldronlab/MultiAssayExperiment/compare/master..renames
Best, Marcel
Our original intent in providing a harmonization function was for convenience to the user with dangling data not listed in the
sampleMap
orcolData
.
I would suggest that you limit that to the constructor. Currently, the harmonization procedure seems to be called everywhere and it gives the class complete license to discard data at any time. There are only a few operations where loss of data is expected, i.e., [
; any other time, it is unlikely to be a desirable effect, as described above for renaming.
Of course, we can't prevent the user from trying all sorts of things with the internals of
MultiAssayExperiment
.
None of the operations above involved the class internals. All of them used publicly exported setters, e.g., [[<-
, colnames<-
. These should behave in a coherent and unsurprising manner. Loss of data is not expected from a simple renaming.
modifying the
rownames(colData(mae))
would then modify the sampleMap rather than the assay colnames
This should be the correct behavior, as discussed in #288.
I'm not sure there's a rationale for the
withDimnames=FALSE
option in this context.
The withDimnames=FALSE
is just an analogy. In your case, most functions would have a harmonize=
option that defaults to FALSE
but people could set to TRUE
to do "unsafe" (i..e., data discarding) operations if they know what they're doing. If they don't know what they're doing, then it is probably not correct to automatically harmonize.
Hi Aaron, @LTLA
I would suggest that you limit that to the constructor. Currently, the harmonization procedure seems to be called everywhere and it gives the class complete license to discard data at any time. There are only a few operations where loss of data is expected, i.e.,
[
; any other time, it is unlikely to be a desirable effect, as described above for renaming.
We harmonize during construction and when the sampleMap
, ExperimentList
, or colData
is replaced. This is to keep the data in the class tidy and traceable. If you add or remove rows in the colData
that should be reflected in the sampleMap
, without .harmonize
this wouldn't be the case. Running harmonize post object-creation maintains consistency, avoids invalid objects, and prevents further problems down the road when doing manipulations. Users can still shoot themselves in the foot if these operations are not properly used. I understand that you may need to rename your samples during an analysis and I have provided colnames<-,MultiAssayExperiment-method
along with renamePrimary
and renameColname
to help.
None of the operations above involved the class internals. All of them used publicly exported setters, e.g.,
[[<-
,colnames<-
. These should behave in a coherent and unsurprising manner. Loss of data is not expected from a simple renaming.
I mean to say that these renaming operations are outside of the provided API. The only valid exported setter is [[<-
, colnames<-,MultiAssayExperiment-method
does not exist (yet) and it's using the internal class's method.
modifying the
rownames(colData(mae))
would then modify the sampleMap rather than the assay colnamesThis should be the correct behavior, as discussed in #288.
We would like to keep renaming and replacement operations separate to avoid trampling on the sampleMap
hence the new renamePrimary
helper. It would be hard to separate these operations if you were to use colData<-
with quite a different colData
. As a reminder, a clean and ready-to-go colData
would avoid these issues.
cc: @kasperdanielhansen @lwaldron
Running harmonize post object-creation maintains consistency, avoids invalid objects, and prevents further problems down the road when doing manipulations.
It only "avoids invalid objects" by removing data, which causes another set of problems down the road. You can't really claim that your object is meaningfully consistent when all the data is gone!
If you really wanted to avoid problems, you would throw an error right away to force the user to clarify any ambiguities or discrepancies, rather than harmonizing in the background and undermining downstream analyses by stripping out all the data.
I understand that you may need to rename your samples during an analysis and I have provided
colnames<-,MultiAssayExperiment-method
along withrenamePrimary
andrenameColname
to help.
The renaming is just a symptom. The cause is the fact that your data structure does not behave coherently with combinations of manipulations, e.g., renaming an individual experiment, renaming the colData
, replacing the colData
.
How many of these combinations can you expect to plug with dedicated functions? Wouldn't it be a lot better to make sure that each manipulation produces a coherent result, such that combinations of those operations are also coherent?
I mean to say that these renaming operations are outside of the provided API.
The "provided API" includes all combinations of exported operations. You supplied a [[<-
method for experiment-level replacement, and so I am fully entitled to use it. In fact, colnames(mae[[1]]) <- new_names
just decomposes to:
tmp <- mae[[1]]
colnames(tmp) <- new_names
mae[[1]] <- tmp
We would like to keep renaming and replacement operations separate to avoid trampling on the
sampleMap
hence the newrenamePrimary
helper. It would be hard to separate these operations if you were to usecolData<-
with quite a differentcolData
. As a reminder, a clean and ready-to-gocolData
would avoid these issues.
If you want to preserve the validity of the sampleMap
, then it's very easy to do so.
For colData<-
:
value
must be a DF with the same number of rows as whatever it's replacing (and non-duplicated row names).sampleMap$primary
with their new names.For [[<-
.
[[<-
must be an object with the same number of columns as whatever it's replacing, and non-duplicated column names.sampleMap$colname
with their new names.This strategy maintains consistency without the need for harmonization. If you must do a replacement with [[<-
with a different number of columns, it would be best to have a completely different function. This is a data-discarding operation that either removes information from the sampleMap
or from value
, and so it deserves more explicit acknowledgement than a few easily-ignored messages when harmonization is run in the background.
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
The dreaded "Harmonizing input" message is too glib for what it does, which is effectively to discard data from the object. Usually this happens in response to a relatively innocuous oversight, e.g., #287 or below:
And suddenly I have an object with no data. This is not intentional and, in fact, I would say that I have never encountered an instance where the automatic harmonization produced an object that I actually wanted. Why not emit an outright error? Users can just subset the
mae
explicitly to get rid of things we don't want, which is a more explicit approach to dropping data.If there is a valid application of the auto-harmonization, it should require an explicit specification, e.g.,
harmonize=TRUE
. One can imagine this being the case forsampleMap<-()
orexperiments<-()
. You might even be able to provide sensible error messages customized for each setter, e.g., the above call might raise an errror with the message "mismatch in rownames of colData; if you're sure you want this, setharmonize=TRUE
."These comments come from answering a lot of questions about the MAE from internal users who just see their data... disappear. It's hard enough to get them to pay attention to warnings, no one's going to pay any attention to a message.