support saveHDF5MultiAssayExperiment with HDF5Array

waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data

https://waldronlab.io/MultiAssayExperiment/

69 stars 32 forks source link

support saveHDF5MultiAssayExperiment with HDF5Array #296

Closed LiNk-NY closed 3 years ago

LiNk-NY commented 3 years ago

Hi Hervé, @hpages Your insight in the usage of these functions is appreciated. Perhaps there can be some changes made to make these more general purpose for easier adaptability. I am currently importing with triple colons a number of internal helpers to make this work. Let me know what you think. Thank you! -Marcel

hpages commented 3 years ago

@LiNk-NY Hey Marcel,

I just took a look at this.

A big surprise to me is that the individual experiments loose their shell! For example in the example you provide, after calling saveHDF5MultiAssayExperiment() on miniACC, the 1st experiment in the object (miniACC[[1]]), which is a SummarizedExperiment object with 1 assay, is replaced with a naked HDF5Matrix. This means that all the annotations carried by the SE object (rowData, colData, metadata) are lost. Is this intended?

More generally speaking I don't understand why this saving operation needs to be lossy (as stated in the man page).

Also:

se2 <- SummarizedExperiment(list(A=array(1:24, 4:2), B=matrix(1:12, ncol=3)))
mae2 <- MultiAssayExperiment(ExperimentList(one_more_se=se2))
saveHDF5MultiAssayExperiment(mae2, "mae2")
# Error in validObject(ans) : invalid class “ExperimentList” object: 
#     Element [1] of class 'array' does not have compatible method(s): [

2 cosmetic notes:

The tile of the man page for saveHDF5MultiAssayExperiment is the full \description section. This is a very long title made of 4 sentences spanning 6 lines.
Don't you need a \value section?

LiNk-NY commented 3 years ago

Hi Hervé, @hpages

Thank you for taking a look! I appreciate the feedback.

I used the assays,MultiAssayExperiment-method to extract the data from the experiments which converts everything to a matrix. MultiAssayExperiment does not really enforce a contract for the individual experiments to have an assay-method. The ideal scenario is to keep all the data but for simplicity's sake I've used assays for now. I will look into keeping the valuable annotations of the individual experiments.

I'm not sure how to handle more than 2D matrices. assays currently only returns the first one. Also, it's quite confusing how SummarizedExperiment represents these arrays:

> assay(se2, i = 1)
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24

> assay(se2, i = 2)
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

But I understand your point that it should work with the array,[-method.

Best, Marcel

hpages commented 3 years ago

I don't know, it seems that MultiAssayExperiment was designed around the assumption that SummarizedExperiment objects can only have one assay:

se2 <- SummarizedExperiment(list(A=array(1:24, 4:2), B=matrix(1:12, ncol=3)))
mae2 <- MultiAssayExperiment(ExperimentList(one_more_se=se2))
miniACC2 <- c(miniACC, mae2)
validObject(miniACC2, complete=TRUE)
# [1] TRUE
assays(miniACC2)
# Error in validObject(ans) : invalid class “ExperimentList” object: 
#     Element [6] of class 'array' does not have compatible method(s): [

Also, it's quite confusing how SummarizedExperiment represents these arrays

Not sure what you mean.

lwaldron commented 3 years ago

se2 <- SummarizedExperiment(list(A=array(1:24, 4:2), B=matrix(1:12, ncol=3)))

Both of the errors you identified come not from this having more than one assay, but from something that confused me too: that array(1:24, 4:2) errors on two-dimensional subsetting (e.g. [1, 1]), but the SummarizedExperiment it's wrapped in does not. SummarizedExperiment applies [1, 1] to the first two dimensions of the array, ignoring the 3rd dimension. As a result, the SummarizedExperiment is compatible with MultiAssayExperiment, but the array returned by assay(se2) isn't. The issue would extend to any other way an object might support obj[i, j] and assay(obj[i, j]), but not assay(obj)[i, j]. Such examples with a non-rectangular assay(obj) are going to be incompatible with reshaping operations like wideFormat, so unless there's a use case for such ExperimentList data, a safe response would be just to extend MAE's [i, j] validity check to assay(obj)[i, j]. This could break though other data management applications where the non-rectangular assays just stay hidden in their object.

hpages commented 3 years ago

Not sure why it would matter to the assays() method for MultiAssayExperiment objects that the assays are not compatible with MultiAssayExperiment. Maybe some operations like wideFormat() can't be supported on MultiAssayExperiment objects that have assays with more than 2 dimensions, but that seems better than not accepting SummarizedExperiment objects with such assays (some SummarizedExperiment derivatives like VCF objects make heavy use of this kind of assay).

According to ?ExperimentList:

The ‘ExperimentList’ class can contain several different types of data. The only requirements for an 'ExperimentList' class are that the objects contained have the following set of methods: 'dim', '[', 'dimnames'.

Any SummarizedExperiment object satisfies these requirements, even when it has assays with more than 2 dimensions. In other words SummarizedExperiment objects are always considered rectangular, whatever their content.

Note that I'm also running into issues when the 2 assays in the SE object are matrices:

se1 <- SummarizedExperiment(list(A=matrix(1:12, ncol=3), B=matrix(13:24, ncol=3)))

assays(se1)
# List of length 2
# names(2): A B

dim(assays(se1)[[1]])
# [1] 4 3
dim(assays(se1)[[2]])
# [1] 4 3

but then:

mae1 <- MultiAssayExperiment(ExperimentList(one_more_se=se1))
# Warning message:
# In .sampleMapFromData(colData, experiments) :
#  colData rownames and ExperimentList colnames are empty

assays(mae1)  # returns only 1 assay!
# List of length 1
# names(1): one_more_se

dim(assays(mae1)[[1]])  # assay is empty!
# [1] 4 0

Trying to retrieve the original SE object also gives me an object with empty assays:

experiments(mae1)[[1]]
# class: SummarizedExperiment 
# dim: 4 0 
# metadata(0):
# assays(2): A B
# rownames: NULL
# rowData names(0):
# colnames: NULL
# colData names(0):

dim(experiments(mae1)[[1]])
# [1] 4 0

LiNk-NY commented 3 years ago

Hi Hervé, @hpages

What doesn't make sense to me is that the SummarizedExperiment-class documentation says that it is matrix-like but in reality it is array-like because it supports

se2 <- SummarizedExperiment(list(A=array(1:24, 4:2), B=matrix(1:12, ncol=3)))
assay(se2, i = 1)
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24

I think a better representation would be to create three assays instead of 2 to account for the third dimension in the A array. This would return an object that looks like below. I am not sure how people work with the third 'hidden' dimension when introducing an array.

# se3 <- SummarizedExperiment(list(A1=array(1:12, 4:3), A2 = array(13:24, 4:3), B=matrix(1:12, ncol=3)))
se3 <- SummarizedExperiment(list(A=array(1:24, 4:2), B=matrix(1:12, ncol=3)))
> se3
class: SummarizedExperiment 
dim: 4 3 
metadata(0):
assays(3): A1 A2 B
rownames: NULL
rowData names(0):
colnames: NULL
colData names(0):

Also, this example needs colnames in se1 for it to work:

mae1 <- MultiAssayExperiment(ExperimentList(one_more_se=se1))

Best, Marcel

hpages commented 3 years ago

Slicing the 3D array in 2D-slices is hacky and would be inconvenient and inefficient if that were to produce a high number of slices. Things would quickly become ugly with a 4D or 5D array or more. For example you would end up with 1 million assays if the dimensions of the original array is 150 x 200 x 100 x 100 x 100. Hopefully you have names along dimensions 3, 4, and 5, so you can combine/mangle them to generate the assay names. But that wouldn't be pretty nor convenient.

When the doc says that SummarizedExperiment objects are matrix-like objects I suppose that what is meant is that these objects have 2 dimensions, and support 2D-style subsetting (se[i, j]) and things like rbind(), cbind(). I personally prefer to use the term "rectangular" instead of "matrix-like" here because SummarizedExperiment objects don't closely follow the matrix contract e.g. the length of an SE object is nrow(se) but it's prod(dim(m)) for a matrix. Also SE objects don't support transposition. I could modify the man page (I didn't write it). FWIW last year I formalized the notion of rectangular objects by introducing the RectangularData class in S4Vectors. The SummarizedExperiment class extends RectangularData.

Anyways I'm still not sure why it matters that an assay can have more than 2 dimensions. How does that contradict the fact that SE objects are rectangular? The shape of the SE object is one thing, and the shape of its individual assays is another thing. Even when it has no assay, an SE object is still rectangular:

> SummarizedExperiment(rowData=letters, colData=LETTERS)
class: SummarizedExperiment 
dim: 26 26 
metadata(0):
assays(0):
rownames: NULL
rowData names(1): X
colnames: NULL
colData names(1): X

I'm also still not sure why all this matters from the MultiAssayExperiment/ExperimentList point of view. All that should matter is that a SummarizedExperiment object is always compatible with the ExperimentList requirements, whatever its assays are.

LiNk-NY commented 3 years ago

Hi Hervé, @hpages

Yes, if you extrapolate to those quantities, it can always become inefficient. AFAIU the typical use-case does not exceed more than 3D. One could even restrict the number of dimensions in the constructor function.

What would be good is to have some precision in the documentation. If arrays are supported, then it should say that otherwise, this is surprising behaviour.

I agree with the term "rectuangular" over matrix-like and it implies two dimensions. But note that any individual assay i can become a rectangular cubeoid when it is an array.

I think of the data in an SE as having three dimensions where the third dimension corresponds to assay i based on the schematic https://raw.githubusercontent.com/Bioconductor/SummarizedExperiment/master/vignettes/SE.svg and assay can even be zero as you showed above. IMO, when you start having cuboids in an assay then the schematic representation does not really correspond to matrix-like.

It matters because operations like assay(se2, i = 1)[1, 1] don't work. We were expecting matrix-like structures when in fact, they were arrays. It matters to MultiAssayExperiment because when working with multiple object classes, we want a straightforward way to obtain the data as a list of matrices with assays without having to make special cases for rectangular cubiods. Yes, I can always add more code to make this work but if the contract says matrix-like, it should behave like a matrix rather than an array or indicate array-like.

Best, Marcel

hpages commented 3 years ago

I think of the data in an SE as having three dimensions where the third dimension corresponds to assay i based

Me too... sometimes. But note that length(dim(se)) is always 2 and SE objects do NOT support 3D-style subsetting (se[i, j, k]). So that mental representation doesn't really help. We should just stick to the notion that SE are 2D objects. When I say that I'm not trying to enforce my vision of SE objects on you. It's just what the SummarizedExperiment API tells us e.g. things like dim(se), dimnames(se), se[i, j] reveal a 2D nature, not a 3D nature.

It matters because operations like assay(se2, i = 1)[1, 1] don't work.

Nobody said they have to. The only guarantee is that se2[1, 1] will always work, whatever number of dimensions assay(se2, i=1) has. If you use assay(se2[1, 1], i=1) instead of assay(se2, i = 1)[1, 1], you should be good.

we want a straightforward way to obtain the data as a list of matrices with assays without having to make special cases for rectangular cubiods

Again I understand that not all operations (e.g. wideFormat()) will know how to handle assays with more than 2 dimensions but "that's their problem", so to speak. It should not be a problem as far as construction and basic manipulation of MultiAssayExperiment objects are concerned. In particular I don't think that the assays() getter should care about that (the assays() getter for SummarizedExperiment objects doesn't).

lwaldron commented 3 years ago

Again I understand that not all operations (e.g. wideFormat()) will know how to handle assays with more than 2 dimensions but "that's their problem", so to speak. It should not be a problem as far as construction and basic manipulation of MultiAssayExperiment objects are concerned. In particular I don't think that the assays() getter should care about that (the assays() getter for SummarizedExperiment objects doesn't).

I think this is just an unusual use case that hasn't been raised before, and @LiNk-NY and I (and MultiAssayExperiment) assumed that if obj was rectangle-like and assay(obj) existed, then assay(obj) also would be rectangle-like. I would support erroring only when necessary in that situation, as opposed to turning that assumption into a validity check, but I don't know yet what would be involved.