waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/
67 stars 32 forks source link

Subsetting rows of MultiAssayExperiment by non-character i #108

Closed PeteHaitch closed 8 years ago

PeteHaitch commented 8 years ago

I think it's worth considering adding support for subsetting rows of MultiAssayExperiment by non-character i, specifically logical and numeric i. As I thought this through, I realised that this might require a bit more work than I'd first imagined, such as supporting i being a list or List (see below).

Here is a use case where I see it could be useful to support non-character i. Suppose you want to subset by rows matching a regular expression. For example, to find all gene names beginning with "CDKN". A list index could be created using lapply(rownames(MAE), function(x) {grep("^CDKN", x, value = TRUE)}). The user would then need to unlist() this to pass it as i in the subsetting operation. The two drawbacks I see here are:

  1. The user might be relying on some code that returns these matches as a logical or a numeric vector, e.g., where value = FALSE in the above grep() (perhaps indirectly called by the user buried within an exported function).
  2. Even when, like in the above, the operation returns an object containing only "characters" (or more formally in the form of a S4Vectors::CharacterList), the user is required to unlist() this prior to subsetting.

The second point makes me wonder whether [,MultiAssayExperiment-method needs to support i being a list or List of the appropriate length and having elements of the appropriate atomic type?

The other scenario I mentioned in the conference call where I imagine the need to subset by non-character i is to grab the first n rows of the MAE. As Levi pointed out, and which I hadn't considered, there is currently no real ordering of rows between assays, but I still imagine there might be scenarios where this might be desirable (for example, if the user constructed an ordering of rows between assays in their particular MAE object). There may also be a more formal ordering/linking of rows between assays in the work that Kasper and I are experimenting with.

To summarise, while in general I think a user could construct the appropriate character vector from their numeric or logical vector or list or List, I think it would be great if the subsetting method natively supported as many of these scenarios as possible/reasonable.

lwaldron commented 8 years ago

Your arguments make sense, with a couple caveats. I think that supporting subsetting by a list would not require major changes, because current subset methods first use getHits() to create a list anyways, which is then used for the actual subset (Marcel please chime in if there are other considerations I haven't thought of). The caveats I can think of are:

  1. Formal validity check is only that "[" is available, not anything about what it does. For some data types (e.g. GRangesList/RangedRaggedAssay), it may not be obvious what logical or integer subsetting should do. But these odder cases don't need to restrict more common use cases.
  2. When doing myMAE["someID", ], "someID" is dropped in the returned list elements of getHits() for any assays where it doesn't match any rowname. This allows subsetting across assays that are indexed by differing sets of IDs. If you did myMAE[1:10, ], what behavior would you expect if there is an assay with only four rows? An error, or those four rows? The error is probably what I would expect, although the latter would be consistent with the character subsetting behavior of MultiAssayExperiment. And what about myMAE[c(rep(TRUE, 5), rep(FALSE, 5)), ]. Should the logical truncate or be recycled for assays that don't have 10 rows, or error if Elist elements have differing numbers of rows?

However if all you really want is to be able to subset using a list of the same length as Elist, rather than on integer or logical vectors, then these aren't issues. The subsetting vector in each list element would simply be passed on to subset corresponding Elist element, whatever the outcome of those subset operations may be.

PeteHaitch commented 8 years ago

What you mention in (2) is an important point. Recycling will often lead to unexpected behaviour from the user's perspective - I've certainly been caught out by it! As for indexing beyond the dimensions of the object, there's different behaviour for different base R objects :( My instinct is an explicit error is better than it "working" but giving possibly unexpected results

> m <- matrix(1:10, ncol = 2)
> m[1:6, ]
Error in m[1:6, ] : subscript out of bounds
> as.data.frame(m)[1:6, ]
   V1 V2
1   1  6
2   2  7
3   3  8
4   4  9
5   5 10
NA NA NA

I'm sure you've got examples, but I'm wondering how often subsetting by a character i will be useful across assays? That is, how often will rows have the exact same rowname across assays? In contrast, I can certainly see subsetting by a GRanges or GRangesList being commonly used across assays and being super helpful. These comments probably reflect my bias of mostly analysing ranged-based assays rather than "named features" assays.

lwaldron commented 8 years ago

OK, I think we agree that row subsetting by integer or logical should error if there would be any recycling or out-of-bounds subsets. Then logical could only be used for datasets of the same number of rows, and would probably only be useful for identically ordered rows. Integer would probably only be used for doing something like head().

Row subsetting by character can often be useful: currently, as long as you use the same feature identifiers for each assay, and perhaps in the future, more generally (@seandavi discussed ideas for automating the mapping of gene and probeset identifiers a while back, among each other and to/from GRanges, and opened issues #37 through #42 on the topic). One multi-assay but not multi-omics application of interest to me would be meta-analysis of microarray experiments from a curated package like my curatedOvarianData.

lwaldron commented 8 years ago

By the way, there's no need for rows to have the exact same rownames across assays, only that there is some overlap such as if you're consistently using Entrez identifiers, in order for character subsetting to be useful across assays.

PeteHaitch commented 8 years ago

Thanks for those examples, Levi. I now have a better understanding of the names you envisage being useful as rownames of feature-based assays

vjcitn commented 8 years ago

These are reasonable concerns but I wonder if laziness is the answer. Consider an MAE that is just a bunch of RangedSummarizedExperiments. Any indexing that works for the RSEs should work for the MAE. We should not rule out non-character i but elements of the Elist that shouldn't respond to such an i should throw error. The user then subsets the MAE to the assays to those for which the request works, or reformulates the query.

On Thu, Feb 18, 2016 at 8:31 AM, Peter Hickey notifications@github.com wrote:

I think it's worth considering adding support for subsetting rows of MultiAssayExperiment by non-character i, specifically logical and numeric i. As I thought this through, I realised that this might require a bit more work than I'd first imagined, such as supporting i being a list or List (see below).

Here is a use case where I see it could be useful to support non-character i. Suppose you want to subset by rows matching a regular expression. For example, to find all gene names beginning with "CDKN". A list index could be created using lapply(rownames(MAE), function(x) {grep("^CDKN", x, value = TRUE)}). The user would then need to unlist() this to pass it as i in the subsetting operation. The two drawbacks I see here are:

  1. The user might be relying on some code that returns these matches as a logical or a numeric vector, e.g., where value = FALSE in the above grep() (perhaps indirectly called by the user buried within an exported function).
  2. Even when, like in the above, the operation returns an object containing only "characters" (or more formally in the form of a S4Vectors::CharacterList), the user is required to unlist() this prior to subsetting.

The second point makes me wonder whether [,MultiAssayExperiment-method needs to support i being a list or List of the appropriate length and having elements of the appropriate atomic type?

The other scenario I mentioned in the conference call where I imagine the need to subset by non-character i is to grab the first n rows of the MAE. As Levi pointed out, and which I hadn't considered, there is currently no real ordering of rows between assays, but I still imagine there might be scenarios where this might be desirable (for example, if the user constructed an ordering of rows between assays in their particular MAE object). There may also be a more formal ordering/linking of rows between assays in the work that Kasper and I are experimenting with.

To summarise, while in general I think a user could construct the appropriate character vector from their numeric or logical vector or list or List, I think it would be great if the subsetting method natively supported as many of these scenarios as possible/reasonable.

— Reply to this email directly or view it on GitHub https://github.com/vjcitn/MultiAssayExperiment/issues/108.

vjcitn commented 8 years ago

looks like i missed the intervening discussion so ... maybe take my remarks with grain of salt, i need to read further down.

On Fri, Feb 19, 2016 at 8:43 AM, Vincent Carey stvjc@channing.harvard.edu wrote:

These are reasonable concerns but I wonder if laziness is the answer. Consider an MAE that is just a bunch of RangedSummarizedExperiments. Any indexing that works for the RSEs should work for the MAE. We should not rule out non-character i but elements of the Elist that shouldn't respond to such an i should throw error. The user then subsets the MAE to the assays to those for which the request works, or reformulates the query.

On Thu, Feb 18, 2016 at 8:31 AM, Peter Hickey notifications@github.com wrote:

I think it's worth considering adding support for subsetting rows of MultiAssayExperiment by non-character i, specifically logical and numeric i. As I thought this through, I realised that this might require a bit more work than I'd first imagined, such as supporting i being a list or List (see below).

Here is a use case where I see it could be useful to support non-character i. Suppose you want to subset by rows matching a regular expression. For example, to find all gene names beginning with "CDKN". A list index could be created using lapply(rownames(MAE), function(x) {grep("^CDKN", x, value = TRUE)}). The user would then need to unlist() this to pass it as i in the subsetting operation. The two drawbacks I see here are:

  1. The user might be relying on some code that returns these matches as a logical or a numeric vector, e.g., where value = FALSE in the above grep() (perhaps indirectly called by the user buried within an exported function).
  2. Even when, like in the above, the operation returns an object containing only "characters" (or more formally in the form of a S4Vectors::CharacterList), the user is required to unlist() this prior to subsetting.

The second point makes me wonder whether [,MultiAssayExperiment-method needs to support i being a list or List of the appropriate length and having elements of the appropriate atomic type?

The other scenario I mentioned in the conference call where I imagine the need to subset by non-character i is to grab the first n rows of the MAE. As Levi pointed out, and which I hadn't considered, there is currently no real ordering of rows between assays, but I still imagine there might be scenarios where this might be desirable (for example, if the user constructed an ordering of rows between assays in their particular MAE object). There may also be a more formal ordering/linking of rows between assays in the work that Kasper and I are experimenting with.

To summarise, while in general I think a user could construct the appropriate character vector from their numeric or logical vector or list or List, I think it would be great if the subsetting method natively supported as many of these scenarios as possible/reasonable.

— Reply to this email directly or view it on GitHub https://github.com/vjcitn/MultiAssayExperiment/issues/108.

lwaldron commented 8 years ago

Formalizing this action item as I understand the consensus on the conversation:

PeteHaitch commented 8 years ago

That seems a reasonable summary to me.

lwaldron commented 8 years ago

Don’t disallow recycling of logical vectors - just pass them on. Replace error with a warning when recycling for rows.

lwaldron commented 8 years ago

Subsetting columns by list / SimpleList / CharacterList isn't working (example from the vignette):

> allsamples <- colnames(myMultiAssay)
> allsamples
CharacterList of length 5
[["Affy"]] array1 array2 array3 array4
[["Methyl 450k"]] methyl1 methyl2 methyl3 methyl4 methyl5
[["Mirna"]] micro1 micro2 micro3
[["CNV gistic"]] snparray1 snparray2 snparray3
[["CNV gistic2"]] mysnparray1 mysnparray2 mysnparray3 mysnparray4
> allsamples[["Methyl 450k"]] <- allsamples[["Methyl 450k"]][1:2]
> subset(myMultiAssay, allsamples, "columns")
Error in rownames(pData(x))[y] : invalid subscript type 'S4'
> myMultiAssay[, as.list(allsamples), ]
Error in rownames(pData(x))[y] : invalid subscript type 'list'
> 

Using MultiAssayExperiment_0.7.0. Make sure they work for both rows and columns, and add unit tests for all combinations.

lwaldron commented 8 years ago

Also, add this functionality to the API wiki.

LiNk-NY commented 8 years ago

Should a subsetting section be added to the API?

lwaldron commented 8 years ago

Yes, would you do the honors? On Mar 17, 2016 7:05 PM, "Marcel Ramos" notifications@github.com wrote:

Should a subsetting section be added to the API?

— You are receiving this because you modified the open/close state. Reply to this email directly or view it on GitHub https://github.com/vjcitn/MultiAssayExperiment/issues/108#issuecomment-198006559

LiNk-NY commented 8 years ago

There is a section for subsetting a MultiAssayExperiment in the API wiki now