Closed PeteHaitch closed 8 years ago
Your arguments make sense, with a couple caveats. I think that supporting subsetting by a list would not require major changes, because current subset methods first use getHits() to create a list anyways, which is then used for the actual subset (Marcel please chime in if there are other considerations I haven't thought of). The caveats I can think of are:
myMAE["someID", ]
, "someID" is dropped in the returned list elements of getHits() for any assays where it doesn't match any rowname. This allows subsetting across assays that are indexed by differing sets of IDs. If you did myMAE[1:10, ]
, what behavior would you expect if there is an assay with only four rows? An error, or those four rows? The error is probably what I would expect, although the latter would be consistent with the character subsetting behavior of MultiAssayExperiment. And what about myMAE[c(rep(TRUE, 5), rep(FALSE, 5)), ]
. Should the logical truncate or be recycled for assays that don't have 10 rows, or error if Elist elements have differing numbers of rows?However if all you really want is to be able to subset using a list of the same length as Elist, rather than on integer or logical vectors, then these aren't issues. The subsetting vector in each list element would simply be passed on to subset corresponding Elist element, whatever the outcome of those subset operations may be.
What you mention in (2) is an important point. Recycling will often lead to unexpected behaviour from the user's perspective - I've certainly been caught out by it! As for indexing beyond the dimensions of the object, there's different behaviour for different base R objects :( My instinct is an explicit error is better than it "working" but giving possibly unexpected results
> m <- matrix(1:10, ncol = 2)
> m[1:6, ]
Error in m[1:6, ] : subscript out of bounds
> as.data.frame(m)[1:6, ]
V1 V2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
NA NA NA
I'm sure you've got examples, but I'm wondering how often subsetting by a character i
will be useful across assays? That is, how often will rows have the exact same rowname across assays? In contrast, I can certainly see subsetting by a GRanges
or GRangesList
being commonly used across assays and being super helpful. These comments probably reflect my bias of mostly analysing ranged-based assays rather than "named features" assays.
OK, I think we agree that row subsetting by integer or logical should error if there would be any recycling or out-of-bounds subsets. Then logical could only be used for datasets of the same number of rows, and would probably only be useful for identically ordered rows. Integer would probably only be used for doing something like head().
Row subsetting by character can often be useful: currently, as long as you use the same feature identifiers for each assay, and perhaps in the future, more generally (@seandavi discussed ideas for automating the mapping of gene and probeset identifiers a while back, among each other and to/from GRanges, and opened issues #37 through #42 on the topic). One multi-assay but not multi-omics application of interest to me would be meta-analysis of microarray experiments from a curated package like my curatedOvarianData.
By the way, there's no need for rows to have the exact same rownames across assays, only that there is some overlap such as if you're consistently using Entrez identifiers, in order for character subsetting to be useful across assays.
Thanks for those examples, Levi. I now have a better understanding of the names you envisage being useful as rownames of feature-based assays
These are reasonable concerns but I wonder if laziness is the answer. Consider an MAE that is just a bunch of RangedSummarizedExperiments. Any indexing that works for the RSEs should work for the MAE. We should not rule out non-character i but elements of the Elist that shouldn't respond to such an i should throw error. The user then subsets the MAE to the assays to those for which the request works, or reformulates the query.
On Thu, Feb 18, 2016 at 8:31 AM, Peter Hickey notifications@github.com wrote:
I think it's worth considering adding support for subsetting rows of MultiAssayExperiment by non-character i, specifically logical and numeric i. As I thought this through, I realised that this might require a bit more work than I'd first imagined, such as supporting i being a list or List (see below).
Here is a use case where I see it could be useful to support non-character i. Suppose you want to subset by rows matching a regular expression. For example, to find all gene names beginning with "CDKN". A list index could be created using lapply(rownames(MAE), function(x) {grep("^CDKN", x, value = TRUE)}). The user would then need to unlist() this to pass it as i in the subsetting operation. The two drawbacks I see here are:
- The user might be relying on some code that returns these matches as a logical or a numeric vector, e.g., where value = FALSE in the above grep() (perhaps indirectly called by the user buried within an exported function).
- Even when, like in the above, the operation returns an object containing only "characters" (or more formally in the form of a S4Vectors::CharacterList), the user is required to unlist() this prior to subsetting.
The second point makes me wonder whether [,MultiAssayExperiment-method needs to support i being a list or List of the appropriate length and having elements of the appropriate atomic type?
The other scenario I mentioned in the conference call where I imagine the need to subset by non-character i is to grab the first n rows of the MAE. As Levi pointed out, and which I hadn't considered, there is currently no real ordering of rows between assays, but I still imagine there might be scenarios where this might be desirable (for example, if the user constructed an ordering of rows between assays in their particular MAE object). There may also be a more formal ordering/linking of rows between assays in the work that Kasper and I are experimenting with.
To summarise, while in general I think a user could construct the appropriate character vector from their numeric or logical vector or list or List, I think it would be great if the subsetting method natively supported as many of these scenarios as possible/reasonable.
— Reply to this email directly or view it on GitHub https://github.com/vjcitn/MultiAssayExperiment/issues/108.
looks like i missed the intervening discussion so ... maybe take my remarks with grain of salt, i need to read further down.
On Fri, Feb 19, 2016 at 8:43 AM, Vincent Carey stvjc@channing.harvard.edu wrote:
These are reasonable concerns but I wonder if laziness is the answer. Consider an MAE that is just a bunch of RangedSummarizedExperiments. Any indexing that works for the RSEs should work for the MAE. We should not rule out non-character i but elements of the Elist that shouldn't respond to such an i should throw error. The user then subsets the MAE to the assays to those for which the request works, or reformulates the query.
On Thu, Feb 18, 2016 at 8:31 AM, Peter Hickey notifications@github.com wrote:
I think it's worth considering adding support for subsetting rows of MultiAssayExperiment by non-character i, specifically logical and numeric i. As I thought this through, I realised that this might require a bit more work than I'd first imagined, such as supporting i being a list or List (see below).
Here is a use case where I see it could be useful to support non-character i. Suppose you want to subset by rows matching a regular expression. For example, to find all gene names beginning with "CDKN". A list index could be created using lapply(rownames(MAE), function(x) {grep("^CDKN", x, value = TRUE)}). The user would then need to unlist() this to pass it as i in the subsetting operation. The two drawbacks I see here are:
- The user might be relying on some code that returns these matches as a logical or a numeric vector, e.g., where value = FALSE in the above grep() (perhaps indirectly called by the user buried within an exported function).
- Even when, like in the above, the operation returns an object containing only "characters" (or more formally in the form of a S4Vectors::CharacterList), the user is required to unlist() this prior to subsetting.
The second point makes me wonder whether [,MultiAssayExperiment-method needs to support i being a list or List of the appropriate length and having elements of the appropriate atomic type?
The other scenario I mentioned in the conference call where I imagine the need to subset by non-character i is to grab the first n rows of the MAE. As Levi pointed out, and which I hadn't considered, there is currently no real ordering of rows between assays, but I still imagine there might be scenarios where this might be desirable (for example, if the user constructed an ordering of rows between assays in their particular MAE object). There may also be a more formal ordering/linking of rows between assays in the work that Kasper and I are experimenting with.
To summarise, while in general I think a user could construct the appropriate character vector from their numeric or logical vector or list or List, I think it would be great if the subsetting method natively supported as many of these scenarios as possible/reasonable.
— Reply to this email directly or view it on GitHub https://github.com/vjcitn/MultiAssayExperiment/issues/108.
Formalizing this action item as I understand the consensus on the conversation:
That seems a reasonable summary to me.
Don’t disallow recycling of logical vectors - just pass them on. Replace error with a warning when recycling for rows.
Subsetting columns by list / SimpleList / CharacterList isn't working (example from the vignette):
> allsamples <- colnames(myMultiAssay)
> allsamples
CharacterList of length 5
[["Affy"]] array1 array2 array3 array4
[["Methyl 450k"]] methyl1 methyl2 methyl3 methyl4 methyl5
[["Mirna"]] micro1 micro2 micro3
[["CNV gistic"]] snparray1 snparray2 snparray3
[["CNV gistic2"]] mysnparray1 mysnparray2 mysnparray3 mysnparray4
> allsamples[["Methyl 450k"]] <- allsamples[["Methyl 450k"]][1:2]
> subset(myMultiAssay, allsamples, "columns")
Error in rownames(pData(x))[y] : invalid subscript type 'S4'
> myMultiAssay[, as.list(allsamples), ]
Error in rownames(pData(x))[y] : invalid subscript type 'list'
>
Using MultiAssayExperiment_0.7.0. Make sure they work for both rows and columns, and add unit tests for all combinations.
Also, add this functionality to the API wiki.
Should a subsetting section be added to the API?
Yes, would you do the honors? On Mar 17, 2016 7:05 PM, "Marcel Ramos" notifications@github.com wrote:
Should a subsetting section be added to the API?
— You are receiving this because you modified the open/close state. Reply to this email directly or view it on GitHub https://github.com/vjcitn/MultiAssayExperiment/issues/108#issuecomment-198006559
There is a section for subsetting a MultiAssayExperiment in the API wiki now
I think it's worth considering adding support for subsetting rows of MultiAssayExperiment by non-character
i
, specifically logical and numerici
. As I thought this through, I realised that this might require a bit more work than I'd first imagined, such as supportingi
being alist
orList
(see below).Here is a use case where I see it could be useful to support non-character
i
. Suppose you want to subset by rows matching a regular expression. For example, to find all gene names beginning with "CDKN". Alist
index could be created usinglapply(rownames(MAE), function(x) {grep("^CDKN", x, value = TRUE)})
. The user would then need tounlist()
this to pass it asi
in the subsetting operation. The two drawbacks I see here are:value = FALSE
in the abovegrep()
(perhaps indirectly called by the user buried within an exported function).S4Vectors::CharacterList
), the user is required tounlist()
this prior to subsetting.The second point makes me wonder whether
[,MultiAssayExperiment-method
needs to supporti
being alist
orList
of the appropriate length and having elements of the appropriate atomic type?The other scenario I mentioned in the conference call where I imagine the need to subset by non-character
i
is to grab the firstn
rows of the MAE. As Levi pointed out, and which I hadn't considered, there is currently no real ordering of rows between assays, but I still imagine there might be scenarios where this might be desirable (for example, if the user constructed an ordering of rows between assays in their particular MAE object). There may also be a more formal ordering/linking of rows between assays in the work that Kasper and I are experimenting with.To summarise, while in general I think a user could construct the appropriate character vector from their numeric or logical vector or
list
orList
, I think it would be great if the subsetting method natively supported as many of these scenarios as possible/reasonable.