Closed jonocarroll closed 6 years ago
How about a workaround of creating a zero-row assay with a column for every patient? E.g.:
> mat <- matrix(nrow=0, ncol=10, dimnames = list(NULL, LETTERS[1:10]))
> MultiAssayExperiment(list(empty=mat))
A MultiAssayExperiment object of 1 listed
experiment with a user-defined name and respective class.
Containing an ExperimentList class object of length 1:
[1] empty: matrix with 0 rows and 10 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
>
That's just a demo, I mean you would add a matrix like this to your ExperimentList
. MultiAssayExperiment
used to allow patients with no data, but Martin and Marcel cracked down on this liberalism a while back :). I think because a workaround like this is a more explicit representation and doesn't require new exceptions to be handled in the code.
One caveat here, when subsetting this object you would need to specify drop=FALSE
to avoid losing those patients with no data, e.g.:
mae[, 1:5, , drop=FALSE]
which makes me rather wish we had made drop=FALSE
the default...
Trying it out now, I actually think drop=TRUE
is okay; I don't want them in the individual assay subsets, just an overall availability.
The way I've set this up is to add a zero-row matrix assay with my patients as the rownames, then a mapping of patientID to itself
## define an empty matrix for storing _all_ patient IDs
allpatmat <- matrix(nrow = 0,
ncol = nrow(phenodata),
dimnames = list(NULL, rownames(phenodata)))
## map patient IDs to themselves
allpatmap <- data.frame(primary = rownames(phenodata),
assay = rownames(phenodata), stringsAsFactors = FALSE)
My colData
slot contains all patients phenodata.
Requesting all patients works well as colData(mae)
. Requesting a single assay returns just the samples relevant to that assay (since I have sample IDs in the mapping, not patient IDs). Requesting a single patient gives me just the real assays they are mapped into, since the sample IDs are what's used to select.
I think this is what I was after. The additional assay doesn't seem to get in the way at all. I've named it "AllPatients"
just in case
A MultiAssayExperiment object of 3 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 3:
[1] RNAseq: ExpressionSet with 2 rows and 4 columns
[2] FMI: SummarizedExperiment with 3 rows and 3 columns
[3] AllPatients: matrix with 0 rows and 6 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert ExperimentList into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
Subsetting to just a few patients doesn't get contaminated with the AllPatients
assay (PAT4
is not listed in any assay here)
mae[, c("PAT6", "PAT8"),]
harmonizing input:
removing 8 sampleMap rows with 'colname' not in colnames of experiments
removing 4 colData rownames not in sampleMap 'primary'
A MultiAssayExperiment object of 2 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 2:
[1] RNAseq: ExpressionSet with 2 rows and 2 columns
[2] FMI: SummarizedExperiment with 3 rows and 1 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert ExperimentList into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
(if I used drop=TRUE
I'd also get the AllPatients
assay).
The only part missing is identifying the 'additional' records: the existing helper functions don't seem to have this use-case in mind, but I think this works, assuming the last assay is the one containing just patient data (hopefully I'm overlooking a simpler way):
in_at_least_one_assay <- function(x) {
suppressMessages(
assays(x) %>% # tidyr exports %>%
{length(.) - 1L} %>%
seq_len() %>%
sapply(function(a) rownames(colData(x[ , , a]))) %>%
unlist %>%
unique %>%
sort
)
}
not_in_assays <- function(x) {
setdiff(rownames(colData(x)), in_at_least_one_assay(x))
}
colData(mae[, not_in_assays(mae), ])
## returns just the record not featured in assays
Similarly of course for those who are assayed
colData(mae[, in_at_least_one_assay(mae), ])
## returns just the records which do appear in assays
Thanks for the speedy reply and very helpful workaround!
Hi Jonathan, @jonocarroll
You could identify the "additional" records with complete.cases
as long as all the other patients in the colData
have records across all assays. Otherwise, you'd also identify those patients who don't have a
record across all assays and it would be harder to tease those out.
Regards, Marcel
Does this function do what you want? It seems simpler to me, taking advantage of MAE's own default drop=TRUE
abilities:
in_at_least_one_assay <- function(x) {
suppressMessages(
x[, , seq(length(x))] %>% #corrected!
colData() %>%
rownames() %>%
sort()
)}
Thanks @lwaldron! That's much better. (replace mae
with x
to generalise, but yeah, much more succinct)
Hi @LiNk-NY; the samples themselves are entirely independent in this use-case - these are completely different assays. Not every patient is in every assay. The overlap between these is something we want to calculate easier too. I notice you have upsetSamples()
but would an upsetPatients()
be a welcome addition? If not already present, this would justify additional patient-level helpers such as (better named) in_at_least_one_assay()
above. I can't quite tell if upsetSamples()
is already supposed to be mimicking this behaviour with the idclip
argument; is the idea there that samples are correlated somehow by the sample IDs?
My use-case is all assays/phenodata from a single trial, so I'd like to generate the upset plot for "which patients appear in which assays?".
The upsetPatients()
(there's a pun in there about bad hospital food or mean nurses, I know it...) implementation wouldn't be too hard actually;
upsetPatients <- function(MultiAssayExperiment, nsets = length(MultiAssayExperiment),
nintersects = 24, order.by = "freq", ...) {
patient_vs_assay <- sampleMap(MultiAssayExperiment)[, c("assay", "primary")] %>%
as.data.frame() %>%
unique() %>%
tidyr::spread("assay", "assay")
## in case you ever import dplyr
# dplyr::mutate_at(., vars(-(primary)), funs(as.integer(!is.na(.))))
for (i in names(experiments(MultiAssayExperiment))) {
patient_vs_assay[, i] <- as.integer(!is.na(patient_vs_assay[, i]))
}
UpSetR::upset(patient_vs_assay, nsets = nsets, nintersects = nintersects,
sets = names(experiments(MultiAssayExperiment)), order.by = order.by, ...)
}
On my all-patients MAE:
upsetPatients(mae)
Neglecting the 'everyone' assay:
upsetPatients(mae[, , 1:2])
or a more standard example
library(MultiAssayExperiment)
library(tidyr) # for %>%
example("MultiAssayExperiment")
upsetPatients(myMultiAssayExperiment)
I'd be happy to PR it if this is of interest. No hassles if not.
Hi Jonathan, @jonocarroll
You might have an older version of MultiAssayExperiment
but yes, the idclip
or nameFilter
argument allows you to match sample names to patient identifiers when they are correlated such as nameFilter = function(x) substr(x, 1, 5)
in the case of TCGA data.
This argument is not necessary (it remains for aesthetics).
A more recent version of upsetSamples
makes use of the sample map to find intersections.
It does what you're describing here. Comparing all the patients across assays.
See data(miniACC); upsetSamples(miniACC)
.
Update: I guess I'm not seeing the difference between the proposed changed and upsetSamples
.
Best regards, Marcel
This is saying that 3 rownames(colData)
have measurements in all of the assays and only 1 "patient" has measurements those three assays.
upsetSamples(myMultiAssayExperiment)
Entirely possible; It seems I'm running 1.2.1 (as per Bioc 3.5) so I really should upgrade manually.
Testing out a local copy of that function; yes it produces the exact same result as mine. Cheers!!!
Thanks for opening up an issue!
I'm hoping I'll be able to contain all data relevant to a specific trial in a
MultiAssayExperiment
object and so far it's looking great. No more joining phenodata into each different assay type. What I'm hoping is also possible is storing phenodata for the entire patient list, even those who were not sequenced (e.g. early dropouts, withdrawn consents, failed pipelines). We have phenotypes for these patients but at the moment they don't appear in thesampleMap
since they have no assay.I entirely understand that samples not appearing in a subset should be removed (and I love that their phenodata is matched into the resulting assays) but is there a way (formal or workaround) to store and extract all phenodata?