waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/
70 stars 32 forks source link

Store and extract records not involved in any assay #239

Closed jonocarroll closed 6 years ago

jonocarroll commented 6 years ago

I'm hoping I'll be able to contain all data relevant to a specific trial in a MultiAssayExperiment object and so far it's looking great. No more joining phenodata into each different assay type. What I'm hoping is also possible is storing phenodata for the entire patient list, even those who were not sequenced (e.g. early dropouts, withdrawn consents, failed pipelines). We have phenotypes for these patients but at the moment they don't appear in the sampleMap since they have no assay.

I entirely understand that samples not appearing in a subset should be removed (and I love that their phenodata is matched into the resulting assays) but is there a way (formal or workaround) to store and extract all phenodata?

lwaldron commented 6 years ago

How about a workaround of creating a zero-row assay with a column for every patient? E.g.:

> mat <- matrix(nrow=0, ncol=10, dimnames = list(NULL, LETTERS[1:10]))
> MultiAssayExperiment(list(empty=mat))
A MultiAssayExperiment object of 1 listed
 experiment with a user-defined name and respective class. 
 Containing an ExperimentList class object of length 1: 
 [1] empty: matrix with 0 rows and 10 columns 
Features: 
 experiments() - obtain the ExperimentList instance 
 colData() - the primary/phenotype DataFrame 
 sampleMap() - the sample availability DataFrame 
 `$`, `[`, `[[` - extract colData columns, subset, or experiment 
 *Format() - convert into a long or wide DataFrame 
 assays() - convert ExperimentList to a SimpleList of matrices
> 

That's just a demo, I mean you would add a matrix like this to your ExperimentList. MultiAssayExperiment used to allow patients with no data, but Martin and Marcel cracked down on this liberalism a while back :). I think because a workaround like this is a more explicit representation and doesn't require new exceptions to be handled in the code.

lwaldron commented 6 years ago

One caveat here, when subsetting this object you would need to specify drop=FALSE to avoid losing those patients with no data, e.g.:

mae[, 1:5, , drop=FALSE]

which makes me rather wish we had made drop=FALSE the default...

jonocarroll commented 6 years ago

Trying it out now, I actually think drop=TRUE is okay; I don't want them in the individual assay subsets, just an overall availability.

The way I've set this up is to add a zero-row matrix assay with my patients as the rownames, then a mapping of patientID to itself

## define an empty matrix for storing _all_ patient IDs
allpatmat <- matrix(nrow = 0, 
                    ncol = nrow(phenodata), 
                    dimnames = list(NULL, rownames(phenodata)))

## map patient IDs to themselves
allpatmap <- data.frame(primary = rownames(phenodata),
                        assay = rownames(phenodata), stringsAsFactors = FALSE)

My colData slot contains all patients phenodata.

Requesting all patients works well as colData(mae). Requesting a single assay returns just the samples relevant to that assay (since I have sample IDs in the mapping, not patient IDs). Requesting a single patient gives me just the real assays they are mapped into, since the sample IDs are what's used to select.

I think this is what I was after. The additional assay doesn't seem to get in the way at all. I've named it "AllPatients" just in case

A MultiAssayExperiment object of 3 listed
 experiments with user-defined names and respective classes. 
 Containing an ExperimentList class object of length 3: 
 [1] RNAseq: ExpressionSet with 2 rows and 4 columns 
 [2] FMI: SummarizedExperiment with 3 rows and 3 columns 
 [3] AllPatients: matrix with 0 rows and 6 columns 
Features: 
 experiments() - obtain the ExperimentList instance 
 colData() - the primary/phenotype DataFrame 
 sampleMap() - the sample availability DataFrame 
 `$`, `[`, `[[` - extract colData columns, subset, or experiment 
 *Format() - convert ExperimentList into a long or wide DataFrame 
 assays() - convert ExperimentList to a SimpleList of matrices

Subsetting to just a few patients doesn't get contaminated with the AllPatients assay (PAT4 is not listed in any assay here)

mae[, c("PAT6", "PAT8"),]
harmonizing input:
  removing 8 sampleMap rows with 'colname' not in colnames of experiments
  removing 4 colData rownames not in sampleMap 'primary'
A MultiAssayExperiment object of 2 listed
 experiments with user-defined names and respective classes. 
 Containing an ExperimentList class object of length 2: 
 [1] RNAseq: ExpressionSet with 2 rows and 2 columns 
 [2] FMI: SummarizedExperiment with 3 rows and 1 columns 
Features: 
 experiments() - obtain the ExperimentList instance 
 colData() - the primary/phenotype DataFrame 
 sampleMap() - the sample availability DataFrame 
 `$`, `[`, `[[` - extract colData columns, subset, or experiment 
 *Format() - convert ExperimentList into a long or wide DataFrame 
 assays() - convert ExperimentList to a SimpleList of matrices

(if I used drop=TRUE I'd also get the AllPatients assay).

The only part missing is identifying the 'additional' records: the existing helper functions don't seem to have this use-case in mind, but I think this works, assuming the last assay is the one containing just patient data (hopefully I'm overlooking a simpler way):

in_at_least_one_assay <- function(x) {
  suppressMessages(
    assays(x) %>% # tidyr exports %>%
      {length(.) - 1L} %>% 
      seq_len() %>% 
      sapply(function(a) rownames(colData(x[ , , a]))) %>% 
      unlist %>% 
      unique %>% 
      sort 
  )
}

not_in_assays <- function(x) {
  setdiff(rownames(colData(x)), in_at_least_one_assay(x))
}

colData(mae[, not_in_assays(mae), ])
## returns just the record not featured in assays

Similarly of course for those who are assayed

colData(mae[, in_at_least_one_assay(mae), ])
## returns just the records which do appear in assays

Thanks for the speedy reply and very helpful workaround!

LiNk-NY commented 6 years ago

Hi Jonathan, @jonocarroll

You could identify the "additional" records with complete.cases as long as all the other patients in the colData have records across all assays. Otherwise, you'd also identify those patients who don't have a record across all assays and it would be harder to tease those out.

Regards, Marcel

lwaldron commented 6 years ago

Does this function do what you want? It seems simpler to me, taking advantage of MAE's own default drop=TRUE abilities:

in_at_least_one_assay <- function(x) {
  suppressMessages(
    x[, , seq(length(x))] %>%  #corrected!
      colData() %>%
      rownames() %>%
      sort()
)}
jonocarroll commented 6 years ago

Thanks @lwaldron! That's much better. (replace mae with x to generalise, but yeah, much more succinct)

Hi @LiNk-NY; the samples themselves are entirely independent in this use-case - these are completely different assays. Not every patient is in every assay. The overlap between these is something we want to calculate easier too. I notice you have upsetSamples() but would an upsetPatients()be a welcome addition? If not already present, this would justify additional patient-level helpers such as (better named) in_at_least_one_assay() above. I can't quite tell if upsetSamples() is already supposed to be mimicking this behaviour with the idclip argument; is the idea there that samples are correlated somehow by the sample IDs?

My use-case is all assays/phenodata from a single trial, so I'd like to generate the upset plot for "which patients appear in which assays?".

jonocarroll commented 6 years ago

The upsetPatients() (there's a pun in there about bad hospital food or mean nurses, I know it...) implementation wouldn't be too hard actually;

upsetPatients <- function(MultiAssayExperiment, nsets = length(MultiAssayExperiment), 
                          nintersects = 24, order.by = "freq", ...) {

  patient_vs_assay <- sampleMap(MultiAssayExperiment)[, c("assay", "primary")] %>% 
    as.data.frame() %>%
    unique() %>% 
    tidyr::spread("assay", "assay") 
  ## in case you ever import dplyr
  # dplyr::mutate_at(., vars(-(primary)), funs(as.integer(!is.na(.)))) 
  for (i in names(experiments(MultiAssayExperiment))) {
    patient_vs_assay[, i] <- as.integer(!is.na(patient_vs_assay[, i]))
  }
  UpSetR::upset(patient_vs_assay, nsets = nsets, nintersects = nintersects, 
        sets = names(experiments(MultiAssayExperiment)), order.by = order.by, ...)

}

On my all-patients MAE:

upsetPatients(mae)

image

Neglecting the 'everyone' assay:

upsetPatients(mae[, , 1:2])

image

or a more standard example

library(MultiAssayExperiment)
library(tidyr) # for %>%
example("MultiAssayExperiment")
upsetPatients(myMultiAssayExperiment)

image

I'd be happy to PR it if this is of interest. No hassles if not.

LiNk-NY commented 6 years ago

Hi Jonathan, @jonocarroll

You might have an older version of MultiAssayExperiment but yes, the idclip or nameFilter argument allows you to match sample names to patient identifiers when they are correlated such as nameFilter = function(x) substr(x, 1, 5) in the case of TCGA data.

This argument is not necessary (it remains for aesthetics). A more recent version of upsetSamples makes use of the sample map to find intersections. It does what you're describing here. Comparing all the patients across assays. See data(miniACC); upsetSamples(miniACC).

Update: I guess I'm not seeing the difference between the proposed changed and upsetSamples.

Best regards, Marcel

LiNk-NY commented 6 years ago

This is saying that 3 rownames(colData) have measurements in all of the assays and only 1 "patient" has measurements those three assays. rplot upsetSamples(myMultiAssayExperiment)

jonocarroll commented 6 years ago

Entirely possible; It seems I'm running 1.2.1 (as per Bioc 3.5) so I really should upgrade manually.

Testing out a local copy of that function; yes it produces the exact same result as mine. Cheers!!!

LiNk-NY commented 6 years ago

Thanks for opening up an issue!