waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/

wideFormat Introduces Missing Data and Changes Data Dimensions #312

Closed DarioS closed 2 years ago

DarioS commented 2 years ago

wideFormat is introducing NAs and concatenating feature IDs with sample IDs, which creates a huge number of new features that do not exist in the input.

> measurements
A MultiAssayExperiment object of 1 listed
 experiment with a user-defined name and respective class.
 Containing an ExperimentList class object of length 1:
 [1] NanoString: matrix with 192 rows and 105 columns

 > sum(is.na(experiments(measurements)[["NanoString"]]))
[1] 0

> dataTable <- wideFormat(measurements, colDataCols = "class1vs4years", check.names = FALSE, collapse = ':')
> table(is.na(dataTable))
 FALSE   TRUE 
 20288 395004
 > dim(dataTable)
[1]   94 4418

> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8    LC_MONETARY=English_Australia.utf8
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.utf8

other attached packages:
 [1] MultiAssayExperiment_1.22.0 SummarizedExperiment_1.26.1 Biobase_2.56.0  etc.

The first set of columns seems to be OK, but the next set has column names made up of the feature IDs and sample IDs pasted together.

> colnames(dataTable)[190:199] # Last real column is the 194th because there are 192 features plus "primary" and "class1vs4years".
 [1] "NanoString:UCP2"            "NanoString:VCAM1"           "NanoString:XAF1"            "NanoString:XYLT1"           "NanoString:ZAP70"          
 [6] "NanoString:CENPB:120 rep1"  "NanoString:CTBP1:120 rep1"  "NanoString:GNB2L1:120 rep1" "NanoString:RERE:120 rep1"   "NanoString:SNRPD2:120 rep1"

LiNk-NY commented 2 years ago

Hi Dario, @DarioS

Thank you for reporting. Can you provide a reproducible example?

Best, Marcel

DarioS commented 2 years ago

Please run load(url("https://www.maths.usyd.edu.au/u/dario/measurements.RData")) and then the code above. The test file is 157 KB.

LiNk-NY commented 2 years ago

Hi Dario, @DarioS

Thanks for providing some data to work with. It turns out that there are replicates in the data:

anyReplicated(measurements)
#' NanoString
#'        TRUE

This means that you will have (Features x N) more columns in the wide table because of those replicates. Although there are a lot of missing values, every column in the table has some information:

table(vapply(dataTable, function(x) all(is.na(x)), logical(1L)))
#' FALSE 
#' 4418 
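
The effect described above can be reproduced on a toy example in base R. This is only a sketch of the reshaping idea, not the package's implementation: the assay name "Toy", the sample and patient IDs, and the column-naming scheme (copied from the column names shown earlier in the thread) are all assumptions made for illustration. The dimensions reported earlier are consistent with the same pattern: 4418 columns minus the 2 non-feature columns leaves 4416 = 192 x 23, that is, 23 sets of 192 feature columns.

```r
# Toy data: 2 features measured on 4 samples from 3 patients.
# Patient "P2" has two technical replicates ("S2 rep1" and "S2 rep2").
assay <- matrix(1:8, nrow = 2,
                dimnames = list(c("geneA", "geneB"),
                                c("S1", "S2 rep1", "S2 rep2", "S3")))
sampleMap <- data.frame(primary = c("P1", "P2", "P2", "P3"),
                        colname = colnames(assay),
                        stringsAsFactors = FALSE)

# Flag samples whose patient appears more than once in the sample map.
isRep <- sampleMap$primary %in% sampleMap$primary[duplicated(sampleMap$primary)]

# One row per patient, as in the wide table.
patients <- unique(sampleMap$primary)
wide <- data.frame(primary = patients, stringsAsFactors = FALSE)

# First set of columns: one per feature, filled from non-replicated samples.
for (f in rownames(assay)) {
  vals <- assay[f, !isRep]
  wide[[paste("Toy", f, sep = ":")]] <-
    unname(vals[match(patients, sampleMap$primary[!isRep])])
}

# One extra set of columns per replicated sample, named assay:feature:sample.
for (s in sampleMap$colname[isRep]) {
  owner <- sampleMap$primary[sampleMap$colname == s]
  for (f in rownames(assay)) {
    col <- rep(NA_integer_, length(patients))
    col[patients == owner] <- assay[f, s]
    wide[[paste("Toy", f, s, sep = ":")]] <- col
  }
}

sum(is.na(assay))  # 0: the input has no missing values
dim(wide)          # 3 rows, 7 columns
sum(is.na(wide))   # 10: every NA comes from the reshaping
```

Each replicate-specific column holds a value for only one patient, so it is never entirely NA, yet most of its cells are missing.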

To avoid this, it's best to remove or resolve replicates before converting with wideFormat.
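
One way to resolve replicates is to average them per patient before reshaping. The base R sketch below shows the idea on made-up data (the sample and patient IDs are hypothetical). MultiAssayExperiment also exports `mergeReplicates()` for this purpose; check its documentation for the exact arguments, since the code here does not call it.

```r
# Toy data: 2 features, 4 samples, 3 patients; "P2" has two replicates.
assay <- matrix(1:8, nrow = 2,
                dimnames = list(c("geneA", "geneB"),
                                c("S1", "S2 rep1", "S2 rep2", "S3")))
sampleMap <- data.frame(primary = c("P1", "P2", "P2", "P3"),
                        colname = colnames(assay),
                        stringsAsFactors = FALSE)

# Average the columns belonging to each patient (features stay in rows).
merged <- sapply(unique(sampleMap$primary), function(p) {
  rowMeans(assay[, sampleMap$primary == p, drop = FALSE])
})

# Reshape to one row per patient; no extra column sets are needed now.
wide <- data.frame(primary = colnames(merged), t(merged),
                   check.names = FALSE, row.names = NULL)

dim(wide)         # 3 rows, 3 columns
sum(is.na(wide))  # 0
```

Averaging is only one choice; keeping a single replicate per patient or taking the median are alternatives, depending on the assay.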

Best, Marcel

DarioS commented 2 years ago

Yes, but note that the missing values are created by the MultiAssayExperiment package and are not in the input data.

> any(is.na(measurements[["NanoString"]])) # None missing in input to wideFormat function.
[1] FALSE

The definition of wideFormat is

wideFormat: A function to return a wide DataFrame where each row represents an observation.

so, from my perspective as an end user, it should not fail in the way that it does. The documentation also says:

colData: Each row maps to zero or more observations in each experiment in the ExperimentList.

So an observation means each sample, not each patient. Yet wideFormat fails if technical replicates are present.

LiNk-NY commented 2 years ago

The data is reshaped, so I would not expect to see the same shape as in the original. I will update the documentation to make that clearer. Each row in the wide format corresponds to an ID in the colData rows (a patient). It doesn't fail when technical replicates are present; they are handled as properly as possible (by adding more sets of columns).

Note: FWIW, a sample to me is a measurement rather than an observation.

LiNk-NY commented 2 years ago

Re-opening for documentation changes