I have a use for character matrices in assay data (e.g. mutations) and noticed that longFormat() and wideFormat() convert these to factor, e.g.:
library(MultiAssayExperiment)
## build a toy example MAE
char_data <- matrix(letters[1:12], 4, 3)
rownames(char_data) <- paste0("feat", 1:4)
colnames(char_data) <- paste0("samp", 1:3)
se <- SummarizedExperiment(list(char_data = char_data))
mae <- MultiAssayExperiment(experiments = list(char_se = se))
## character data is output as factor
longFormat(mae)
#> DataFrame with 12 rows and 5 columns
#> assay primary rowname colname value
#> <character> <character> <character> <character> <factor>
#> 1 char_se samp1 feat1 samp1 a
#> 2 char_se samp1 feat2 samp1 b
#> 3 char_se samp1 feat3 samp1 c
#> 4 char_se samp1 feat4 samp1 d
#> 5 char_se samp2 feat1 samp2 e
#> ... ... ... ... ... ...
#> 8 char_se samp2 feat4 samp2 h
#> 9 char_se samp3 feat1 samp3 i
#> 10 char_se samp3 feat2 samp3 j
#> 11 char_se samp3 feat3 samp3 k
#> 12 char_se samp3 feat4 samp3 l
Similarly for wideFormat():
wideFormat(mae)
#> DataFrame with 3 rows and 5 columns
#> primary char_se_feat1 char_se_feat2 char_se_feat3 char_se_feat4
#> <character> <factor> <factor> <factor> <factor>
#> 1 samp1 a b c d
#> 2 samp2 e f g h
#> 3 samp3 i j k l
This isn't much of a concern in isolation, but it becomes one as soon as a numeric matrix is included as well:
set.seed(1)
num_data <- matrix(runif(12), 4, 3)
rownames(num_data) <- paste0("feat", 1:4)
colnames(num_data) <- paste0("samp", 1:3)
se2 <- SummarizedExperiment(list(num_data = num_data))
mae2 <- MultiAssayExperiment(experiments = list(char_se = se, num_se = se2))
longFormat(mae2)
#> Warning in `[<-.factor`(`*tmp*`, ri, value = c(0.2655086631421,
#> 0.37212389963679, : invalid factor level, NA generated
#> DataFrame with 24 rows and 5 columns
#> assay primary rowname colname value
#> <character> <character> <character> <character> <factor>
#> 1 char_se samp1 feat1 samp1 a
#> 2 char_se samp1 feat2 samp1 b
#> 3 char_se samp1 feat3 samp1 c
#> 4 char_se samp1 feat4 samp1 d
#> 5 char_se samp2 feat1 samp2 e
#> ... ... ... ... ... ...
#> 20 num_se samp2 feat4 samp2 NA
#> 21 num_se samp3 feat1 samp3 NA
#> 22 num_se samp3 feat2 samp3 NA
#> 23 num_se samp3 feat3 samp3 NA
#> 24 num_se samp3 feat4 samp3 NA
where the "invalid factor levels" (the numeric data) are converted to NA.
This can be circumvented by placing the character-assay experiment after the numeric one.
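The NA behaviour can be reproduced in base R without any MultiAssayExperiment objects. The sketch below is only an illustration of the mechanism suggested by the `[<-.factor` warning above; the variable names and the two numeric values are made up for the example.

```r
# A factor only accepts values that are already among its levels.
as_factor <- factor(c("a", "b", "c"))
# Numbers are not valid levels here, so this warns
# "invalid factor level, NA generated" and stores NA instead.
as_factor[4:5] <- c(0.27, 0.37)

# A plain character vector coerces the numbers to strings instead.
as_char <- c("a", "b", "c")
as_char[4:5] <- c(0.27, 0.37)

is.na(as_factor)  # the two appended entries are NA
is.na(as_char)    # nothing is lost
```

This is the difference between what happens now (values lost) and what one would expect (values coerced to character).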
The fact that the numeric data is converted to character is unavoidable when combined with character data in long-format, but it certainly shouldn't be converted to NA.
This can be avoided by setting options(stringsAsFactors = FALSE), since the conversion occurs in as.data.frame.matrix(), although this then becomes less reproducible as it relies on an option being set.
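To make the role of stringsAsFactors concrete, here is a minimal base R sketch that passes the argument explicitly to as.data.frame() on the same toy matrix, rather than relying on the global option. The object names df_factor and df_char are made up for the example. Passing the argument explicitly also keeps the result independent of the R version, since the default changed to FALSE in R 4.0.0.

```r
# Same toy character matrix as in the reprex
char_data <- matrix(letters[1:12], 4, 3,
                    dimnames = list(paste0("feat", 1:4), paste0("samp", 1:3)))

# Character columns become factors when stringsAsFactors = TRUE ...
df_factor <- as.data.frame(char_data, stringsAsFactors = TRUE)
# ... and stay character when stringsAsFactors = FALSE
df_char <- as.data.frame(char_data, stringsAsFactors = FALSE)

sapply(df_factor, class)
sapply(df_char, class)
```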
A preferred solution might be to add stringsAsFactors = FALSE to .longFormatANY() (https://github.com/waldronlab/MultiAssayExperiment/blob/c10ad5419cf22c8788f83c3a67e74a7ecc34c058/R/MultiAssayExperiment-helpers.R#L288).
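As a rough illustration of the behaviour I would expect, the base R sketch below stacks a character and a numeric matrix in long form with stringsAsFactors = FALSE. The helper to_long() is hypothetical and is not the package's .longFormatANY(); it only mimics the general shape of the long-format output.

```r
# Hypothetical helper (NOT the package implementation): reshape one matrix
# into long form, keeping the values as a plain vector rather than a factor
to_long <- function(mat, assay) {
  data.frame(
    assay = assay,
    rowname = rep(rownames(mat), times = ncol(mat)),
    colname = rep(colnames(mat), each = nrow(mat)),
    value = as.vector(mat),
    stringsAsFactors = FALSE
  )
}

dn <- list(paste0("feat", 1:4), paste0("samp", 1:3))
char_data <- matrix(letters[1:12], 4, 3, dimnames = dn)
set.seed(1)
num_data <- matrix(runif(12), 4, 3, dimnames = dn)

# Character assay first, numeric assay second (the problematic order above)
long <- rbind(to_long(char_data, "char_se"), to_long(num_data, "num_se"))

class(long$value)  # numeric values are coerced to character
anyNA(long$value)  # and none of them are lost
```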