waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/

*Format converts character assay data to factor #282

Closed by jonocarroll 3 years ago

jonocarroll commented 3 years ago

I have a use for character matrices in assay data (e.g. mutations) and notice that when using longFormat() or wideFormat(), these are converted to factor, e.g.

library(MultiAssayExperiment)

## build a toy example MAE
char_data <- matrix(letters[1:12], 4, 3)
rownames(char_data) <- paste0("feat", 1:4)
colnames(char_data) <- paste0("samp", 1:3)
se <- SummarizedExperiment(list(char_data = char_data))
mae <- MultiAssayExperiment(experiments = list(char_se = se))

## character data is output as factor
longFormat(mae)
#> DataFrame with 12 rows and 5 columns
#>           assay     primary     rowname     colname    value
#>     <character> <character> <character> <character> <factor>
#> 1       char_se       samp1       feat1       samp1        a
#> 2       char_se       samp1       feat2       samp1        b
#> 3       char_se       samp1       feat3       samp1        c
#> 4       char_se       samp1       feat4       samp1        d
#> 5       char_se       samp2       feat1       samp2        e
#> ...         ...         ...         ...         ...      ...
#> 8       char_se       samp2       feat4       samp2        h
#> 9       char_se       samp3       feat1       samp3        i
#> 10      char_se       samp3       feat2       samp3        j
#> 11      char_se       samp3       feat3       samp3        k
#> 12      char_se       samp3       feat4       samp3        l

The same happens with wideFormat():

wideFormat(mae)
#> DataFrame with 3 rows and 5 columns
#>       primary char_se_feat1 char_se_feat2 char_se_feat3 char_se_feat4
#>   <character>      <factor>      <factor>      <factor>      <factor>
#> 1       samp1             a             b             c             d
#> 2       samp2             e             f             g             h
#> 3       samp3             i             j             k             l

This isn't much of a concern in isolation, but it becomes a problem once the MultiAssayExperiment also contains a numeric matrix:

set.seed(1)
num_data <- matrix(runif(12), 4, 3)
rownames(num_data) <- paste0("feat", 1:4)
colnames(num_data) <- paste0("samp", 1:3)
se2 <- SummarizedExperiment(list(num_data = num_data))

mae2 <- MultiAssayExperiment(experiments = list(char_se = se, num_se = se2))

longFormat(mae2)
#> Warning in `[<-.factor`(`*tmp*`, ri, value = c(0.2655086631421,
#> 0.37212389963679, : invalid factor level, NA generated
#> DataFrame with 24 rows and 5 columns
#>           assay     primary     rowname     colname    value
#>     <character> <character> <character> <character> <factor>
#> 1       char_se       samp1       feat1       samp1        a
#> 2       char_se       samp1       feat2       samp1        b
#> 3       char_se       samp1       feat3       samp1        c
#> 4       char_se       samp1       feat4       samp1        d
#> 5       char_se       samp2       feat1       samp2        e
#> ...         ...         ...         ...         ...      ...
#> 20       num_se       samp2       feat4       samp2       NA
#> 21       num_se       samp3       feat1       samp3       NA
#> 22       num_se       samp3       feat2       samp3       NA
#> 23       num_se       samp3       feat3       samp3       NA
#> 24       num_se       samp3       feat4       samp3       NA

Here the "invalid factor levels" (the numeric data) are converted to NA.
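The warning itself can be reproduced in base R without any Bioconductor objects. The snippet below is only a minimal sketch of the general factor behaviour, not the actual code path inside longFormat(); the names f_num and ch are made up for illustration.

```r
## A factor only accepts values that match one of its existing levels
f_num <- factor(c("a", "b"))
f_num[3] <- 0.27   # warns "invalid factor level, NA generated"
is.na(f_num[3])    # the numeric value was replaced by NA

## A character vector accepts the same value and coerces it to a string
ch <- c("a", "b")
ch[3] <- 0.27
ch[3]
```

This is why a factor-typed value column loses the numeric assay, whereas a character-typed one keeps it as text.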

The problem does not occur if the character-assay experiment comes after the numeric one:

mae2_rev <- mae2
experiments(mae2_rev) <- rev(experiments(mae2))
longFormat(mae2_rev)
#> DataFrame with 24 rows and 5 columns
#>           assay     primary     rowname     colname             value
#>     <character> <character> <character> <character>       <character>
#> 1        num_se       samp1       feat1       samp1   0.2655086631421
#> 2        num_se       samp1       feat2       samp1  0.37212389963679
#> 3        num_se       samp1       feat3       samp1 0.572853363351896
#> 4        num_se       samp1       feat4       samp1 0.908207789994776
#> 5        num_se       samp2       feat1       samp2 0.201681931037456
#> ...         ...         ...         ...         ...               ...
#> 20      char_se       samp2       feat4       samp2                 h
#> 21      char_se       samp3       feat1       samp3                 i
#> 22      char_se       samp3       feat2       samp3                 j
#> 23      char_se       samp3       feat3       samp3                 k
#> 24      char_se       samp3       feat4       samp3                 l

This can be avoided by setting options(stringsAsFactors = FALSE), since the conversion occurs in a call to as.data.frame.matrix(). With that option set:

options(stringsAsFactors = FALSE)
longFormat(mae2)
#> DataFrame with 24 rows and 5 columns
#>           assay     primary     rowname     colname              value
#>     <character> <character> <character> <character>        <character>
#> 1       char_se       samp1       feat1       samp1                  a
#> 2       char_se       samp1       feat2       samp1                  b
#> 3       char_se       samp1       feat3       samp1                  c
#> 4       char_se       samp1       feat4       samp1                  d
#> 5       char_se       samp2       feat1       samp2                  e
#> ...         ...         ...         ...         ...                ...
#> 20       num_se       samp2       feat4       samp2  0.660797792486846
#> 21       num_se       samp3       feat1       samp3   0.62911404389888
#> 22       num_se       samp3       feat2       samp3 0.0617862704675645
#> 23       num_se       samp3       feat3       samp3  0.205974574899301
#> 24       num_se       samp3       feat4       samp3  0.176556752528995

However, this is less reproducible, as it relies on a global option being set.
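Until a fix is available, one way to limit the reach of the option is to set it only for the duration of a single call and restore the previous value afterwards. The helper below is a hypothetical sketch (with_character_strings is not part of MultiAssayExperiment or base R); it uses only options() and on.exit().

```r
## Hypothetical helper: evaluate `expr` with stringsAsFactors = FALSE,
## then restore whatever value the option had before
with_character_strings <- function(expr) {
  old <- options(stringsAsFactors = FALSE)
  on.exit(options(old), add = TRUE)
  force(expr)  # `expr` is lazily evaluated here, after the option is set
}

## The option is FALSE while the expression is evaluated
inside <- with_character_strings(getOption("stringsAsFactors"))
inside
```

With the objects from the example above, this would be used as with_character_strings(longFormat(mae2)). It still depends on the option, just in a more contained way.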

A preferred solution might be to add stringsAsFactors = FALSE to .longFormatANY() in https://github.com/waldronlab/MultiAssayExperiment/blob/c10ad5419cf22c8788f83c3a67e74a7ecc34c058/R/MultiAssayExperiment-helpers.R#L288
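The effect of that argument can be seen in base R on the same character matrix. This is only an illustration of as.data.frame() on a matrix with the argument passed explicitly, not the package's internal code; as_factor and as_char are names chosen here.

```r
char_data <- matrix(letters[1:12], 4, 3)

## Passing the argument explicitly makes the result independent of the global option
as_factor <- as.data.frame(char_data, stringsAsFactors = TRUE)
as_char   <- as.data.frame(char_data, stringsAsFactors = FALSE)

sapply(as_factor, class)  # every column is a factor
sapply(as_char, class)    # every column stays character
```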

The fact that the numeric data is converted to character is unavoidable when combined with character data in long-format, but it certainly shouldn't be converted to NA.

LiNk-NY commented 3 years ago

Thanks for the example. I've fixed this in 1.15.5