mikemc / speedyseq

Speedy versions of phyloseq functions
https://mikemc.github.io/speedyseq/

Apply consistent rules about allowed sample names? #53

Open mikemc opened 4 years ago

mikemc commented 4 years ago

Various phyloseq methods seem to handle sample names inconsistently. For example,

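library(phyloseq)
library(magrittr)  # provides %>%

# Same data, three different sets of row names
x <- y <- z <- data.frame(var1 = letters[1:3], var2 = 7:9)
rownames(x) <- c("1", "2", "3")   # exactly as.character(1:3)
rownames(y) <- c("s1", "2", "3")  # mixed names
rownames(z) <- c("3", "2", "1")   # numeric strings, reversed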
sample_data(x)
#>     var1 var2
#> sa1    a    7
#> sa2    b    8
#> sa3    c    9
sample_data(y)
#>    var1 var2
#> s1    a    7
#> 2     b    8
#> 3     c    9
sample_data(y) %>% prune_samples(c("2", "3"), .)
#>   var1 var2
#> 2    b    8
#> 3    c    9
sample_data(z)
#>   var1 var2
#> 3    a    7
#> 2    b    8
#> 1    c    9

In addition, some phyloseq functions prefix numerical sample names with an "X", as make.names() would do. This happens in the results of diversity().
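
For reference, here is a small base-R sketch of the make.names() behavior mentioned above. It does not call phyloseq at all; the variable names (`ids`, `fixed`, `checked`, `unchecked`) are made up for illustration, and it only shows how names that start with a digit come to be prefixed with "X".

```r
# Names starting with a digit are not syntactically valid R names,
# so make.names() prefixes them with "X". Valid names are left alone.
ids <- c("1", "2", "s1")
fixed <- make.names(ids)
fixed
#> [1] "X1" "X2" "s1"

# data.frame() applies the same fix to column names by default
# (check.names = TRUE), which is one way the prefix can sneak in.
checked   <- names(data.frame(`1` = 1:2, check.names = TRUE))
unchecked <- names(data.frame(`1` = 1:2, check.names = FALSE))
checked
#> [1] "X1"
unchecked
#> [1] "1"
```

Whether a given phyloseq function goes through this exact path is an assumption here; the sketch only shows where an "X" prefix of this kind typically comes from in R.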

mikemc commented 4 years ago

It looks like only the first case results in dummy sample names because phyloseq checks whether the row names equal as.character(1:n); if so, it decides that sample names are missing and sets the names to "sa1", "sa2", etc.

See https://github.com/joey711/phyloseq/blob/dc35470498c79284231d41d1add1a74940f51fb7/R/sampleData-class.R#L60
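
The check described above can be mimicked in base R. This is only a sketch of the kind of test involved, not the phyloseq source; `has_default_row_names()` is a hypothetical helper that exists in neither phyloseq nor speedyseq.

```r
# Hypothetical helper: TRUE when the row names are exactly "1", "2", ..., "n",
# i.e. indistinguishable from the automatic row names of a data frame.
has_default_row_names <- function(df) {
  identical(rownames(df), as.character(seq_len(nrow(df))))
}

x <- y <- z <- data.frame(var1 = letters[1:3], var2 = 7:9)
rownames(x) <- c("1", "2", "3")
rownames(y) <- c("s1", "2", "3")
rownames(z) <- c("3", "2", "1")

has_default_row_names(x)
#> [1] TRUE
has_default_row_names(y)
#> [1] FALSE
has_default_row_names(z)
#> [1] FALSE
```

Only `x` matches, which is consistent with it being the only case that gets renamed to "sa1", "sa2", "sa3".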

mikemc commented 4 years ago

This also pops up when subsetting, e.g.

sam <- tibble::tibble(
  sample_id = c(letters[1:3], 1:3),
  var = c(rep("a", 3), rep("b", 3))
) %>% sample_data()
sam
#>   var
#> a   a
#> b   a
#> c   a
#> 1   b
#> 2   b
#> 3   b
sam %>% subset_samples(var == "b")
#>     var
#> sa1   b
#> sa2   b
#> sa3   b
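
A plausible explanation for the subsetting case, sketched in base R without phyloseq: after the filter, the surviving row names happen to be exactly "1", "2", "3", so they look like automatic row names. The data frame below is a stand-in for the sample data above, and the names `sub` and `looks_default` are made up for illustration.

```r
# Stand-in for the sample data: row names "a", "b", "c", "1", "2", "3"
sam_df <- data.frame(
  var = c(rep("a", 3), rep("b", 3)),
  row.names = c(letters[1:3], 1:3)
)

# Keep only the rows with var == "b"; row names are carried over unchanged
sub <- sam_df[sam_df$var == "b", , drop = FALSE]
rownames(sub)
#> [1] "1" "2" "3"

# The remaining names now equal as.character(1:n) for the subset
looks_default <- identical(rownames(sub), as.character(seq_len(nrow(sub))))
looks_default
#> [1] TRUE
```

The full table does not trigger the renaming because its row names start with "a", "b", "c"; only the subset does.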