waldronlab / curatedMetagenomicData

Curated Metagenomic Data of the Human Microbiome
https://waldronlab.io/curatedMetagenomicData
Artistic License 2.0

rownames not matching in tse and assay when using rownames = NCBI #298

Closed sdgamboa closed 12 months ago

sdgamboa commented 1 year ago

It seems that the rownames of the TSE and the rownames stored in its assay don't match when using the rownames = 'NCBI' option in curatedMetagenomicData. I think this prevents tidySummarizedExperiment from automatically converting the object to a tibble: https://github.com/stemangiola/tidySummarizedExperiment/issues/70

suppressMessages({
    library(curatedMetagenomicData)
    library(tidySummarizedExperiment)
})
dataset_name <- "HallAB_2017.relative_abundance"
tse <- curatedMetagenomicData(
    pattern = dataset_name, 
    dryrun = FALSE, rownames = 'NCBI',
    counts = TRUE
)[[1]]
#> 
#> $`2021-10-14.HallAB_2017.relative_abundance`
#> dropping rows without rowTree matches:
#>   k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Atopobiaceae|g__Olsenella|s__Olsenella_profusa
#>   k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris
#>   k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis
#>   k__Bacteria|p__Firmicutes|c__Erysipelotrichia|o__Erysipelotrichales|f__Erysipelotrichaceae|g__Bulleidia|s__Bulleidia_extructa
#>   k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella|s__Sutterella_parvirubra
#>   k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Cloacibacillus|s__Cloacibacillus_evryensis
tse 
#> class: TreeSummarizedExperiment 
#> dim: 503 259 
#> metadata(1): agglomerated_by_rank
#> assays(1): relative_abundance
#> rownames(503): 853 820 ... 172901 1262744
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(259): p8582_mo1 p8582_mo10 ... SKST041_2_G103027
#>   SKST041_3_G103028
#> colData names(24): study_name subject_id ... HBI SCCAI
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (503 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL
class(tse)
#> [1] "TreeSummarizedExperiment"
#> attr(,"package")
#> [1] "TreeSummarizedExperiment"
head(rownames(assay(tse)))
#> [1] "853"    "820"    "301301" "28117"  "357276" "39491"
head(rownames(assay(tse, "relative_abundance", withDimnames = FALSE)))
#> [1] "1239_186801_186802_216572_216851_853" 
#> [2] "976_200643_171549_815_816_820"        
#> [3] "1239_186801_186802_186803_841_301301" 
#> [4] "976_200643_171549_171550_239759_28117"
#> [5] "976_200643_171549_815_909656_357276"  
#> [6] "1239_186801_186802_186803_NA_39491"
tidy_tse <- tidySummarizedExperiment::as_tibble(tse)
#> Error in `map2()`:
#> ℹ In index: 1.
#> ℹ With name: relative_abundance.
#> Caused by error in `.x[rownames(se), , drop = FALSE]`:
#> ! subscript out of bounds
#> Backtrace:
#>      ▆
#>   1. ├─tidySummarizedExperiment::as_tibble(tse)
#>   2. ├─tidySummarizedExperiment:::as_tibble.SummarizedExperiment(tse)
#>   3. │ └─tidySummarizedExperiment:::.as_tibble_optimised(...)
#>   4. │   └─tidySummarizedExperiment:::get_count_datasets(x)
#>   5. │     ├─... %>% ...
#>   6. │     └─purrr::map2(...)
#>   7. │       └─purrr:::map2_("list", .x, .y, .f, ..., .progress = .progress)
#>   8. │         ├─purrr:::with_indexed_errors(...)
#>   9. │         │ └─base::withCallingHandlers(...)
#>  10. │         ├─purrr:::call_with_cleanup(...)
#>  11. │         └─tidySummarizedExperiment (local) .f(.x[[i]], .y[[i]], ...)
#>  12. ├─purrr::when(...)
#>  13. ├─purrr::when(...)
#>  14. └─purrr (local) `<fn>`(`<sbscOOBE>`)
#>  15.   └─cli::cli_abort(...)
#>  16.     └─rlang::abort(...)
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16)
#>  os       Pop!_OS 22.04 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2023-08-15
#>  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package                  * version   date (UTC) lib source
#>  abind                      1.4-5     2016-07-21 [1] RSPM (R 4.3.1)
#>  AnnotationDbi              1.62.2    2023-07-02 [1] Bioconductor
#>  AnnotationHub              3.8.0     2023-04-25 [1] Bioconductor
#>  ape                        5.7-1     2023-03-13 [1] RSPM (R 4.3.1)
#>  beachmat                   2.16.0    2023-04-25 [1] Bioconductor
#>  beeswarm                   0.4.0     2021-06-01 [1] RSPM (R 4.3.1)
#>  Biobase                  * 2.60.0    2023-04-25 [1] Bioconductor
#>  BiocFileCache              2.8.0     2023-04-25 [1] Bioconductor
#>  BiocGenerics             * 0.46.0    2023-04-25 [1] Bioconductor
#>  BiocManager                1.30.22   2023-08-08 [1] RSPM (R 4.3.0)
#>  BiocNeighbors              1.18.0    2023-04-25 [1] Bioconductor
#>  BiocParallel               1.34.2    2023-05-22 [1] Bioconductor
#>  BiocSingular               1.16.0    2023-04-25 [1] Bioconductor
#>  BiocVersion                3.17.1    2022-11-04 [1] Bioconductor
#>  Biostrings               * 2.68.1    2023-05-16 [1] Bioconductor
#>  bit                        4.0.5     2022-11-15 [1] CRAN (R 4.3.1)
#>  bit64                      4.0.5     2020-08-30 [1] CRAN (R 4.3.1)
#>  bitops                     1.0-7     2021-04-24 [1] RSPM (R 4.3.1)
#>  blob                       1.2.4     2023-03-17 [1] CRAN (R 4.3.1)
#>  cachem                     1.0.8     2023-05-01 [1] CRAN (R 4.3.1)
#>  cli                        3.6.1     2023-03-23 [1] CRAN (R 4.3.1)
#>  cluster                    2.1.4     2022-08-22 [2] CRAN (R 4.3.1)
#>  codetools                  0.2-19    2023-02-01 [2] CRAN (R 4.3.1)
#>  colorspace                 2.1-0     2023-01-23 [1] CRAN (R 4.3.1)
#>  crayon                     1.5.2     2022-09-29 [1] CRAN (R 4.3.1)
#>  curatedMetagenomicData   * 3.8.0     2023-04-27 [1] Bioconductor
#>  curl                       5.0.2     2023-08-14 [1] RSPM (R 4.3.0)
#>  data.table                 1.14.8    2023-02-17 [1] CRAN (R 4.3.1)
#>  DBI                        1.1.3     2022-06-18 [1] CRAN (R 4.3.1)
#>  dbplyr                     2.3.3     2023-07-07 [1] CRAN (R 4.3.1)
#>  DECIPHER                   2.28.0    2023-04-25 [1] Bioconductor
#>  decontam                   1.20.0    2023-04-25 [1] Bioconductor
#>  DelayedArray               0.26.7    2023-07-28 [1] Bioconductor
#>  DelayedMatrixStats         1.22.5    2023-08-10 [1] Bioconductor
#>  digest                     0.6.33    2023-07-07 [1] CRAN (R 4.3.1)
#>  DirichletMultinomial       1.42.0    2023-04-25 [1] Bioconductor
#>  dplyr                      1.1.2     2023-04-20 [1] CRAN (R 4.3.1)
#>  ellipsis                   0.3.2     2021-04-29 [1] CRAN (R 4.3.1)
#>  evaluate                   0.21      2023-05-05 [1] CRAN (R 4.3.1)
#>  ExperimentHub              2.8.1     2023-07-12 [1] Bioconductor
#>  fansi                      1.0.4     2023-01-22 [1] CRAN (R 4.3.1)
#>  fastmap                    1.1.1     2023-02-24 [1] CRAN (R 4.3.1)
#>  filelock                   1.0.2     2018-10-05 [1] RSPM (R 4.3.1)
#>  fs                         1.6.3     2023-07-20 [1] CRAN (R 4.3.1)
#>  generics                   0.1.3     2022-07-05 [1] CRAN (R 4.3.1)
#>  GenomeInfoDb             * 1.36.1    2023-06-21 [1] Bioconductor
#>  GenomeInfoDbData           1.2.10    2023-08-08 [1] Bioconductor
#>  GenomicRanges            * 1.52.0    2023-04-25 [1] Bioconductor
#>  ggbeeswarm                 0.7.2     2023-04-29 [1] RSPM (R 4.3.1)
#>  ggplot2                    3.4.3     2023-08-14 [1] RSPM (R 4.3.0)
#>  ggrepel                    0.9.3     2023-02-03 [1] RSPM (R 4.3.1)
#>  glue                       1.6.2     2022-02-24 [1] CRAN (R 4.3.1)
#>  gridExtra                  2.3       2017-09-09 [1] RSPM (R 4.3.1)
#>  gtable                     0.3.3     2023-03-21 [1] CRAN (R 4.3.1)
#>  htmltools                  0.5.6     2023-08-10 [1] RSPM (R 4.3.0)
#>  htmlwidgets                1.6.2     2023-03-17 [1] CRAN (R 4.3.1)
#>  httpuv                     1.6.11    2023-05-11 [1] CRAN (R 4.3.1)
#>  httr                       1.4.6     2023-05-08 [1] CRAN (R 4.3.1)
#>  interactiveDisplayBase     1.38.0    2023-04-25 [1] Bioconductor
#>  IRanges                  * 2.34.1    2023-06-22 [1] Bioconductor
#>  irlba                      2.3.5.1   2022-10-03 [1] RSPM (R 4.3.1)
#>  jsonlite                   1.8.7     2023-06-29 [1] CRAN (R 4.3.1)
#>  KEGGREST                   1.40.0    2023-04-25 [1] Bioconductor
#>  knitr                      1.43      2023-05-25 [1] CRAN (R 4.3.1)
#>  later                      1.3.1     2023-05-02 [1] CRAN (R 4.3.1)
#>  lattice                    0.21-8    2023-04-05 [2] CRAN (R 4.3.1)
#>  lazyeval                   0.2.2     2019-03-15 [1] RSPM (R 4.3.1)
#>  lifecycle                  1.0.3     2022-10-07 [1] CRAN (R 4.3.1)
#>  magrittr                   2.0.3     2022-03-30 [1] CRAN (R 4.3.1)
#>  MASS                       7.3-60    2023-05-04 [2] CRAN (R 4.3.1)
#>  Matrix                     1.6-1     2023-08-14 [2] RSPM (R 4.3.0)
#>  MatrixGenerics           * 1.12.3    2023-07-30 [1] Bioconductor
#>  matrixStats              * 1.0.0     2023-06-02 [1] RSPM (R 4.3.1)
#>  memoise                    2.0.1     2021-11-26 [1] CRAN (R 4.3.1)
#>  mgcv                       1.9-0     2023-07-11 [2] RSPM (R 4.3.1)
#>  mia                        1.8.0     2023-04-25 [1] Bioconductor
#>  mime                       0.12      2021-09-28 [1] CRAN (R 4.3.1)
#>  MultiAssayExperiment       1.26.0    2023-04-25 [1] Bioconductor
#>  munsell                    0.5.0     2018-06-12 [1] CRAN (R 4.3.1)
#>  nlme                       3.1-163   2023-08-09 [2] RSPM (R 4.3.0)
#>  permute                    0.9-7     2022-01-27 [1] RSPM (R 4.3.1)
#>  pillar                     1.9.0     2023-03-22 [1] CRAN (R 4.3.1)
#>  pkgconfig                  2.0.3     2019-09-22 [1] CRAN (R 4.3.1)
#>  plotly                     4.10.2    2023-06-03 [1] RSPM (R 4.3.0)
#>  plyr                       1.8.8     2022-11-11 [1] RSPM (R 4.3.1)
#>  png                        0.1-8     2022-11-29 [1] RSPM (R 4.3.1)
#>  promises                   1.2.1     2023-08-10 [1] RSPM (R 4.3.0)
#>  purrr                      1.0.2     2023-08-10 [1] RSPM (R 4.3.0)
#>  R.cache                    0.16.0    2022-07-21 [1] RSPM (R 4.3.0)
#>  R.methodsS3                1.8.2     2022-06-13 [1] RSPM (R 4.3.0)
#>  R.oo                       1.25.0    2022-06-12 [1] RSPM (R 4.3.0)
#>  R.utils                    2.12.2    2022-11-11 [1] RSPM (R 4.3.0)
#>  R6                         2.5.1     2021-08-19 [1] CRAN (R 4.3.1)
#>  rappdirs                   0.3.3     2021-01-31 [1] CRAN (R 4.3.1)
#>  Rcpp                       1.0.11    2023-07-06 [1] CRAN (R 4.3.1)
#>  RCurl                      1.98-1.12 2023-03-27 [1] RSPM (R 4.3.1)
#>  reprex                     2.0.2     2022-08-17 [1] CRAN (R 4.3.1)
#>  reshape2                   1.4.4     2020-04-09 [1] RSPM (R 4.3.1)
#>  rlang                      1.1.1     2023-04-28 [1] CRAN (R 4.3.1)
#>  rmarkdown                  2.24      2023-08-14 [1] RSPM (R 4.3.0)
#>  RSQLite                    2.3.1     2023-04-03 [1] RSPM (R 4.3.1)
#>  rstudioapi                 0.15.0    2023-07-07 [1] CRAN (R 4.3.1)
#>  rsvd                       1.0.5     2021-04-16 [1] RSPM (R 4.3.1)
#>  S4Arrays                   1.0.5     2023-07-24 [1] Bioconductor
#>  S4Vectors                * 0.38.1    2023-05-02 [1] Bioconductor
#>  ScaledMatrix               1.8.1     2023-05-03 [1] Bioconductor
#>  scales                     1.2.1     2022-08-20 [1] CRAN (R 4.3.1)
#>  scater                     1.28.0    2023-04-25 [1] Bioconductor
#>  scuttle                    1.10.2    2023-08-03 [1] Bioconductor
#>  sessioninfo                1.2.2     2021-12-06 [1] CRAN (R 4.3.1)
#>  shiny                      1.7.5     2023-08-12 [1] RSPM (R 4.3.0)
#>  SingleCellExperiment     * 1.22.0    2023-04-25 [1] Bioconductor
#>  sparseMatrixStats          1.12.2    2023-07-02 [1] Bioconductor
#>  stringi                    1.7.12    2023-01-11 [1] CRAN (R 4.3.1)
#>  stringr                    1.5.0     2022-12-02 [1] CRAN (R 4.3.1)
#>  styler                     1.10.1    2023-06-05 [1] RSPM (R 4.3.0)
#>  SummarizedExperiment     * 1.30.2    2023-06-06 [1] Bioconductor
#>  tibble                     3.2.1     2023-03-20 [1] CRAN (R 4.3.1)
#>  tidyr                      1.3.0     2023-01-24 [1] CRAN (R 4.3.1)
#>  tidyselect                 1.2.0     2022-10-10 [1] CRAN (R 4.3.1)
#>  tidySummarizedExperiment * 1.10.0    2023-04-25 [1] Bioconductor
#>  tidytree                   0.4.5     2023-08-10 [1] RSPM (R 4.3.0)
#>  treeio                     1.24.3    2023-07-24 [1] Bioconductor
#>  TreeSummarizedExperiment * 2.8.0     2023-04-25 [1] Bioconductor
#>  utf8                       1.2.3     2023-01-31 [1] CRAN (R 4.3.1)
#>  vctrs                      0.6.3     2023-06-14 [1] CRAN (R 4.3.1)
#>  vegan                      2.6-4     2022-10-11 [1] RSPM (R 4.3.1)
#>  vipor                      0.4.5     2017-03-22 [1] RSPM (R 4.3.1)
#>  viridis                    0.6.4     2023-07-22 [1] RSPM (R 4.3.1)
#>  viridisLite                0.4.2     2023-05-02 [1] CRAN (R 4.3.1)
#>  withr                      2.5.0     2022-03-03 [1] CRAN (R 4.3.1)
#>  xfun                       0.40      2023-08-09 [1] RSPM (R 4.3.0)
#>  xtable                     1.8-4     2019-04-21 [1] CRAN (R 4.3.1)
#>  XVector                  * 0.40.0    2023-04-25 [1] Bioconductor
#>  yaml                       2.3.7     2023-01-23 [1] CRAN (R 4.3.1)
#>  yulab.utils                0.0.7     2023-08-09 [1] RSPM (R 4.3.0)
#>  zlibbioc                   1.46.0    2023-04-25 [1] Bioconductor
#> 
#>  [1] /home/user/R/x86_64-pc-linux-gnu-library/4.3
#>  [2] /home/user/apps/R-4.3.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Created on 2023-08-15 with reprex v2.0.2

sdgamboa commented 1 year ago

Related issue: https://github.com/stemangiola/tidySummarizedExperiment/pull/78

schifferl commented 12 months ago

I am not sure I fully understand the problem, @sdgamboa, but here is a quick workaround.

library(curatedMetagenomicData)

tse <- 
    curatedMetagenomicData(
        pattern = "HallAB_2017.relative_abundance", 
        dryrun = FALSE,
        counts = TRUE,
        rownames = "NCBI"
    )[[1L]]

rownames(assay(tse, withDimnames = FALSE)) <-
    rownames(assay(tse, withDimnames = TRUE))

tidy_tse <-
    tidySummarizedExperiment::as_tibble(tse)

If you can provide some additional explanation or point me to the error in the programming, I'd be happy to fix it.
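
For anyone trying to see the mismatch without downloading the dataset, here is a minimal sketch on a toy object. It assumes that renaming a SummarizedExperiment with rownames<- leaves the dimnames stored inside the assay untouched, which is what the reprex above suggests. The toy matrix, its row IDs, and the variable names are made up for illustration and are not taken from curatedMetagenomicData.

```r
library(SummarizedExperiment)

# Toy assay; the stored rownames stand in for the long lineage-style IDs
# (hypothetical values, not real curatedMetagenomicData output)
m <- matrix(
    1:6,
    nrow = 3,
    dimnames = list(c("1_2_853", "3_4_820", "5_6_301301"), c("s1", "s2"))
)
se <- SummarizedExperiment(assays = list(relative_abundance = m))

# Rename the container only; the assay keeps its own stored dimnames
rownames(se) <- c("853", "820", "301301")

# withDimnames = FALSE returns the assay as stored,
# withDimnames = TRUE applies the container's dimnames
stored <- rownames(assay(se, withDimnames = FALSE))
shown <- rownames(assay(se, withDimnames = TRUE))
mismatch_before <- !identical(stored, shown)

# Same workaround as above: overwrite the stored rownames
rownames(assay(se, withDimnames = FALSE)) <- rownames(se)
mismatch_after <- !identical(
    rownames(assay(se, withDimnames = FALSE)),
    rownames(se)
)
```

Comparing mismatch_before and mismatch_after is a quick way to check whether an object is affected before handing it to code that indexes the stored assay by rownames(se), as the failing call in the backtrace appears to do.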

schifferl commented 12 months ago

Never mind, I understand now. This is resolved in 086d953