waldronlab / curatedMetagenomicDataCuration

Sample Metadata Curation for curatedMetagenomicData
https://waldronlab.io/curatedMetagenomicDataCuration/

age doesn't correspond to age_category in some samples #65

Closed sdgamboa closed 2 years ago

sdgamboa commented 2 years ago

Some samples are annotated with an age that doesn't correspond to their age_category. For example, one sample is labeled as "newborn" and has an age of 19.

Reprex:

suppressMessages({
    library(curatedMetagenomicData)
    library(dplyr)
    library(ggplot2)
})

data <- sampleMetadata %>% 
    filter(!is.na(age), !is.na(age_category)) %>% 
    as_tibble()
unique(data$age_category)
#> [1] "newborn"   "child"     "adult"     "schoolage" "senior"
data$age_category <- factor(
    data$age_category,
    levels = c("newborn", "child", "schoolage", "adult", "senior")
)
ggplot(data, aes(age_category, age)) +
    geom_point(position = "jitter", size = 1)

## samples labeled as "newborn" with age above 1
data[data$age_category == "newborn" & data$age > 1, ] %>%
    select(study_name, sample_id, subject_id, age_category, age)
#> # A tibble: 1 × 5
#>   study_name sample_id                          subject_id age_category   age
#>   <chr>      <chr>                              <chr>      <fct>        <int>
#> 1 ChuDM_2017 NCS-049-Stool-maternal3_microbiome NCS049_mo  newborn         19

## samples labeled as "child" with age above 11
data[data$age_category == "child" & data$age > 11, ] %>% 
    select(study_name, sample_id, subject_id, age_category, age)
#> # A tibble: 1 × 5
#>   study_name     sample_id  subject_id age_category   age
#>   <chr>          <chr>      <chr>      <fct>        <int>
#> 1 PehrssonE_2016 SID03C_000 SID03C     child           19

## samples labeled as "adult" with age above 65
data[data$age_category == "adult" & data$age > 65, ] %>%
    select(study_name, sample_id, subject_id, age_category, age)
#> # A tibble: 19 × 5
#>    study_name      sample_id subject_id          age_category   age
#>    <chr>           <chr>     <chr>               <fct>        <int>
#>  1 HanniganGD_2017 MG100206  HanniganGD_2017_A27 adult           68
#>  2 HanniganGD_2017 MG100205  HanniganGD_2017_A26 adult           80
#>  3 HanniganGD_2017 MG100203  HanniganGD_2017_A24 adult           67
#>  4 HanniganGD_2017 MG100201  HanniganGD_2017_A22 adult           68
#>  5 HanniganGD_2017 MG100193  HanniganGD_2017_A14 adult           72
#>  6 HanniganGD_2017 MG100191  HanniganGD_2017_A12 adult           82
#>  7 HanniganGD_2017 MG100188  HanniganGD_2017_A09 adult           73
#>  8 HanniganGD_2017 MG100187  HanniganGD_2017_A07 adult           71
#>  9 HanniganGD_2017 MG100181  HanniganGD_2017_A01 adult           76
#> 10 HanniganGD_2017 MG100171  HanniganGD_2017_H20 adult           75
#> 11 HanniganGD_2017 MG100158  HanniganGD_2017_H06 adult           75
#> 12 HanniganGD_2017 MG100154  HanniganGD_2017_H02 adult           69
#> 13 HanniganGD_2017 MG100150  HanniganGD_2017_C28 adult           73
#> 14 HanniganGD_2017 MG100142  HanniganGD_2017_C19 adult           71
#> 15 HanniganGD_2017 MG100133  HanniganGD_2017_C10 adult           69
#> 16 HanniganGD_2017 MG100130  HanniganGD_2017_C06 adult           69
#> 17 HanniganGD_2017 MG100128  HanniganGD_2017_C04 adult           88
#> 18 HanniganGD_2017 MG100127  HanniganGD_2017_C03 adult           69
#> 19 HanniganGD_2017 MG100125  HanniganGD_2017_C01 adult           70

## samples labeled as "senior" with age < 66
data[data$age_category == "senior" & data$age < 66, ] %>% 
    select(study_name, sample_id, subject_id, age_category, age)
#> # A tibble: 1 × 5
#>   study_name         sample_id               subject_id       age_category   age
#>   <chr>              <chr>                   <chr>            <fct>        <int>
#> 1 LifeLinesDeep_2016 EGAR00001420886_900200… sub_90020000014… senior          19

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/usuario/Apps/R-4.1.1/lib/libRblas.so
#> LAPACK: /home/usuario/Apps/R-4.1.1/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] ggplot2_3.3.5                  dplyr_1.0.7                   
#>  [3] curatedMetagenomicData_3.3.3   TreeSummarizedExperiment_2.2.0
#>  [5] Biostrings_2.62.0              XVector_0.34.0                
#>  [7] SingleCellExperiment_1.16.0    SummarizedExperiment_1.24.0   
#>  [9] Biobase_2.54.0                 GenomicRanges_1.46.1          
#> [11] GenomeInfoDb_1.30.0            IRanges_2.28.0                
#> [13] S4Vectors_0.32.3               BiocGenerics_0.40.0           
#> [15] MatrixGenerics_1.6.0           matrixStats_0.61.0            
#> 
#> loaded via a namespace (and not attached):
#>   [1] backports_1.4.1               AnnotationHub_3.2.0          
#>   [3] BiocFileCache_2.2.0           plyr_1.8.6                   
#>   [5] lazyeval_0.2.2                splines_4.1.1                
#>   [7] BiocParallel_1.28.3           scater_1.22.0                
#>   [9] digest_0.6.29                 yulab.utils_0.0.4            
#>  [11] htmltools_0.5.2               viridis_0.6.2                
#>  [13] fansi_1.0.0                   magrittr_2.0.1               
#>  [15] memoise_2.0.1                 ScaledMatrix_1.2.0           
#>  [17] cluster_2.1.2                 DECIPHER_2.22.0              
#>  [19] R.utils_2.11.0                colorspace_2.0-2             
#>  [21] blob_1.2.2                    rappdirs_0.3.3               
#>  [23] ggrepel_0.9.1                 xfun_0.29                    
#>  [25] crayon_1.4.2                  RCurl_1.98-1.5               
#>  [27] jsonlite_1.7.2                ape_5.6-1                    
#>  [29] glue_1.6.0                    gtable_0.3.0                 
#>  [31] zlibbioc_1.40.0               DelayedArray_0.20.0          
#>  [33] R.cache_0.15.0                BiocSingular_1.10.0          
#>  [35] scales_1.1.1                  DBI_1.1.2                    
#>  [37] Rcpp_1.0.8                    viridisLite_0.4.0            
#>  [39] xtable_1.8-4                  decontam_1.14.0              
#>  [41] tidytree_0.3.7                bit_4.0.4                    
#>  [43] rsvd_1.0.5                    httr_1.4.2                   
#>  [45] ellipsis_0.3.2                pkgconfig_2.0.3              
#>  [47] R.methodsS3_1.8.1             farver_2.1.0                 
#>  [49] scuttle_1.4.0                 dbplyr_2.1.1                 
#>  [51] utf8_1.2.2                    tidyselect_1.1.1             
#>  [53] labeling_0.4.2                rlang_0.4.12                 
#>  [55] reshape2_1.4.4                later_1.3.0                  
#>  [57] AnnotationDbi_1.56.2          munsell_0.5.0                
#>  [59] BiocVersion_3.14.0            tools_4.1.1                  
#>  [61] cachem_1.0.6                  cli_3.1.0                    
#>  [63] DirichletMultinomial_1.36.0   generics_0.1.1               
#>  [65] RSQLite_2.2.9                 ExperimentHub_2.2.0          
#>  [67] mia_1.2.3                     evaluate_0.14                
#>  [69] stringr_1.4.0                 fastmap_1.1.0                
#>  [71] yaml_2.2.1                    knitr_1.37                   
#>  [73] bit64_4.0.5                   fs_1.5.2                     
#>  [75] purrr_0.3.4                   KEGGREST_1.34.0              
#>  [77] nlme_3.1-153                  sparseMatrixStats_1.6.0      
#>  [79] mime_0.12                     R.oo_1.24.0                  
#>  [81] rstudioapi_0.13               compiler_4.1.1               
#>  [83] beeswarm_0.4.0                filelock_1.0.2               
#>  [85] curl_4.3.2                    png_0.1-7                    
#>  [87] interactiveDisplayBase_1.32.0 reprex_2.0.1                 
#>  [89] treeio_1.18.1                 tibble_3.1.6                 
#>  [91] stringi_1.7.6                 highr_0.9                    
#>  [93] lattice_0.20-45               Matrix_1.4-0                 
#>  [95] styler_1.6.2                  vegan_2.5-7                  
#>  [97] permute_0.9-5                 vctrs_0.3.8                  
#>  [99] pillar_1.6.4                  lifecycle_1.0.1              
#> [101] BiocManager_1.30.16           BiocNeighbors_1.12.0         
#> [103] bitops_1.0-7                  irlba_2.3.5                  
#> [105] httpuv_1.6.5                  R6_2.5.1                     
#> [107] promises_1.2.0.1              gridExtra_2.3                
#> [109] vipor_0.4.5                   MASS_7.3-54                  
#> [111] assertthat_0.2.1              withr_2.4.3                  
#> [113] GenomeInfoDbData_1.2.7        mgcv_1.8-38                  
#> [115] parallel_4.1.1                MultiAssayExperiment_1.20.0  
#> [117] grid_4.1.1                    beachmat_2.10.0              
#> [119] tidyr_1.1.4                   rmarkdown_2.11               
#> [121] DelayedMatrixStats_1.16.0     shiny_1.7.1                  
#> [123] ggbeeswarm_0.6.0

Created on 2022-01-14 by the reprex package (v2.0.1)
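
The four per-category checks above could also be run in one pass. The following is only a minimal sketch in base R, not the curation team's method: `flag_age_mismatch()` and `age_bounds` are hypothetical names, and the age cut-offs are assumptions inferred from the checks in the reprex (the lower bounds for "schoolage" and "adult" in particular are guesses, not taken from the package documentation). The toy data frame reuses four rows reported above, plus one made-up consistent row for contrast.

```r
# Assumed age bounds in years per category. These cut-offs are inferred from
# the reprex checks and are NOT taken from the package documentation.
age_bounds <- data.frame(
    age_category = c("newborn", "child", "schoolage", "adult", "senior"),
    min_age = c(0, 1, 12, 19, 66),
    max_age = c(1, 11, 18, 65, Inf)
)

# Hypothetical helper: return the rows whose age falls outside the assumed
# bounds of their age_category. Rows with a missing age or category are
# dropped, and categories absent from `bounds` are not flagged.
flag_age_mismatch <- function(meta, bounds = age_bounds) {
    meta <- meta[!is.na(meta$age) & !is.na(meta$age_category), , drop = FALSE]
    idx <- match(as.character(meta$age_category), bounds$age_category)
    out_of_range <- meta$age < bounds$min_age[idx] | meta$age > bounds$max_age[idx]
    meta[which(out_of_range), , drop = FALSE]
}

# Toy input: the first four rows come from the output above; the last row
# ("example_study") is invented purely to show a consistent sample.
toy <- data.frame(
    study_name = c("ChuDM_2017", "PehrssonE_2016", "HanniganGD_2017",
                   "LifeLinesDeep_2016", "example_study"),
    age_category = c("newborn", "child", "adult", "senior", "adult"),
    age = c(19L, 19L, 68L, 19L, 30L)
)

flagged <- flag_age_mismatch(toy)
print(flagged)
```

On the real metadata, the same helper would presumably be called as `flag_age_mismatch(sampleMetadata)` after loading curatedMetagenomicData. Keeping the bounds in a table rather than in separate filters makes it easy to adjust the cut-offs once the intended definitions are confirmed.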

lwaldron commented 2 years ago

Pinging @paolinomanghi - would be great to fix this before the Bioconductor 3.15 release.