Closed sdgamboa closed 11 months ago
Pasting the table again here with a shortened attribute source name (the Madin et al. URL is replaced with 'madin_et_al'). I hope this makes the table easier to read.
suppressMessages({
  library(bugphyzz)
  library(dplyr)
  library(purrr)
  phys <- physiologies(keyword = 'all', full_source = FALSE)
})
madin_source <- 'https://github.com/bacteria-archaea-traits/bacteria-archaea-traits/tree/master/output/prepared_data'
# Replace the long Madin et al. URL with a short label in every spreadsheet
phys <- phys |>
  map(~ {
    .x$Attribute_source <- ifelse(
      .x$Attribute_source == madin_source, 'madin_et_al', .x$Attribute_source
    )
    .x
  })
# Count rows per source, evidence code, and frequency in each spreadsheet
df <- phys |>
  map(~ count(.x, Attribute_source, Evidence, Frequency)) |>
  bind_rows(.id = 'spreadsheet') |>
  arrange(Attribute_source, Evidence, Frequency)
df
#> spreadsheet Attribute_source
#> 1 health associated Asnicar_2021
#> 2 shape BacMap
#> 3 isolation site BacMap
#> 4 growth temperature BacMap
#> 5 habitat BacMap
#> 6 disease association BacMap
#> 7 aerophilicity BacMap
#> 8 arrangement BacMap
#> 9 arrangement BacMap
#> 10 gram stain BacMap
#> 11 shape BacMap
#> 12 acetate producing Barcenilla_2000
#> 13 butyrate producing Barcenilla_2000
#> 14 hydrogen gas producing Barcenilla_2000
#> 15 lactate producing Barcenilla_2000
#> 16 aerophilicity Bergey's Manual
#> 17 arrangement Bergey's Manual
#> 18 gram stain Bergey's Manual
#> 19 length Bergey's Manual
#> 20 shape Bergey's Manual
#> 21 spore shape Bergey's Manual
#> 22 width Bergey's Manual
#> 23 sphingolipid producing Bergey's Manual
#> 24 aerophilicity Bergey's Manual
#> 25 arrangement Bergey's Manual
#> 26 gram stain Bergey's Manual
#> 27 length Bergey's Manual
#> 28 shape Bergey's Manual
#> 29 spore shape Bergey's Manual
#> 30 width Bergey's Manual
#> 31 gram stain Bergey's Manual
#> 32 habitat Browne_2021
#> 33 spore shape Browne_2021
#> 34 mutation rate per site per generation Gibson_2018
#> 35 mutation rates per site per year Gibson_2018
#> 36 sphingolipid producing HeaverS_2018
#> 37 habitat Hilt_2014
#> 38 genome size Kegg
#> 39 coding genes Kegg
#> 40 habitat MiDAS
#> 41 growth medium Microbial Fatty Acid Compositions
#> 42 growth temperature Microbial Fatty Acid Compositions
#> 43 sphingolipid producing OlsenI_2001
#> 44 sphingolipid producing OlsenI_2001
#> 45 antimicrobial resistance PATRIC
#> 46 aerophilicity ProTraits
#> 47 arrangement ProTraits
#> 48 gram stain ProTraits
#> 49 habitat ProTraits
#> 50 shape ProTraits
#> 51 biofilm forming The Microbe Directory
#> 52 gram stain The Microbe Directory
#> 53 growth temperature The Microbe Directory
#> 54 habitat The Microbe Directory
#> 55 antimicrobial sensitivity The Microbe Directory
#> 56 optimal ph The Microbe Directory
#> 57 extreme environment The Microbe Directory
#> 58 animal pathogen The Microbe Directory
#> 59 gram stain The Microbe Directory
#> 60 COGEM pathogenicity rating The Microbe Directory
#> 61 plant pathogenicity The Microbe Directory
#> 62 isolation site madin_et_al
#> 63 aerophilicity madin_et_al
#> Evidence Frequency n
#> 1 igc usually 30
#> 2 unknown 6
#> 3 exp always 99
#> 4 exp unknown 535
#> 5 exp unknown 1438
#> 6 exp usually 445
#> 7 unknown always 1304
#> 8 unknown sometimes 24
#> 9 unknown unknown 1047
#> 10 unknown unknown 1278
#> 11 unknown unknown 1273
#> 12 exp usually 24
#> 13 exp usually 24
#> 14 exp usually 14
#> 15 exp usually 15
#> 16 exp always 1229
#> 17 exp always 142
#> 18 exp always 1315
#> 19 exp always 43
#> 20 exp always 876
#> 21 exp always 50
#> 22 exp always 117
#> 23 exp always 7
#> 24 exp sometimes 24
#> 25 exp sometimes 610
#> 26 exp sometimes 16
#> 27 exp sometimes 620
#> 28 exp sometimes 948
#> 29 exp sometimes 97
#> 30 exp sometimes 734
#> 31 unknown always 1
#> 32 exp unknown 1429
#> 33 exp usually 1388
#> 34 exp sometimes 26
#> 35 exp sometimes 81
#> 36 exp always 4
#> 37 exp sometimes 9
#> 38 igc usually 4665
#> 39 igc usually 4669
#> 40 exp usually 12999
#> 41 exp always 254
#> 42 exp unknown 641
#> 43 exp always 8
#> 44 exp sometimes 1
#> 45 igc always 10311
#> 46 igc always 5076
#> 47 igc always 5280
#> 48 igc always 3058
#> 49 igc always 64153
#> 50 igc always 12615
#> 51 exp unknown 426
#> 52 exp unknown 15
#> 53 exp unknown 1347
#> 54 exp unknown 2354
#> 55 exp usually 832
#> 56 exp usually 886
#> 57 unknown always 1874
#> 58 unknown unknown 1416
#> 59 unknown unknown 2337
#> 60 unknown usually 1042
#> 61 unknown usually 1493
#> 62 exp always 5316
#> 63 unknown always 3837
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R Under development (unstable) (2022-12-25 r83502)
#> os Pop!_OS 22.04 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2023-02-02
#> pandoc 2.19.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> ape 5.6-2 2022-03-02 [1] CRAN (R 4.3.0)
#> assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.3.0)
#> bit 4.0.5 2022-11-15 [2] CRAN (R 4.3.0)
#> bit64 4.0.5 2020-08-30 [2] CRAN (R 4.3.0)
#> blob 1.2.3 2022-04-10 [2] CRAN (R 4.3.0)
#> bold 1.2.0 2021-05-11 [1] CRAN (R 4.3.0)
#> bugphyzz * 0.0.1.3 2023-02-02 [1] local
#> cachem 1.0.6 2021-08-19 [2] CRAN (R 4.3.0)
#> cli 3.6.0 2023-01-09 [1] CRAN (R 4.3.0)
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.3.0)
#> conditionz 0.1.0 2019-04-24 [1] CRAN (R 4.3.0)
#> crayon 1.5.2 2022-09-29 [2] CRAN (R 4.3.0)
#> crul 1.3 2022-09-03 [1] CRAN (R 4.3.0)
#> curl 5.0.0 2023-01-12 [2] CRAN (R 4.3.0)
#> data.table 1.14.6 2022-11-16 [2] CRAN (R 4.3.0)
#> DBI 1.1.3 2022-06-18 [2] CRAN (R 4.3.0)
#> dbplyr 2.3.0 2023-01-16 [2] CRAN (R 4.3.0)
#> digest 0.6.31 2022-12-11 [2] CRAN (R 4.3.0)
#> dplyr * 1.1.0 2023-01-29 [2] CRAN (R 4.3.0)
#> ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.3.0)
#> evaluate 0.20 2023-01-17 [2] CRAN (R 4.3.0)
#> fansi 1.0.4 2023-01-22 [2] CRAN (R 4.3.0)
#> fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.3.0)
#> foreach 1.5.2 2022-02-02 [1] CRAN (R 4.3.0)
#> fs 1.6.0 2023-01-23 [2] CRAN (R 4.3.0)
#> generics 0.1.3 2022-07-05 [2] CRAN (R 4.3.0)
#> glue 1.6.2 2022-02-24 [2] CRAN (R 4.3.0)
#> hms 1.1.2 2022-08-19 [2] CRAN (R 4.3.0)
#> hoardr 0.5.3 2023-01-26 [1] CRAN (R 4.3.0)
#> htmltools 0.5.4 2022-12-07 [2] CRAN (R 4.3.0)
#> httpcode 0.3.0 2020-04-10 [1] CRAN (R 4.3.0)
#> iterators 1.0.14 2022-02-05 [1] CRAN (R 4.3.0)
#> jsonlite 1.8.4 2022-12-06 [2] CRAN (R 4.3.0)
#> knitr 1.42 2023-01-25 [2] CRAN (R 4.3.0)
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.3.0)
#> lifecycle 1.0.3 2022-10-07 [2] CRAN (R 4.3.0)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.3.0)
#> memoise 2.0.1 2021-11-26 [2] CRAN (R 4.3.0)
#> mgsub 1.7.3 2021-07-28 [1] CRAN (R 4.3.0)
#> nlme 3.1-162 2023-01-31 [2] CRAN (R 4.3.0)
#> pillar 1.8.1 2022-08-19 [2] CRAN (R 4.3.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.3.0)
#> plyr 1.8.8 2022-11-11 [1] CRAN (R 4.3.0)
#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.0)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.3.0)
#> rappdirs 0.3.3 2021-01-31 [2] CRAN (R 4.3.0)
#> Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.3.0)
#> readr 2.1.3 2022-10-01 [2] CRAN (R 4.3.0)
#> reprex 2.0.2 2022-08-17 [2] CRAN (R 4.3.0)
#> reshape 0.8.9 2022-04-12 [1] CRAN (R 4.3.0)
#> rlang 1.0.6 2022-09-24 [2] CRAN (R 4.3.0)
#> rmarkdown 2.20 2023-01-19 [2] CRAN (R 4.3.0)
#> RSQLite 2.2.20 2022-12-22 [1] CRAN (R 4.3.0)
#> rstudioapi 0.14 2022-08-22 [2] CRAN (R 4.3.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
#> stringi 1.7.12 2023-01-11 [2] CRAN (R 4.3.0)
#> stringr 1.5.0 2022-12-02 [2] CRAN (R 4.3.0)
#> styler 1.9.0 2023-01-15 [1] CRAN (R 4.3.0)
#> taxize 0.9.100 2022-04-22 [1] CRAN (R 4.3.0)
#> taxizedb 0.3.0 2021-01-15 [1] CRAN (R 4.3.0)
#> tibble 3.1.8 2022-07-22 [2] CRAN (R 4.3.0)
#> tidyr 1.3.0 2023-01-24 [2] CRAN (R 4.3.0)
#> tidyselect 1.2.0 2022-10-10 [2] CRAN (R 4.3.0)
#> tzdb 0.3.0 2022-03-28 [2] CRAN (R 4.3.0)
#> utf8 1.2.3 2023-01-31 [2] CRAN (R 4.3.0)
#> uuid 1.1-0 2022-04-19 [2] CRAN (R 4.3.0)
#> vctrs 0.5.2 2023-01-23 [2] CRAN (R 4.3.0)
#> vroom 1.6.1 2023-01-22 [2] CRAN (R 4.3.0)
#> withr 2.5.0 2022-03-03 [2] CRAN (R 4.3.0)
#> xfun 0.37 2023-01-31 [2] CRAN (R 4.3.0)
#> xml2 1.3.3 2021-11-30 [2] CRAN (R 4.3.0)
#> yaml 2.3.7 2023-01-23 [2] CRAN (R 4.3.0)
#> zoo 1.8-11 2022-09-17 [1] CRAN (R 4.3.0)
#>
#> [1] /home/samuel/R/x86_64-pc-linux-gnu-library/4.3
#> [2] /home/samuel/Apps/R-devel/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Created on 2023-02-02 with reprex v2.0.2
So it's fine if the evidence codes and the frequencies don't line up with each other: they are independent variables. Does that answer the question? Thanks Samuel! @sdgamboa
But is it correct that the Microbe Directory has multiple values for both frequency and evidence? Can you point to where the Microbe Directory provided different types of evidence and different frequencies for different microbes?
Levi Waldron
The Microbe Directory has only one evidence type, NAS (non-traceable author statement). However, an attribute can occur at different frequencies. For example, size would be 'always', whereas optimal pH could be 'usually', since it fluctuates.
Samuel's question was "@kbeckenrode, would you help review that the source, evidence, and frequency are all correct? I see that the microbe directory sometimes has 'unknown' or 'exp' in the Evidence column and 'usually', 'unknown', and 'always' in the Frequency column. Is this correct? Some cells are empty and a similar pattern is found for other attribute sources.
Do we need a unit test for this? If so, what should it look like?"
I'm still unclear what your answer to this question is. It seems to me that your answer is that the evidence codes for the Microbe Directory were incorrect, and that for the other sources you haven't checked the output above for correctness.
For unit tests: frequency and evidence codes should be non-missing and from a list of allowable values. Evidence codes should come from the table of sources that provides "confidence in curation" instead of from the attribute sheets. The old column "confidence interval" should NOT be present.
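A rough sketch of what such a check could look like, in base R so it runs without the package. The function names (`check_column`, `validate_sheet`), the allow-lists, and the pattern used to detect the old "confidence interval" column are all assumptions for illustration: the allow-lists are simply the values visible in the output above, and whether 'unknown' should stay allowed is a separate decision.

```r
# Report missing and unexpected values in one column.
# Returns NULL when the column is clean.
check_column <- function(x, allowed, name) {
  missing <- is.na(x) | trimws(x) == ""
  invalid <- !missing & !(x %in% allowed)
  c(
    if (any(missing)) sprintf("%s: %d missing value(s)", name, sum(missing)),
    if (any(invalid)) sprintf(
      "%s: unexpected value(s): %s", name,
      paste(unique(x[invalid]), collapse = ", ")
    )
  )
}

# Hypothetical validator for one spreadsheet (a data frame).
# The default allow-lists are assumptions taken from the printed table.
validate_sheet <- function(
    df,
    allowed_evidence = c("exp", "igc", "unknown"),
    allowed_frequency = c("always", "usually", "sometimes", "unknown")) {
  absent <- setdiff(c("Evidence", "Frequency"), names(df))
  # Assumed naming of the old column; adjust the pattern to the real name.
  stale <- grep("confidence.?interval", names(df),
                ignore.case = TRUE, value = TRUE)
  as.character(c(
    if (length(absent) > 0)
      paste("missing column(s):", paste(absent, collapse = ", ")),
    if (length(stale) > 0)
      paste("obsolete column(s) present:", paste(stale, collapse = ", ")),
    check_column(df$Evidence, allowed_evidence, "Evidence"),
    check_column(df$Frequency, allowed_frequency, "Frequency")
  ))
}

# Toy data: 'good' passes; 'bad' has an empty Evidence cell and the old column.
good <- data.frame(
  Attribute_source = c("BacMap", "Kegg"),
  Evidence = c("exp", "igc"),
  Frequency = c("always", "usually"),
  stringsAsFactors = FALSE
)
bad <- data.frame(
  Attribute_source = "BacMap",
  Evidence = "",
  Frequency = "unknown",
  Confidence_interval = NA,
  stringsAsFactors = FALSE
)

good_problems <- validate_sheet(good)
bad_problems <- validate_sheet(bad)
print(bad_problems)
```

In the package test suite this could presumably be wrapped in testthat, for example by looping over the list returned by `physiologies(keyword = 'all', full_source = FALSE)` and calling `expect_length(validate_sheet(sheet), 0)` on each element.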
@lwaldron Ah ok, I understand now. I'll go ahead and double-check the frequency and evidence codes to make sure they are correct. I'll also make sure to remove the "confidence interval" column.
Evidence and confidence in curation are reported here: https://github.com/waldronlab/bugphyzz/blob/devel/inst/extdata/attribute_sources.tsv
Frequency values are reassessed in bugphyzzExports, so it should not be a problem.
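For illustration, here is a minimal sketch of taking the evidence code from a sources table rather than from the attribute sheets. The `sources` data frame is a toy stand-in for attribute_sources.tsv, its column names are assumptions that may not match the real file, and `assign_evidence` is a hypothetical helper, not a function in bugphyzz.

```r
# Toy stand-in for the sources table; column names and values are assumed.
sources <- data.frame(
  Attribute_source = c("BacMap", "Kegg"),
  Evidence = c("exp", "igc"),
  stringsAsFactors = FALSE
)

# Overwrite the sheet-level Evidence column with the value recorded for
# each source, and fail loudly if a source has no entry in the table.
assign_evidence <- function(sheet, sources) {
  idx <- match(sheet$Attribute_source, sources$Attribute_source)
  unmatched <- unique(sheet$Attribute_source[is.na(idx)])
  if (length(unmatched) > 0) {
    stop("No entry in the sources table for: ",
         paste(unmatched, collapse = ", "))
  }
  sheet$Evidence <- sources$Evidence[idx]
  sheet
}

# Toy sheet with inconsistent Evidence values for the same source.
sheet <- data.frame(
  Attribute_source = c("BacMap", "BacMap", "Kegg"),
  Evidence = c("", "unknown", "igc"),
  Frequency = c("unknown", "always", "usually"),
  stringsAsFactors = FALSE
)

fixed <- assign_evidence(sheet, sources)
print(fixed)
```

With this approach every row from a given source ends up with the same evidence code, and a source missing from the table raises an error instead of silently producing an empty cell. A "confidence in curation" column could be carried over the same way.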
@kbeckenrode, would you help review that the source, evidence, and frequency are all correct? I see that the microbe directory sometimes has 'unknown' or 'exp' in the Evidence column and 'usually', 'unknown', and 'always' in the Frequency column. Is this correct? Some cells are empty and a similar pattern is found for other attribute sources.
Do we need a unit test for this? If so, what should it look like?
Also, would you add the full reference of HeaverS_2018 and OlsenI_2001 in the 'full_source' column of this file: https://github.com/waldronlab/bugphyzz/blob/main/inst/extdata/attribute_sources.tsv?
I think the table is a little hard to read here because I added the full source. You will probably need to run the code I pasted here on your machine and view the table with View().