ropensci / taxizedb

Tools for Working with Taxonomic SQL Databases

`taxize` and `taxizedb` return different `tibble` structures with `children` when `db = "itis"` #78

Open KaiAragaki opened 5 months ago

KaiAragaki commented 5 months ago

Happy to make a PR to attempt to fix this if you'd like, but since this could be a breaking change I thought I'd get your eyes on it first.

I suppose that since taxizedb is a drop-in replacement for taxize, it would probably be best to conform to whatever taxize returns - but that is obviously your call.

taxize

```r
taxize::children(145395, db = "itis")
$`145395`
# A tibble: 2 × 5
  parentname  parenttsn rankname taxonname              tsn   
  <chr>       <chr>     <chr>    <chr>                  <chr> 
1 Toropamecia 145395    Species  Toropamecia punctata   145396
2 Toropamecia 145395    Species  Toropamecia reticulata 145398

attr(,"class")
[1] "children"
attr(,"db")
[1] "itis"
```

taxizedb

```r
taxizedb::children(145395, db = "itis")
$`145395`
# A tibble: 2 × 4
      id rank_id name                   rank   
   <int>   <int> <chr>                  <chr>  
1 145396     220 Toropamecia punctata   species
2 145398     220 Toropamecia reticulata species

attr(,"class")
[1] "children"
attr(,"db")
[1] "itis"
```

Session Info

```r
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 21.1

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] testthat_3.2.1 devtools_2.4.5 usethis_2.2.3

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 remotes_2.4.2.1   lattice_0.22-5    vctrs_0.6.5       tools_4.3.2       generics_0.1.3
 [7] curl_5.2.1        parallel_4.3.2    RSQLite_2.3.5     tibble_3.2.1      fansi_1.0.6       blob_1.2.4
[13] pkgconfig_2.0.3   data.table_1.15.0 dbplyr_2.4.0      uuid_1.2-0        lifecycle_1.0.4   conditionz_0.1.0
[19] compiler_4.3.2    stringr_1.5.1     brio_1.1.4        taxizedb_0.3.1    ritis_1.0.0       codetools_0.2-19
[25] httpuv_1.6.14     htmltools_0.5.7   later_1.3.2       pillar_1.9.0      crayon_1.5.2      urlchecker_1.0.1
[31] ellipsis_0.3.2    solrium_1.2.0     cachem_1.0.8      sessioninfo_1.2.2 iterators_1.0.14  foreach_1.5.2
[37] nlme_3.1-163      mime_0.12         tidyselect_1.2.0  digest_0.6.34     stringi_1.8.3     dplyr_1.1.4
[43] purrr_1.0.2       fastmap_1.1.1     grid_4.3.2        cli_3.6.2         magrittr_2.0.3    triebeard_0.4.1
[49] bold_1.3.0        crul_1.4.0        pkgbuild_1.4.3    utf8_1.2.4        ape_5.7-1         withr_3.0.0
[55] rappdirs_0.3.3    promises_1.2.1    bit64_4.0.5       bit_4.0.5         zoo_1.8-12        memoise_2.0.1
[61] shiny_1.8.0       taxize_0.9.100    miniUI_0.1.1.1    hoardr_0.5.4      urltools_1.7.3    profvis_0.3.8
[67] rlang_1.1.3       Rcpp_1.0.12       DBI_1.2.2         xtable_1.8-4      glue_1.7.0        httpcode_0.3.0
[73] xml2_1.3.6        pkgload_1.3.4     jsonlite_1.8.8    R6_2.5.1          plyr_1.8.9        fs_1.6.3
```

KaiAragaki commented 4 months ago

The only databases that taxize and taxizedb have in common are ncbi and itis (keeping in mind #80, which excludes worms and bold).

A deeper comparison:

NCBI

ITIS

where id = tsn, name = taxonname, and rank is similar to rankname (capitalization differs). rank_id has no equivalent.
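
To make the ITIS mapping concrete, here is a minimal sketch of how a taxizedb result could be relabelled to the taxize column names. `as_taxize_itis_children()` is a hypothetical helper, not part of either package, and the data frame is hand-built from the output shown in this issue rather than fetched from a database. It assumes `rank` and `rankname` differ only in the capitalization of the first letter, as in the example. `parentname` is left out because the taxizedb output does not carry it.

```r
# Hypothetical helper (not part of taxize or taxizedb): relabel the columns of
# one taxizedb ITIS children table using the taxize column names.
as_taxize_itis_children <- function(x, parent_id) {
  # Assumption: rank and rankname differ only in first-letter capitalization.
  capitalise <- function(s) paste0(toupper(substring(s, 1, 1)), substring(s, 2))
  data.frame(
    parenttsn = rep(as.character(parent_id), nrow(x)),
    rankname  = capitalise(x$rank),
    taxonname = x$name,
    tsn       = as.character(x$id),
    stringsAsFactors = FALSE
  )
}

# Hand-built copy of the taxizedb output shown in this issue.
db_out <- data.frame(
  id      = c(145396L, 145398L),
  rank_id = c(220L, 220L),
  name    = c("Toropamecia punctata", "Toropamecia reticulata"),
  rank    = c("species", "species"),
  stringsAsFactors = FALSE
)

converted <- as_taxize_itis_children(db_out, parent_id = 145395)
print(converted)
```

`rank_id` is dropped in this direction because, as noted, it has no taxize equivalent.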

It's your call if you think that the lack of harmony is a bug or a feature - frankly I quite like the standard interface of names that taxizedb provides, but if most people who use taxizedb are those moving from taxize, a more harmonized solution might be preferable.

stitam commented 4 months ago

Thanks @KaiAragaki for opening this issue.

It would be nice if taxize and taxizedb user interfaces were more harmonised but it's unclear to me whether the extra effort from the two teams to maintain this harmony would be justified. Currently these are independent projects.

What I think *is* an issue here is that the output of taxizedb::children() should have the same structure regardless of db. However, db = "itis" returns an extra column, rank_id, which I think is not needed. A named list of tables, each with three columns, "id", "name", "rank", in this order, regardless of db would probably be a more streamlined behaviour. What do you think?
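
As a rough illustration of the three-column layout suggested above, the sketch below trims each table in a children-style list down to "id", "name", "rank". `standardise_children()` is a hypothetical name used only for this example; it is not how taxizedb implements or would implement the change, and the input is hand-built from the ITIS output quoted in this issue.

```r
# Hypothetical helper (not taxizedb code): keep only id, name and rank, in that
# order, and fail loudly if any of them is missing.
standardise_children <- function(x) {
  wanted <- c("id", "name", "rank")
  missing_cols <- setdiff(wanted, names(x))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  x[, wanted, drop = FALSE]
}

# Hand-built stand-in for the list returned with db = "itis".
itis_out <- list(
  `145395` = data.frame(
    id      = c(145396L, 145398L),
    rank_id = c(220L, 220L),
    name    = c("Toropamecia punctata", "Toropamecia reticulata"),
    rank    = c("species", "species"),
    stringsAsFactors = FALSE
  )
)

# lapply() keeps the list names, so results stay keyed by the queried id.
standardised <- lapply(itis_out, standardise_children)
print(standardised)
```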

I'm okay with breaking changes: CRAN does not list any reverse dependencies, and the current version number clearly indicates that taxizedb is under development.