ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb

get_names result order #78

Closed lisafisler closed 3 years ago

lisafisler commented 3 years ago

Hello,

My issue is quite simple: the get_names function returns the correct names when I give it species IDs (here from "itis", but the same happens with "col"), but the results come back in the wrong order. For example, "ITIS:715228", which corresponds to the species Megapodius decollatus, appears as the first element in the second call below, although it should be second. This problem does not occur with the get_ids function, which returns results in the right order.

library(tidyverse)
library(taxadb)
td_create("itis")
get_names("ITIS:715228")
[1] "Megapodius decollatus"
get_names(c("ITIS:553896", "ITIS:715228", NA))
[1] "Megapodius decollatus" "Falcipennis canadensis" NA

Thank you for your help with this issue.

For info, my sessionInfo() gives out:

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] fr_CH.UTF-8/fr_CH.UTF-8/fr_CH.UTF-8/C/fr_CH.UTF-8/fr_CH.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.1 purrr_0.3.4 readr_1.3.1 tidyr_1.1.1 tibble_3.0.3
[8] ggplot2_3.3.2 tidyverse_1.3.0 taxadb_0.1.0

loaded via a namespace (and not attached):
[1] progress_1.2.2 tidyselect_1.1.0 haven_2.3.1 colorspace_1.4-1 vctrs_0.3.2 generics_0.0.2
[7] yaml_2.2.1 blob_1.2.1 rlang_0.4.7 pillar_1.4.6 glue_1.4.1 withr_2.2.0
[13] DBI_1.1.0 rappdirs_0.3.1 bit64_4.0.2 dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1
[19] lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0
[25] curl_4.3 fansi_0.4.1 broom_0.7.0 arkdb_0.0.5 Rcpp_1.0.5 backports_1.1.8
[31] scales_1.1.1 jsonlite_1.7.0 fs_1.5.0 bit_4.0.4 hms_0.5.3 digest_0.6.25
[37] stringi_1.4.6 duckdb_0.2.1 grid_4.0.2 cli_2.0.2 tools_4.0.2 magrittr_1.5
[43] RSQLite_2.2.0 crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2 prettyunits_1.1.1
[49] reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1 httr_1.4.2 rstudioapi_0.11 R6_2.4.1
[55] compiler_4.0.2

cboettig commented 3 years ago

Wow, that's crazy! Sorry about that. I'm having trouble reproducing this.

Does it do that without the NA entry too?

Here's my sessionInfo:


> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] taxadb_0.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        pillar_1.4.6      compiler_4.0.2    dbplyr_1.4.4      R.methodsS3_1.8.1 prettyunits_1.1.1 R.utils_2.10.1    tools_4.0.2       progress_1.2.2    bit_4.0.4         digest_0.6.25     packrat_0.5.0     MonetDBLite_0.6.1 RSQLite_2.2.0     jsonlite_1.7.1    evaluate_0.14     memoise_1.1.0     lifecycle_0.2.0   tibble_3.0.3     
[20] pkgconfig_2.0.3   rlang_0.4.7       DBI_1.1.0         rstudioapi_0.11   curl_4.3          yaml_2.2.1        xfun_0.17         duckdb_0.2.1      arkdb_0.0.6       dplyr_1.0.2       knitr_1.29        rappdirs_0.3.1    generics_0.0.2    vctrs_0.3.4       hms_0.5.3         bit64_4.0.5       tidyselect_1.1.0  glue_1.4.2        R6_2.4.1         
[39] rmarkdown_2.3     readr_1.3.1       purrr_0.3.4       blob_1.2.1        magrittr_1.5      codetools_0.2-16  ellipsis_0.3.1    htmltools_0.5.0   assertthat_0.2.1  stringi_1.5.3     crayon_1.3.4      R.oo_1.24.0      

cboettig commented 3 years ago

P.S. You may already know this, but in the meantime you can use filter_id etc. to get the full table, rather than relying on the ordering in get_names(). You might also try updating packages to the latest versions (e.g. via update.packages()).
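
As a minimal sketch of that workaround: once you have a full table, you can look up each input ID in it so that the names come back in input order regardless of the row order of the table. The data frame below is only a stand-in for what a table lookup might return; its column names (taxonID, scientificName) and its row order are assumptions for illustration, not taxadb output.

```r
# Stand-in for a table returned by a lookup such as filter_id();
# column names and row order are assumptions for illustration.
tbl <- data.frame(
  taxonID = c("ITIS:715228", "ITIS:553896"),
  scientificName = c("Megapodius decollatus", "Falcipennis canadensis"),
  stringsAsFactors = FALSE
)

ids <- c("ITIS:553896", "ITIS:715228", NA)

# match() gives, for each input ID, its row position in the table
# (NA when the ID is missing or not found), so indexing by it
# returns the names in the same order as the input.
names_in_order <- tbl$scientificName[match(ids, tbl$taxonID)]
names_in_order
```

Because the result is indexed by the input vector, it has one element per input ID, with NA where no row matched.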

lisafisler commented 3 years ago

Unfortunately, it does the same with or without the NA. I have updated all the packages I could, and there is no change.

Thanks yes, it works well with filter_id instead, even though it takes one more step to get to the information I want. Keep me posted if you can find the problem and I will work with filter_id meanwhile.

cboettig commented 3 years ago

Thanks, some database backends don't enforce consistent row-ordering. I've added an additional command to assert consistent order, can you please test again with the dev version?

remotes::install_github("ropensci/taxadb")
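
To illustrate the general idea (this is a sketch of one common approach, not necessarily what the dev version does): carry the input position along with each ID through the join, then sort on it afterwards. The lookup table and the input_pos column are hypothetical names used only for this example, and base R's merge() stands in for a backend that does not keep input order.

```r
ids <- c("ITIS:715228", "ITIS:553896")

# Hypothetical lookup table standing in for a database backend.
lookup <- data.frame(
  taxonID = c("ITIS:553896", "ITIS:715228"),
  scientificName = c("Falcipennis canadensis", "Megapodius decollatus"),
  stringsAsFactors = FALSE
)

# Record the position of each ID in the input.
input <- data.frame(
  taxonID = ids,
  input_pos = seq_along(ids),
  stringsAsFactors = FALSE
)

# merge() sorts on the key by default, so the input order is lost here.
joined <- merge(input, lookup, by = "taxonID", all.x = TRUE)

# Explicitly restore the input order using the recorded position.
joined <- joined[order(joined$input_pos), ]
joined$scientificName
```

The extra sort is a separate step on top of the lookup itself.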

lisafisler commented 3 years ago

Great, thanks! It seems to have done the trick! Hurray :-)

The only trouble I see is that the get_names function seems a bit slower now than with the previous version. It's only slightly slower, but since I can clearly see the difference with my small dataset of 3 species, I am worried that the slowdown would grow considerably with a huge dataset. But maybe this extra step always takes the same amount of time no matter how many species there are, in which case it wouldn't add much to the time needed and wouldn't be a problem in the end.

cboettig commented 3 years ago

Thanks! Interesting that it's noticeably slower. I don't think you'll see that scale linearly with a very large number of names. Can you tell me what td_connect() shows?

If the speed of get_names is important to your workflow, you could be our first beta tester for https://github.com/cboettig/taxalight/ ? :blush:

lisafisler commented 3 years ago

It gives the result almost instantly with the original version, and it takes approximately 3 seconds for one get_names request with the dev version.

> td_connect()
<duckdb_connection 27f60 driver=<duckdb_driver 05450 dbdir='/Users/lisafisler/Library/Application Support/taxadb/database/duckdb' read_only=FALSE>>

I don't really have a very big database, I was just concerned for people who do. But I'd be happy to test taxalight anyway! What are the main differences with taxadb?

cboettig commented 3 years ago

taxalight has only get_names(), get_ids(), and tl (which returns the taxonomic table for the requested species and/or IDs in question). You can't do operations on the full taxonomic database with taxalight, like asking "how many names are in the class Aves". It's also stricter about the matching: a scientific name must match case exactly, and there are no 'starts with' or similar options.

At the moment it only has accepted taxonomic identifiers and scientific names available as queries. We can probably add query by common name and query by synonym identifier (for authorities that assign IDs to synonyms).

lisafisler commented 3 years ago

Thanks! I'll give it a go.