ropensci / taxadb

:package: Taxonomic Database
https://docs.ropensci.org/taxadb

`td_create` Failed to connect #91

Closed DanieleVeri closed 2 years ago

DanieleVeri commented 3 years ago

[image: screenshot of the error]

I also tried specifying version=2019, but I got the same error.

R version: 4.0.5 RStudio version: 1.4.1106

cboettig commented 3 years ago

Thanks for reporting. Try updating contentid to v0.0.10 (install.packages("contentid") should do it, though RSPM mirrors might be a day behind, so install.packages("contentid", repos = "https://cran.r-project.org") should get the latest).

lucas-jardim commented 3 years ago

It did not work. The error ("Error in curl::curl_fetch_memory(file, handle): Could not resolve host: hash-archive.org") persists even with version 0.0.10.

kguidonimartins commented 3 years ago

+1

pic-selected-210428-1843-11

lucas-jardim commented 3 years ago

contentid::resolve is not finding the hash.

error.pdf

cboettig commented 3 years ago

Thanks, folks, and apologies. Some sources, like ITIS, are resolvable by other hosts, but it looks like GBIF is not. While https://hash-archive.org is still down, we can just switch to a different default resolver, e.g. try:

Sys.setenv("CONTENTID_REGISTRIES"  = "https://hash-archive.thelio.carlboettiger.info")
td_create("gbif")

Once hash-archive is back up, the default should work again; meanwhile, the above should get things working.
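
A minimal sketch of the same workaround, wrapped so that the setting is undone afterwards. The helper name with_registry is hypothetical (it is not part of contentid or taxadb), and the sketch assumes contentid reads CONTENTID_REGISTRIES at call time, as the snippet above suggests:

```r
# Hypothetical helper: run `code` with CONTENTID_REGISTRIES temporarily
# pointing at `registry`, then restore the previous value (or unset it).
with_registry <- function(registry, code) {
  old <- Sys.getenv("CONTENTID_REGISTRIES", unset = NA)
  Sys.setenv(CONTENTID_REGISTRIES = registry)
  on.exit({
    if (is.na(old)) {
      Sys.unsetenv("CONTENTID_REGISTRIES")
    } else {
      Sys.setenv(CONTENTID_REGISTRIES = old)
    }
  }, add = TRUE)
  # `code` is a lazily evaluated argument, so it only runs here,
  # after the environment variable has been set.
  force(code)
}

# Usage (needs taxadb installed and network access):
# with_registry("https://hash-archive.thelio.carlboettiger.info",
#               taxadb::td_create("gbif"))
```

Restoring the old value means the default resolver is used again in the same session once hash-archive.org is back.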

lucas-jardim commented 3 years ago

Great! Thank you!

kguidonimartins commented 3 years ago

Thanks, Carl!

Everything working now.

skeyser commented 2 years ago

Hi Carl,

I am also having some issues with td_create. When running the following code I receive an error.

library("taxadb")
td_create("col")

Error:
Error in switch(compression, gzip = gzfile(path, ...), bz2 = bzfile(path, :
  EXPR must be a length 1 vector
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  No sources found for hash://sha256/1b5941879daf2e771bceaea533b3b5271117531fde20131be0a7f970a59b1a23
2: In FUN(X[[i]], ...) :
  No sources found for hash://sha256/6aa4beb2feafbb599e2f58621aa0700974a00199f89f543d7e8b8b283c074e27

Any ideas? taxadb was working fine in my workflow prior to updating to R 4.1.0.

cboettig commented 2 years ago

Try this:

Sys.setenv("CONTENTID_REGISTRIES"  = "https://hash-archive.carlboettiger.info")
td_create("col")

I'll release a new version soon that should include more backup registries to avert this issue when the main registry, https://hash-archive.org, is unreachable.

joelnitta commented 2 years ago

The solution posted by @cboettig on April 29 seems to not be working for GBIF anymore (neither does just trying td_create("gbif") without setting CONTENTID_REGISTRIES):

library(taxadb)

Sys.setenv("CONTENTID_REGISTRIES"  = "https://hash-archive.thelio.carlboettiger.info")
td_create("gbif")
#> Warning in FUN(X[[i]], ...): No sources found for hash://
#> sha256/367cf3b0501efb32005de26ea093bedd43d5d7e759b4b3929a3a884927274925
#> Warning in FUN(X[[i]], ...): No sources found for hash://
#> sha256/0d0a61f8122f1a13cb1fcdb8e13065f5d95071873c84639cc2e39fa131ce99e6
#> Error in initialize(value, ...): duckdb_startup_R: Failed to open database: IO Error: Could not set lock on file "/Users/joelnitta/Library/Application Support/taxadb/duckdb": Resource temporarily unavailable

Created on 2021-10-14 by the reprex package (v2.0.0)

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.0 (2021-05-18)
#>  os       macOS Catalina 10.15.7
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Asia/Tokyo
#>  date     2021-10-14
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source
#>  arkdb         0.0.12  2021-04-05 [1] CRAN (R 4.1.0)
#>  askpass       1.1     2019-01-13 [1] CRAN (R 4.1.0)
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.2.1   2020-12-09 [1] CRAN (R 4.1.0)
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.0)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
#>  blob          1.2.2   2021-07-23 [1] CRAN (R 4.1.0)
#>  cachem        1.0.6   2021-08-19 [1] CRAN (R 4.1.0)
#>  cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.0)
#>  contentid     0.0.12  2021-08-08 [1] CRAN (R 4.1.0)
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
#>  curl          4.3.2   2021-06-23 [1] CRAN (R 4.1.0)
#>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
#>  dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
#>  digest        0.6.28  2021-09-23 [1] CRAN (R 4.1.0)
#>  dplyr         1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
#>  duckdb        0.3.0   2021-10-08 [1] CRAN (R 4.1.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
#>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.0)
#>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
#>  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.1.0)
#>  knitr         1.36    2021-09-29 [1] CRAN (R 4.1.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.0)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
#>  memoise       2.0.0   2021-01-26 [1] CRAN (R 4.1.0)
#>  openssl       1.4.5   2021-09-02 [1] CRAN (R 4.1.0)
#>  pillar        1.6.3   2021-09-26 [1] CRAN (R 4.1.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.1.0)
#>  progress      1.2.2   2019-05-16 [1] CRAN (R 4.1.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.0)
#>  R.utils       2.10.1  2020-08-26 [1] CRAN (R 4.1.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
#>  rappdirs      0.3.3   2021-01-31 [1] CRAN (R 4.1.0)
#>  Rcpp          1.0.7   2021-07-07 [1] CRAN (R 4.1.0)
#>  readr         2.0.2   2021-09-27 [1] CRAN (R 4.1.0)
#>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.1.0)
#>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
#>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.0)
#>  RSQLite       2.2.8   2021-08-21 [1] CRAN (R 4.1.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi       1.7.5   2021-10-04 [1] CRAN (R 4.1.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.0)
#>  taxadb      * 0.1.3   2021-10-13 [1] Github (ropensci/taxadb@d003b2e)
#>  tibble        3.1.5   2021-09-30 [1] CRAN (R 4.1.0)
#>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
#>  tzdb          0.1.2   2021-07-20 [1] CRAN (R 4.1.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun          0.26    2021-09-14 [1] CRAN (R 4.1.0)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
```

cboettig commented 2 years ago

@joelnitta apologies, we've been trying to update some plumbing on the data access. Can you try installing:

remotes::install_github("cboettig/contentid")

restart R, and see if it resolves the issue?

joelnitta commented 2 years ago

Seems to be working now, thanks!

ScaonE commented 2 years ago

Hello,

Using R 4.1.2, I seem to have the same issue.

I did the following:

devtools::install_github("ropensci/taxadb")
remotes::install_github("cboettig/contentid")
Sys.setenv("CONTENTID_REGISTRIES" = "https://hash-archive.carlboettiger.info")

But when trying to run a command from the README: get_ids("Trochalopteron henrici gucenense")

I get:

Warning in FUN(X[[i]], ...) :
  No sources found for hash://sha256/ae98e3de1cadd69c064aa7aeb26b89251b49926207d41f8185af28d5f7a8853d
Native bulk importer found, attempting fast import of NA
Native import failed, falling back on R-based parser
Error in switch(compression, gzip = gzfile(path, ...), bz2 = bzfile(path, :
  EXPR must be a length 1 vector

cboettig commented 2 years ago

Apologies! I've been moving a few things around; I should have this back up soon.

ScaonE commented 2 years ago

No problem.

Atm if I try: get_ids("Trochalopteron henrici gucenense")

It doesn't return an error anymore, but "NA".

Quick question: Should I expect different results from taxadb::get_ids() and taxize::get_ids()?

cboettig commented 2 years ago

@ScaonE thanks for the follow-up. Yes, NA is the expected value (though there should be a warning about multiple matches that isn't showing up at the moment).

Yes, you should in general expect different results from taxadb::get_ids() and taxize::get_ids() -- taxadb will return precisely one output for every input name you give it, and it will return NA when the query is ambiguous.

If you use filter_name(), you will see that this name, according to ITIS (the provider you are querying by default), is a synonym for two different recognized species:

library(dplyr)  # provides %>% and select()
taxadb::filter_name("Trochalopteron henrici gucenense") %>%
  select(taxonID, scientificName, taxonomicStatus, taxonRank,
         acceptedNameUsageID, genus, specificEpithet)
# A tibble: 2 × 7
  taxonID     scientificName                   taxonomicStatus taxonRank  acceptedNameUsageID genus          specificEpithet
  <chr>       <chr>                            <chr>           <chr>      <chr>               <chr>          <chr>          
1 ITIS:924962 trochalopteron henrici gucenense synonym         subspecies ITIS:916116         Trochalopteron elliotii       
2 ITIS:924962 trochalopteron henrici gucenense synonym         subspecies ITIS:916117         Trochalopteron henrici 

(I've selected a subset of columns for simplicity).

We get back two rows for this single name because ITIS recognizes this scientific name as a synonym used to describe what it recognizes as two distinct species, Trochalopteron elliotii and Trochalopteron henrici, aka ITIS:916116 and ITIS:916117 respectively. get_ids() intentionally does not return both IDs, since that is a recipe for mistakes -- e.g. it could artificially inflate the number of species in the data, and it leads to problems when users request multiple species (a vector of species) in a single get_ids() call. Returning NA seems the best choice here.

Note that ITIS also assigns identifiers to the synonyms themselves; this synonym is known as ITIS:924962. Some users might therefore expect this to be the ID returned by this query. We do not do that in taxadb because many other providers do not assign identifiers to synonyms at all, only to accepted names. By default, taxadb::get_ids() will always return acceptedNameUsageID identifiers. If the name you use as input is an accepted name, then its taxonID and acceptedNameUsageID are the same ID (since the name is accepted). If it is a synonym, we give the accepted ID. You will need to use filter_name() as above to address multiple matches. Note that this may require further research, and the 'correct' species designation may depend on your research context (e.g. in this case, this name appears to have been used by the IOC World Bird List, but is considered by ITIS to be a hybrid between the two recognized species, not a member of a taxonomically valid subspecies population of either).

If you try taxize::get_ids() on this, I believe you just get an error (probably due to the multiple matches). taxize also tries to query multiple providers (ITIS, NCBI, EOL, etc.) at once, whereas taxadb requires you to specify a single naming provider. This is important because there is no guarantee taxonomic names are consistent across providers -- what is a synonym according to one naming provider is an accepted name to another, etc. Any collection of taxonomic names is essentially a theory of taxonomy, not a simple observable data point, and reasonable experts can disagree about classification. An advantage of using IDs in the first place instead of scientific names is that the researcher declares which naming provider they have in mind -- i.e. the ID embeds the context that we don't have when given a name such as "Trochalopteron henrici gucenense".
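
To illustrate the "exactly one output per input, NA when ambiguous" rule described above, here is a minimal base-R sketch. It is not taxadb's implementation: the toy lookup table and the helper name resolve_ids are hypothetical, and the IDs are simply the ones quoted in this thread.

```r
# Toy lookup table; the two synonym rows mirror the filter_name() output above.
lookup <- data.frame(
  scientificName = c("Trochalopteron henrici gucenense",
                     "Trochalopteron henrici gucenense",
                     "Homo sapiens"),
  acceptedNameUsageID = c("ITIS:916116", "ITIS:916117", "ITIS:180092"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return one accepted ID per input name,
# or NA when the name is unknown or maps to more than one accepted ID.
resolve_ids <- function(names, table) {
  vapply(names, function(nm) {
    ids <- unique(table$acceptedNameUsageID[table$scientificName == nm])
    if (length(ids) == 1) ids else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

result <- resolve_ids(
  c("Trochalopteron henrici gucenense", "Homo sapiens", "Not a name"),
  lookup
)
print(result)
```

The output always has the same length as the input, which is what keeps vectorised calls safe; the ambiguous and the unknown name both come back as NA and have to be inspected in the table form.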

ScaonE commented 2 years ago

@cboettig, thank you for the detailed answer, it does make more sense to me now.

To be honest, I'm just trying to find an R alternative to the excellent name2taxid command from taxonkit.

My shortlist of functions to test are:

If possible, keep us posted when taxadb::get_ids() goes back to returning the expected results. If I take another example command from the vignette:

taxadb::get_ids(c("Midas bicolor", "Homo sapiens"), format = "prefix")

It still returns NA NA for me atm.

Best regards

cboettig commented 2 years ago

@ScaonE Thanks for the follow-up; details like the ones you provide really help clarify users' needs and make this a better resource for everyone.

Yeah, apologies that the NAs are frustrating. That should be fixed now, e.g.

 taxadb::get_ids(c("Midas bicolor", "Homo sapiens"), format = "prefix")
[1] "ITIS:572923" "ITIS:180092"

(The previous version had lower-cased species names, as you would see if you used the function taxadb::filter_name(), which provides more information as detailed above).

To throw another option in there, you should get faster performance with the taxalight package:

 taxalight::get_ids(c("Midas bicolor", "Homo sapiens"))
[1] "ITIS:572923" "ITIS:180092"

It looks like taxonkit is an NCBI product? You probably want to compare to the NCBI names instead of ITIS names. Note that NCBI won't recognize all the same names: e.g. "Midas bicolor" isn't an accepted species name, and though ITIS recognizes it as an invalid synonym for the actual species, Saguinus bicolor, and maps its ID accordingly, NCBI will simply return NA for Midas bicolor.

Depending on the taxonomic group of interest, you may find that NCBI and ITIS can resolve far fewer total species names than larger providers of taxonomic data such as COL or OTT.

ScaonE commented 2 years ago

Again, thanks for this answer.

To make things clear on my end: My inputs are taxonomic abundance tables which have been assigned against NCBI databases. My goal is to retrieve the associated NCBI taxids (to end up converting them to the bioboxes format). Notice the column "TAXPATH" in the bioboxes format, within which you need NCBI taxids for the entire lineage. My current script to convert to the bioboxes format is working, but I'd like to avoid the non-R step (aka taxonkit).

> That should be fixed now

Yes it is on my end:

taxadb::get_ids(
  names = c("Midas bicolor", "Homo sapiens"),
  db = "itis"
  )

[1] "ITIS:572923" "ITIS:180092"

If I try to switch to "ncbi" database:

taxadb::get_ids(
  names = c("Midas bicolor", "Homo sapiens"),
  db = "ncbi"
  )

[1] NA NA

taxalight::tl_create("ncbi")
taxalight::get_ids(
  name = c("Midas bicolor", "Homo sapiens"),
  provider  = "ncbi"
)

Midas bicolor  Homo sapiens
           NA            NA

> It looks like taxonkit is an NCBI product? You probably want to compare to the NCBI names instead of ITIS names. Note that NCBI won't recognize all the same names, e.g. "Midas bicolor" isn't an accepted species name

For comparison, below are the results of taxonkit name2taxid on the same names:

more test.txt

Midas bicolor
Homo sapiens

cat test.txt | taxonkit name2taxid

Midas bicolor
Homo sapiens    9606

cboettig commented 2 years ago

Apologies, names are still lower-cased in the NCBI table. Once again, you'll see that taxadb::filter_name is a bit more robust in this case:

taxadb::filter_name(
  name = c("Midas bicolor", "Homo sapiens"),
  provider  = "ncbi"
)

Again, having a table returned that shows the matched names also helps with multiple matches.

Currently, lower-case names should work for NCBI in either taxadb or taxalight (sorry, that should be fixed soon; updating NCBI has been delayed since they somewhat recently altered their export format).

taxalight::get_ids("homo sapiens",
  provider  = "ncbi"
)
homo sapiens 
 "NCBI:9606" 

I'll report back when the new NCBI names are up.
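
Until the table is updated, one way to cope is to retry unmatched names in lower case. The sketch below is only an illustration: ids_with_lowercase_fallback and the stand-in lookup fake_ncbi are hypothetical names, and the stand-in merely imitates the lower-cased NCBI table described above.

```r
# Hypothetical helper: look names up as given, then retry any that came
# back NA using their lower-cased form. `lookup` is any function that maps
# a character vector of names to a character vector of IDs (NA = no match).
ids_with_lowercase_fallback <- function(names, lookup) {
  ids <- lookup(names)
  missing <- is.na(ids)
  if (any(missing)) {
    ids[missing] <- lookup(tolower(names[missing]))
  }
  ids
}

# Stand-in for a provider whose table only holds lower-cased names.
fake_ncbi <- function(x) unname(c("homo sapiens" = "NCBI:9606")[x])

ids <- ids_with_lowercase_fallback(c("Midas bicolor", "Homo sapiens"), fake_ncbi)
print(ids)

# With the real package (needs taxalight installed and a local NCBI database):
# ids_with_lowercase_fallback(
#   c("Midas bicolor", "Homo sapiens"),
#   function(x) unname(taxalight::get_ids(x, provider = "ncbi"))
# )
```

Names that are still unmatched after the retry stay NA, so the result keeps one entry per input.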

cboettig commented 2 years ago

NCBI names should be back up now.

You will probably have to purge your earlier local database to remove the NCBI copy that had lowercase names, e.g.

fs::dir_delete(taxadb::taxadb_dir())

or for taxalight,

fs::dir_delete(taxadb::taxalight_dir())

and then the package should rebuild the local database with the most recent NCBI data. Please re-open this issue or open a new one if there's any further unexpected behavior. Thanks again for taking the time to report these; it's much appreciated.
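
A slightly more defensive version of the purge step, as a base-R sketch. The helper name purge_local_db is hypothetical; it only wraps unlink() so that a missing directory is reported instead of causing an error.

```r
# Hypothetical helper (base R only): delete a local database directory
# if it exists. Returns TRUE (invisibly) when the directory was removed.
purge_local_db <- function(dir) {
  if (!dir.exists(dir)) {
    message("Nothing to purge at ", dir)
    return(invisible(FALSE))
  }
  unlink(dir, recursive = TRUE)
  invisible(!dir.exists(dir))
}

# Usage (assumes taxadb is installed):
# purge_local_db(taxadb::taxadb_dir())
```

It is probably safest to run this in a fresh R session, so that no open database connection is still holding files in that directory.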