ropensci / rdatacite

Wrapper for DataCite metadata
https://docs.ropensci.org/rdatacite

Error in readBin: How to find the problematic metadata set at the source? #27

Closed katrinleinweber closed 4 years ago

katrinleinweber commented 4 years ago

I'm running into a problem when downloading GBIF's metadata records:

```r
> dc_works("prefix:10.15468", rows = 99999L)
Error in readBin(x, character()) :
  R character strings are limited to 2^31-1 bytes
```

I'm guessing that's because they submitted a very large file encoded in their JSON/XML upload to DataCite. Is there a more elegant way of finding out which DOI is the problematic one than:

  1. bisecting via the rows parameter combined with a given order,
  2. skipping the problematic row and downloading a few more with offset = row+1 (see the sketch after this list), or
  3. looking at the gap in date or DOI and trying to find the missing item on GBIF.org?
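
For illustration, a minimal sketch of option 2, assuming `dc_works()` accepts an `offset` parameter alongside `rows`, and with `bad_row` as a placeholder for whichever index the bisection points at:

```r
# hypothetical: once bisection points at a suspect row, skip it and fetch a few more
bad_row <- 1234  # placeholder index found by bisecting via `rows`
before  <- dc_works("prefix:10.15468", rows = bad_row - 1)
after   <- dc_works("prefix:10.15468", rows = 10, offset = bad_row + 1)
# compare dates/DOIs on either side of the gap to identify the missing record on GBIF.org
```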
Session Info

```r
─ Session info ──────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.6.2 (2019-12-12)
 os       macOS Catalina 10.15.2
 system   x86_64, darwin15.6.0
 ui       RStudio
 language en
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Copenhagen
 date     2019-12-15

─ Packages ──────────────────────────────────────────────────────────────────
 ! package     * version date       lib source
   assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
   backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.0)
   callr         3.4.0   2019-12-09 [1] CRAN (R 3.6.0)
   cli           2.0.0   2019-12-09 [1] CRAN (R 3.6.0)
   colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.0)
   crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
   crul          0.9.0   2019-11-06 [1] CRAN (R 3.6.0)
   curl          4.3     2019-12-02 [1] CRAN (R 3.6.0)
   desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
   devtools    * 2.2.1   2019-09-24 [1] CRAN (R 3.6.0)
   digest        0.6.23  2019-11-23 [1] CRAN (R 3.6.0)
   dplyr       * 0.8.3   2019-07-04 [1] CRAN (R 3.6.0)
   ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.0)
   fansi         0.4.0   2018-10-05 [1] CRAN (R 3.6.0)
 R fd          * 0.1.0              [?]
   fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
   ggplot2     * 3.2.1   2019-08-10 [1] CRAN (R 3.6.0)
   glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
   gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.0)
   hms           0.5.2   2019-10-30 [1] CRAN (R 3.6.0)
   httpcode      0.2.0   2016-11-14 [1] CRAN (R 3.6.0)
   httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.0)
   jsonlite      1.6     2018-12-07 [1] CRAN (R 3.6.0)
   knitr         1.26    2019-11-12 [1] CRAN (R 3.6.0)
   lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.6.0)
   lifecycle     0.1.0   2019-08-01 [1] CRAN (R 3.6.0)
   lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.6.0)
   magrittr    * 1.5     2014-11-22 [1] CRAN (R 3.6.0)
   memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
   munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.0)
   oai           0.3.0   2019-09-07 [1] CRAN (R 3.6.0)
   pillar        1.4.2   2019-06-29 [1] CRAN (R 3.6.0)
   pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.0)
   pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.0)
   pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
   plyr          1.8.5   2019-12-10 [1] CRAN (R 3.6.0)
   prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
   processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.0)
   ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
   purrr         0.3.3   2019-10-18 [1] CRAN (R 3.6.0)
   R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.0)
   Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.0)
   rdatacite     0.4.2   2019-05-07 [1] CRAN (R 3.6.0)
   readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.6.0)
   remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.0)
   rlang         0.4.2   2019-11-23 [1] CRAN (R 3.6.0)
   rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
   rstudioapi    0.10    2019-03-19 [1] CRAN (R 3.6.0)
   scales        1.1.0   2019-11-18 [1] CRAN (R 3.6.0)
   sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
   solrium       1.1.4   2019-11-02 [1] CRAN (R 3.6.0)
   stringi     * 1.4.3   2019-03-12 [1] CRAN (R 3.6.0)
   stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
   testthat    * 2.3.1   2019-12-01 [1] CRAN (R 3.6.0)
   tibble        2.1.3   2019-06-06 [1] CRAN (R 3.6.0)
   tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.6.0)
   triebeard     0.3.0   2016-08-04 [1] CRAN (R 3.6.0)
   urltools      1.7.3   2019-04-14 [1] CRAN (R 3.6.0)
   usethis     * 1.5.1   2019-07-04 [1] CRAN (R 3.6.0)
   vctrs         0.2.0   2019-07-05 [1] CRAN (R 3.6.0)
   withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
   xfun          0.11    2019-11-12 [1] CRAN (R 3.6.0)
   xml2          1.2.2   2019-08-09 [1] CRAN (R 3.6.0)
   zeallot       0.1.0   2018-01-28 [1] CRAN (R 3.6.0)

[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

 R ── Package was removed from disk
```
sckott commented 4 years ago

thanks for the report @katrinleinweber - having a look

katrinleinweber commented 4 years ago

Thank you :-)

Shortly afterwards, I also started seeing `Error in solr_error(res) : (504) Gateway Timeout - The gateway server did not receive a timely response`, and colleagues of mine also get the 504 error on the web interface they use.

Nothing is written about this on https://status.datacite.org/ as of now, but I hear that a database upgrade/migration is being conducted. I presume the 504 error is related to that, not to the above-described GBIF issue.

cc @mfenner

sckott commented 4 years ago

looks like the `xml` field in the output can be very, very large, causing some issues

sckott commented 4 years ago

so the error is on DataCite's side, not GBIF's, correct?

katrinleinweber commented 4 years ago

Depends ;-) If GBIF exceeded some limit when submitting that "meta"data, one could argue it was an error on their side. I'm seeing that error when using `rdatacite::dc_works()`, though, so the download is coming from DataCite.org at that moment.

sckott commented 4 years ago

@katrinleinweber I couldn't replicate the exact error you had, but I did get an error that I think is related to your problem. Anyway, I added a `discard_xml` parameter to `dc_works()` to delete the `xml` field before returning the result to the console. The problem, I think, is that the very long base64-encoded XML string gets printed by the base R method `print.data.frame`, and apparently there is a limit on how long a string can be for that method.
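
For example, a minimal sketch assuming `discard_xml` is a logical flag:

```r
# drop the (potentially huge) base64-encoded xml field before the result is printed
z <- dc_works("prefix:10.15468", rows = 15, discard_xml = TRUE)
```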

A way to make the data.frame output more readable is, e.g.:

```r
z <- dc_works("prefix:10.15468", rows = 15)
# tibbles truncate long columns when printed, so the huge xml field won't flood the console
z$data <- tibble::as_tibble(z$data)
z
```

also, the max `rows` setting I think is 1000; added that to the docs
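
If more than 1000 records are needed, paging might look roughly like this; a sketch only, assuming the `offset` parameter mentioned above and that the `$data` slots can be row-bound:

```r
# hypothetical paging loop: fetch 5000 records in pages of 1000
offsets <- seq(0, 4000, by = 1000)
pages <- lapply(offsets, function(off) {
  dc_works("prefix:10.15468", rows = 1000, offset = off)$data
})
all_works <- do.call(rbind, pages)
```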

katrinleinweber commented 4 years ago

> `discard_xml`

Thank you :-)

> [...] max `rows` setting I think is 1000; added that to the docs

I bisected my way to `rows = 99999L`, which seemed to work. `100000L` and higher resulted in 403 Forbidden errors.

sckott commented 4 years ago

weird, 403 is an authentication error, hmmm

mfenner commented 4 years ago

@katrinleinweber and @sckott we retired our Solr service last Thursday, completing the transition to Elasticsearch. The Solr API that rdatacite is using was officially deprecated in January 2019, and we made multiple announcements about it in the past.

@sckott let me know if you need help transitioning to the DataCite REST API.
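
For reference, a minimal sketch of querying the REST API directly with httr/jsonlite; it assumes the `/dois` endpoint accepts `prefix` and `page[size]` query parameters as documented on support.datacite.org, and the refactored rdatacite interface may wrap this differently:

```r
library(httr)
library(jsonlite)

# fetch the first page of DOIs registered under the GBIF prefix
res <- GET("https://api.datacite.org/dois",
           query = list(prefix = "10.15468", `page[size]` = 25))
stop_for_status(res)
works <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
# the JSON:API response nests record metadata under data$attributes
head(works$data$attributes$doi)
```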

sckott commented 4 years ago

thanks @mfenner - will do

sckott commented 4 years ago

this function is now gone in the refactor branch; closing