ropensci / rcrossref

R client for various CrossRef APIs
https://docs.ropensci.org/rcrossref

UTF-8 error for format citeproc-json with new crossref API #221

Closed jdblischak closed 1 year ago

jdblischak commented 3 years ago

I just installed the latest GitHub version to solve the issue with the polite pool described in #218. However, now I am getting a new error, which I suspect is due to the UTF-8 encoding changes in the new crossref API discussed in #216.

The problematic DOI is 10.5194/bg-2021-40. Both the first author's surname (Müller) and the title (pCO2) contain special characters. This worked without issue prior to the recent API changes.

library("rcrossref")
packageVersion("rcrossref")
doi <- "10.5194/bg-2021-40" 
x <- rcrossref::cr_cn(doi)
x
# fails
json <- rcrossref::cr_cn(doi, format = "citeproc-json")

Here is what I see when I run the code above:

> library("rcrossref")
> packageVersion("rcrossref")
[1] ‘1.1.0.99’
> doi <- "10.5194/bg-2021-40"
> x <- rcrossref::cr_cn(doi)
> x
[1] "@article{M_ller_2021,\n\tdoi = {10.5194/bg-2021-40},\n\turl = {https://doi.org/10.5194%2Fbg-2021-40},\n\tyear = 2021,\n\tmonth = {mar},\n\tpublisher = {Copernicus {GmbH}},\n\tauthor = {Jens Daniel Müller and Bernd Schneider and Ulf Gräwe and Peer Fietzek and Marcus Bo Wallin and Anna Rutgersson and Norbert Wasmund and Siegfried Krüger and Gregor Rehder},\n\ttitle = {Cyanobacteria net community production in the Baltic Sea as\n\t\tinferred from profiling {\\&}lt$\\mathsemicolon$i{\\&}gt$\\mathsemicolon$p{\\&}lt$\\mathsemicolon$/i{\\&}gt$\\mathsemicolon${CO}{\\&}lt$\\mathsemicolon$sub{\\&}gt$\\mathsemicolon$2{\\&}lt$\\mathsemicolon$/sub{\\&}gt$\\mathsemicolon$ measurements}\n}"
> # fails
> json <- rcrossref::cr_cn(doi, format = "citeproc-json")
Error in nchar(hh) : invalid multibyte string, element 1

Searching for the error message, I found this SO post that solves the problem by converting the text to UTF-8.
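The conversion mentioned above can be sketched in base R. This is only an illustration and not the code from the linked post: the byte sequence is a made-up Latin-1 input (the surname from this DOI), and the assumption that the source encoding is Latin-1 is mine.

```r
# Assumed input: "Müller" encoded as Latin-1, where 0xfc is the "ü".
# In UTF-8 that byte sequence is invalid, which is the kind of input
# that makes nchar() complain about an "invalid multibyte string".
raw_bytes <- as.raw(c(0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72))
x <- rawToChar(raw_bytes)

# validUTF8() checks the bytes without depending on the session locale.
is_valid_before <- validUTF8(x)

# Re-encode from the assumed source encoding to UTF-8.
fixed <- iconv(x, from = "latin1", to = "UTF-8")
is_valid_after <- validUTF8(fixed)

# After conversion, character-level functions work again.
n_chars <- nchar(fixed)
```

If the source encoding is guessed wrong, `iconv()` may return `NA` or produce the wrong characters, so this only helps when the actual encoding of the response is known.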

Session Info

```r
R version 4.0.5 (2021-03-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS/LAPACK: /lib/libopenblasp-r0.3.15.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rcrossref_1.1.0.99

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        pillar_1.6.1      compiler_4.0.5    later_1.2.0
 [5] plyr_1.8.6        tools_4.0.5       digest_0.6.27     jsonlite_1.7.2
 [9] lifecycle_1.0.0   tibble_3.1.2      pkgconfig_2.0.3   rlang_0.4.11
[13] shiny_1.6.0       crul_1.1.0        curl_4.3.1        fastmap_1.1.0
[17] xml2_1.3.2        dplyr_1.0.6       stringr_1.4.0     generics_0.1.0
[21] vctrs_0.3.8       htmlwidgets_1.5.3 triebeard_0.3.0   DT_0.18
[25] tidyselect_1.1.1  glue_1.4.2        httpcode_0.3.0    R6_2.5.0
[29] fansi_0.5.0       purrr_0.3.4       magrittr_2.0.1    urltools_1.7.3
[33] promises_1.2.0.1  ellipsis_0.3.2    htmltools_0.5.1.1 mime_0.10
[37] xtable_1.8-4      httpuv_1.6.1      utf8_1.2.1        stringi_1.6.2
[41] miniUI_0.1.1.1    crayon_1.4.1
```

njahn82 commented 3 years ago

Thanks for your report. I can confirm this API behavior. As it seems to be a server issue, I don't think I will be able to fix this. Maybe the option citeproc-json-ish works for you instead, as it does not seem to be affected by this encoding issue?

rcrossref::cr_cn("10.5194/bg-2021-40", format = "citeproc-json", verbose = T)
#> Error in nchar(hh): invalid multibyte string, element 1

# Request directly
httr::GET("https://api.crossref.org/works/10.5194/bg-2021-40/transform/application/vnd.citationstyles.csl+json")
#> Error in substring(u, so, so + ml - 1L): invalid multibyte string, element 1

# instead, try
rcrossref::cr_cn("10.5194/bg-2021-40", format = "citeproc-json-ish")
#> $indexed
#> $indexed$`date-parts`
#>      [,1] [,2] [,3]
#> [1,] 2021    5    7
#> 
#> $indexed$`date-time`
#> [1] "2021-05-07T14:54:01Z"
#> 
#> $indexed$timestamp
#> [1] 1.620399e+12
#> 
#> 
#> $posted
#> $posted$`date-parts`
#>      [,1] [,2] [,3]
#> [1,] 2021    3    1
#> 
#> 
#> $`group-title`
#> [1] "Biogeochemistry: Organic Biogeochemistry"
#> 
#> $`reference-count`
#> [1] 0
#> 
#> $publisher
#> [1] "Copernicus GmbH"
#> 
#> $license
#>                                            URL start.date-parts
#> 1 https://creativecommons.org/licenses/by/4.0/       2021, 3, 1
#>        start.date-time start.timestamp delay-in-days content-version
#> 1 2021-03-01T00:00:00Z    1.614557e+12             0             vor
#> 
#> $`content-domain`
#> $`content-domain`$domain
#> list()
#> 
#> $`content-domain`$`crossmark-restriction`
#> [1] FALSE
#> 
#> 
#> $abstract
#> [1] "<jats:p>Abstract. Organic matter production by cyanobacteria blooms is a major environmental concern for the Baltic Sea as it promotes thespread of anoxic zones. Partial pressure of carbon dioxide (pCO2) measurements carried out on Ships of Opportunity (SOOP) since 2003 have proven to be a powerful tool to resolve the carbon dynamics of the blooms in space and time. However, SOOP measurements lack the possibility to directly constrain the depth–integrated net community production (NCP) due to their restriction to the sea surface. This study tackles the resulting knowledge gap through (1) providing a best–guess NCP estimatefor an individual cyanobacteria bloom based on repeated profiling measurements of pCO2 and (2) establishing an algorithm to accurately reconstruct depth–integrated NCP from surface pCO2 observations in combination with modelled temperature profiles. Goal (1) was achieved by deploying state–of–the–art sensor technology from a small–scale sailing vessel. The low–cost and flexible platform enabled observations covering an entire bloom event that occurred in July and August 2018 in the Eastern Gotland Sea. For the biogeochemical interpretation, recorded pCO2 profiles were converted to CT*, which is the dissolved inorganic carbon concentration normalised to alkalinity. We found that the investigated Nodularia–dominated bloom event had many biogeochemical characteristics in common with blooms in previous years. In particular, it lasted for about three weeks, caused a CT* drawdown of 80 μmol kg−1, and was accompanied by a sea surface temperature increase of 10 °C. The novel finding of this study is the vertical extension of the CT* drawdown up to 12 m water depth. Integration of the CT* drawdown across this depth and correction for vertical fluxes permit a best–guess NCP estimate of ~1.2 mol–C m−2. Addressing goal (2), we combined modelled hydrographical profiles with surface pCO2 observations recorded by SOOP Finnmaid within the study area. 
Introducing the temperature penetration depth (TPD) as a new parameter to integrate SOOP observations across depth, we achieve a reconstructed NCP estimate that agrees to the best–guess within 10 %. Applying the TPD approach to almost two decades of surface pCO2 observations available for the Baltic Sea bears the potential to provide new insights into the control and long–term trends of cyanobacteria NCP. This understanding is key for an effective design and monitoring of conservation measures aiming at a Good Environmental Status of the Baltic Sea.\n                        </jats:p>"
#> 
#> $DOI
#> [1] "10.5194/bg-2021-40"
#> 
#> $type
#> [1] "posted-content"
#> 
#> $created
#> $created$`date-parts`
#>      [,1] [,2] [,3]
#> [1,] 2021    3    1
#> 
#> $created$`date-time`
#> [1] "2021-03-01T14:12:12Z"
#> 
#> $created$timestamp
#> [1] 1.614608e+12
#> 
#> 
#> $source
#> [1] "Crossref"
#> 
#> $`is-referenced-by-count`
#> [1] 0
#> 
#> $title
#> [1] "Cyanobacteria net community production in the Baltic Sea as\ninferred from profiling &lt;i&gt;p&lt;/i&gt;CO&lt;sub&gt;2&lt;/sub&gt; measurements"
#> 
#> $prefix
#> [1] "10.5194"
#> 
#> $author
#>                                  ORCID authenticated-orcid       given
#> 1 http://orcid.org/0000-0003-3137-0883               FALSE Jens Daniel
#> 2                                 <NA>                  NA       Bernd
#> 3 http://orcid.org/0000-0003-4007-9764               FALSE         Ulf
#> 4 http://orcid.org/0000-0002-3555-1115               FALSE        Peer
#> 5 http://orcid.org/0000-0002-3082-8728               FALSE   Marcus Bo
#> 6 http://orcid.org/0000-0001-7656-1881               FALSE        Anna
#> 7 http://orcid.org/0000-0002-7979-7176               FALSE     Norbert
#> 8                                 <NA>                  NA   Siegfried
#> 9 http://orcid.org/0000-0002-0597-9989               FALSE      Gregor
#>       family   sequence affiliation
#> 1     Müller      first        NULL
#> 2  Schneider additional        NULL
#> 3      Gräwe additional        NULL
#> 4    Fietzek additional        NULL
#> 5     Wallin additional        NULL
#> 6 Rutgersson additional        NULL
#> 7    Wasmund additional        NULL
#> 8     Krüger additional        NULL
#> 9     Rehder additional        NULL
#> 
#> $member
#> [1] "3145"
#> 
#> $`container-title`
#> list()
#> 
#> $`original-title`
#> list()
#> 
#> $deposited
#> $deposited$`date-parts`
#>      [,1] [,2] [,3]
#> [1,] 2021    5    7
#> 
#> $deposited$`date-time`
#> [1] "2021-05-07T13:58:32Z"
#> 
#> $deposited$timestamp
#> [1] 1.620396e+12
#> 
#> 
#> $score
#> [1] 1
#> 
#> $subtitle
#> list()
#> 
#> $`short-title`
#> list()
#> 
#> $issued
#> $issued$`date-parts`
#>      [,1] [,2] [,3]
#> [1,] 2021    3    1
#> 
#> 
#> $`references-count`
#> [1] 0
#> 
#> $URL
#> [1] "http://dx.doi.org/10.5194/bg-2021-40"
#> 
#> $relation
#> $relation$`has-review`
#>   id-type                     id asserted-by
#> 1     doi 10.5194/bg-2021-40-RC1     subject
#> 2     doi 10.5194/bg-2021-40-RC2     subject
#> 
#> $relation$`has-comment`
#>   id-type                     id asserted-by
#> 1     doi 10.5194/bg-2021-40-AC1     subject
#> 2     doi 10.5194/bg-2021-40-AC2     subject
#> 
#> 
#> $subtype
#> [1] "preprint"

Created on 2021-08-30 by the reprex package (v2.0.0)

larsvilhuber commented 2 years ago

I get a similar problem:

> bibinfo.test <- cr_cn("10.1257/app.20130369", format = "bibentry")
Error in cr_GET(endpoint = sprintf("works/%s/agency", x), args = list(),  : 
  res$response_headers$`content-type` == "application/json;charset=UTF-8" is not TRUE
> bibinfo.test <- cr_cn("10.1257/app.20130369", format = "citeproc-json")
Error in cr_GET(endpoint = sprintf("works/%s/agency", x), args = list(),  : 
  res$response_headers$`content-type` == "application/json;charset=UTF-8" is not TRUE
> packageVersion("rcrossref")
[1] ‘1.1.0’

This previously worked.
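For context, the error comes from an exact string comparison against "application/json;charset=UTF-8", so any variation in the header (different case, added whitespace, another charset, or no charset at all) makes it fail. A more tolerant check could look like the sketch below. The helper name `is_json_content_type` is hypothetical and is not part of rcrossref, and the header values are made-up examples, since the thread does not show what the server actually returned.

```r
# Hypothetical helper: accept any Content-Type whose media type is
# application/json, ignoring case, whitespace, and parameters.
is_json_content_type <- function(content_type) {
  # Keep only the part before the first ";" (the media type).
  media_type <- strsplit(content_type, ";", fixed = TRUE)[[1]][1]
  identical(trimws(tolower(media_type)), "application/json")
}

# The strict comparison fails as soon as the header differs at all.
strict_ok <- "application/json" == "application/json;charset=UTF-8"

# The tolerant check accepts the common variations.
tolerant_ok <- is_json_content_type("Application/JSON; charset=utf-8")
```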

FlukeAndFeather commented 2 years ago

I think this issue is in ropensci/crul client.R:526. crul doesn't seem to handle non-UTF-8 response headers. I'm running into the same issue with DOI:10.1126/science.aax9044.
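To illustrate the general idea of honoring the declared encoding rather than assuming UTF-8, here is a minimal base R sketch. It is not crul's implementation: the helpers `charset_from_content_type` and `decode_body` are hypothetical names, and the example header and body are made up.

```r
# Hypothetical helper: read the charset parameter from a Content-Type
# header value, falling back to a default when none is declared.
charset_from_content_type <- function(content_type, default = "UTF-8") {
  m <- regmatches(
    content_type,
    regexec("charset=\"?([^;\"[:space:]]+)", content_type, ignore.case = TRUE)
  )[[1]]
  if (length(m) == 2) toupper(m[2]) else default
}

# Hypothetical helper: decode a raw response body to UTF-8 text using
# the charset declared in the header.
decode_body <- function(body_raw, content_type) {
  enc <- charset_from_content_type(content_type)
  iconv(rawToChar(body_raw), from = enc, to = "UTF-8")
}

# Made-up example: "Müller" sent as ISO-8859-1 bytes.
body <- as.raw(c(0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72))
decoded <- decode_body(body, "application/json;charset=ISO-8859-1")
```

The set of encoding names that `iconv()` accepts is platform-dependent, so a real client would need a fallback for unknown or misspelled charsets.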

schiavone1 commented 2 years ago

Has anyone figured out a way to get around this error? I've installed the latest dev version and am having no luck.

Error in cr_GET(endpoint = sprintf("works/%s/agency", x), args = list(), : res$response_headers$`content-type` == "application/json;charset=UTF-8" is not TRUE

FlukeAndFeather commented 2 years ago

@schiavone1 I don't think there's a workaround possible from the rcrossref side. I made a pull request to fix the underlying issue in crul. You might try installing my version of crul (remotes::install_github("FlukeAndFeather/crul", ref = "enc_detect")) and see if that works for you.

njahn82 commented 2 years ago

Great, thank you @FlukeAndFeather! I will wait for the crul merge.

wcerfgba commented 2 years ago

This has been fixed now as of crul 1.2. I updated my crul with install.packages but I am still getting this error with rcrossref. Is there a way I can force rcrossref to use my installed version of crul?

Bisaloo commented 2 years ago

@wcerfgba, you also need to install the development version of rcrossref (remotes::install_github("ropensci/rcrossref")) because there is one check still present in the CRAN version (removed in https://github.com/ropensci/rcrossref/commit/2382789118ff44e89234ca826ba7327f09ecb660).
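A quick way to check which versions are actually installed is sketched below. The helper `meets_minimum` is a hypothetical name. The crul threshold of 1.2 comes from this thread. A version number cannot confirm whether an installed rcrossref already contains the commit linked above, so the sketch only reports the rcrossref version.

```r
# Hypothetical helper: compare version strings with package_version(),
# which orders "1.1.0.99" after "1.1.0" (plain string comparison does not
# handle multi-part versions reliably).
meets_minimum <- function(installed, minimum) {
  package_version(installed) >= package_version(minimum)
}

# Only query packages that are installed, so this runs anywhere.
if (requireNamespace("crul", quietly = TRUE)) {
  crul_version <- as.character(utils::packageVersion("crul"))
  message("crul ", crul_version, " meets >= 1.2: ",
          meets_minimum(crul_version, "1.2"))
}
if (requireNamespace("rcrossref", quietly = TRUE)) {
  message("rcrossref ", as.character(utils::packageVersion("rcrossref")))
}
```

Restart the R session after installing, since an already loaded namespace keeps using the old version.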

jdblischak commented 2 years ago

I confirmed that my original issue is now fixed by installing crul 1.2 and the dev version of rcrossref. Please let me know if there is any additional testing I can do to help get this updated version submitted to CRAN.

library("rcrossref")
packageVersion("rcrossref")
## [1] ‘1.1.0.99’
doi <- "10.5194/bg-2021-40"
x <- rcrossref::cr_cn(doi)
json <- rcrossref::cr_cn(doi, format = "citeproc-json")
json$title
## [1] "Cyanobacteria net community production in the Baltic Sea as\ninferred from profiling &lt;i&gt;p&lt;/i&gt;CO&lt;sub&gt;2&lt;/sub&gt; measurements"

jdblischak commented 1 year ago

Any update on a future CRAN release? Anything I can do to help with this?

njahn82 commented 1 year ago

Hi @jdblischak, an updated version of rcrossref is now available on CRAN.

jdblischak commented 1 year ago

@njahn82 Thanks so much! I confirmed that the latest CRAN release fixes my issue.