ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

BUG: `au_orcid` is of type `logical` when all authors have no ORCID (should be `character`) #220

Open rkrug opened 3 months ago

rkrug commented 3 months ago

Hi

There is a bug in the conversion from the oa results to a data.frame / tibble. When none of the authors of a work has an ORCID, the column `au_orcid` is of type `logical`, while the other columns are, as expected, of type `character`. This causes problems because I want to save these as Parquet files, which does not work if the objects are of different types.

Replacing `NA` with `as.character(NA)` in the appropriate places would probably fix this issue. I assume the same can occur in other fields.

Thanks,

Rainer

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       2 obs. of  6 variables:
 $ id              : chr  "https://openalex.org/W2101204002" "https://openalex.org/W2159758382"
 $ author          :List of 2
  ..$ :'data.frame':    7 obs. of  11 variables:
  .. ..$ au_id                   : chr  "https://openalex.org/A5066931706" "https://openalex.org/A5047672302" "https://openalex.org/A5006051784" "https://openalex.org/A5037636565" ...
  .. ..$ au_display_name         : chr  "Philippe Cury" "Andrew Bakun" "Robert J. M. Crawford" "Astrid Jarre" ...
  .. ..$ au_orcid                : chr  NA "https://orcid.org/0000-0002-4366-3846" NA "https://orcid.org/0000-0002-0690-6183" ...
  .. ..$ author_position         : chr  "first" "middle" "middle" "middle" ...
  .. ..$ au_affiliation_raw      : chr  "Institut de Recherche pour le Développement (IRD), Marine and Coastal ManagementPrivate Bag X2, Rogge Bay 8012,"| __truncated__ "University of Cape Town, Department of Oceanography7701 Rondebosch, South Africa" "Marine and Coastal ManagementPrivate Bag X2, 8012 Rogge Bay, Cape Town, South Africa" "Danish Institute for Fisheries Research, North Sea CentrePO Box 101, 9850 Hirtshals, Denmark" ...
  .. ..$ institution_id          : chr  "https://openalex.org/I157614274" "https://openalex.org/I157614274" NA NA ...
  .. ..$ institution_display_name: chr  "University of Cape Town" "University of Cape Town" NA NA ...
  .. ..$ institution_ror         : chr  "https://ror.org/03p74gp79" "https://ror.org/03p74gp79" NA NA ...
  .. ..$ institution_country_code: chr  "ZA" "ZA" NA NA ...
  .. ..$ institution_type        : chr  "education" "education" NA NA ...
  .. ..$ institution_lineage     : chr  "https://openalex.org/I157614274" "https://openalex.org/I157614274" NA NA ...
  ..$ :'data.frame':    2 obs. of  11 variables:
  .. ..$ au_id                   : chr  "https://openalex.org/A5023041174" "https://openalex.org/A5028708328"
  .. ..$ au_display_name         : chr  "Wilfred M. Post" "Kyeol Kwon"
  .. ..$ au_orcid                : logi  NA NA
  .. ..$ author_position         : chr  "first" "last"
  .. ..$ au_affiliation_raw      : chr  "Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6335, USA;" "Chemical Engineering Department, Tuskeegee University, Tuskeegee, AL 36088, USA"
  .. ..$ institution_id          : chr  "https://openalex.org/I1289243028" "https://openalex.org/I6026837"
  .. ..$ institution_display_name: chr  "Oak Ridge National Laboratory" "Tuskegee University"
  .. ..$ institution_ror         : chr  "https://ror.org/01qz5mb56" "https://ror.org/0137n4m74"
  .. ..$ institution_country_code: chr  "US" "US"
  .. ..$ institution_type        : chr  "facility" "education"
  .. ..$ institution_lineage     : chr  "https://openalex.org/I1289243028, https://openalex.org/I39565521, https://openalex.org/I4210159294" "https://openalex.org/I6026837"
 $ ab              : chr  "In upwelling ecosystems, there is often a crucial intermediate trophic level, occupied by small, plankton-feedi"| __truncated__ "Summary When agricultural land is no longer used for cultivation and allowed to revert to natural vegetation or"| __truncated__
 $ publication_year: int  2000 2000
 $ doi             : chr  "https://doi.org/10.1006/jmsc.2000.0712" "https://doi.org/10.1046/j.1365-2486.2000.00308.x"
 $ page            : num  1 1
trangdata commented 3 months ago

A quick fix would be to add a parameter to `replace_w_na`:

replace_w_na <- function(x, y = NA){
  lapply(x, `%||%`, y = y)
}

and then on this line https://github.com/ropensci/openalexR/blob/558581c6dbb43c65cd2003be8545e88fd4ed4ef7/R/oa2df.R#L230

do

prepend(replace_w_na(l$author, NA_character_), "au")
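
To illustrate what the extra parameter changes, here is a minimal, self-contained sketch. The `%||%` helper is defined locally as an assumption (a plain null-coalescing operator), and the `author` record is made up for illustration; it is not necessarily the structure handled in `oa2df.R`.

```r
# Assumed null-coalescing helper: return y when x is NULL, otherwise x.
`%||%` <- function(x, y) if (is.null(x)) y else x

replace_w_na <- function(x, y = NA) {
  lapply(x, `%||%`, y = y)
}

# Hypothetical author record whose ORCID is missing (NULL).
author <- list(
  id = "https://openalex.org/A5023041174",
  display_name = "Wilfred M. Post",
  orcid = NULL
)

default_fill <- replace_w_na(author)               # fills with logical NA
typed_fill <- replace_w_na(author, NA_character_)  # fills with character NA

typeof(default_fill$orcid)  # "logical"
typeof(typed_fill$orcid)    # "character"
```

Note that this fills every `NULL` field with a character `NA`, so it only fits where all fields are expected to be character.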

But you're right, this can happen in other fields as well. `rbind.data.frame` automatically converts the `NA` to the type of the non-`NA` values, but if all values are `NA`, they remain of type `logical`. I'll wait to see if others can chime in.
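
The coercion behaviour can be reproduced outside the package with a small sketch. The two one-row data frames are made up for illustration; only the column name is taken from the output above.

```r
# One author without an ORCID (logical NA) and one with an ORCID (character).
no_orcid <- data.frame(au_orcid = NA, stringsAsFactors = FALSE)
has_orcid <- data.frame(
  au_orcid = "https://orcid.org/0000-0002-4366-3846",
  stringsAsFactors = FALSE
)

mixed <- rbind(no_orcid, has_orcid)       # NA is coerced to character
all_missing <- rbind(no_orcid, no_orcid)  # nothing to coerce to

typeof(mixed$au_orcid)        # "character"
typeof(all_missing$au_orcid)  # "logical"
```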

rkrug commented 3 months ago

Is there any chance that you could take a look at this? The easiest might be to define a template with the expected types, and then use these? I have only an extremely slow fix running to correct this and I am working with millions of records... Thanks.
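
A rough sketch of what such a template could look like, as a possible workaround and not something the package provides: `author_template` and `coerce_to_template` are hypothetical names, the template lists only a few columns, and the example data is made up.

```r
# Hypothetical template: expected type for a few author columns.
author_template <- c(
  au_id = "character",
  au_display_name = "character",
  au_orcid = "character"
)

# Coerce every templated column that is present in df to its expected type.
coerce_to_template <- function(df, template) {
  for (col in intersect(names(template), names(df))) {
    df[[col]] <- as.vector(df[[col]], mode = template[[col]])
  }
  df
}

# Made-up list column mimicking two works; the second has no ORCIDs at all.
authors <- list(
  data.frame(
    au_id = c("A1", "A2"),
    au_orcid = c(NA, "https://orcid.org/0000-0002-4366-3846")
  ),
  data.frame(au_id = c("A3", "A4"), au_orcid = c(NA, NA))
)

fixed <- lapply(authors, coerce_to_template, template = author_template)
orcid_types <- vapply(fixed, function(d) typeof(d$au_orcid), character(1))
orcid_types  # "character" "character"
```

Applied to a real result, this would be something like `works$author <- lapply(works$author, coerce_to_template, template = author_template)`. Whether it is fast enough for millions of records is untested.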