ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/
Other
98 stars 21 forks source link

Fix works2df when there are multiple institution lineages #155

Closed trangdata closed 1 year ago

trangdata commented 1 year ago

Resolves #154

yjunechoe commented 1 year ago

Wow you got this in quick! BTW After reading #156, just curious about two things:

1) Would it be overkill to preserve institution_lineage to a list column (as opposed to flattening it to character)?

So given something like `oa_fetch(identifier = "W1981492994")`, instead of `institution_lineage` being a character vector in this PR...

```r
oa_fetch(identifier = "W1981492994")$author[[1]]$institution_lineage
#> [1] ""                               "https://openalex.org/I51556381"
```

It could be a list:

```r
list(NULL, c("https://openalex.org/I51556381"))
#> [[1]]
#> NULL
#> 
#> [[2]]
#> [1] "https://openalex.org/I51556381"
```

2) When authors are missing the field entirely (like the first author in example above), the value is empty string - but do we want this to be NA instead? It's a product of paste()'s funky behavior when input is NULL

   paste(NULL, collapse = ",")
   #> [1] ""
trangdata commented 1 year ago

Thank you @yjunechoe for bringing up these points. I considered your first point for this implementation, but I think keeping institution_lineage as list right now is unnecessarily nested. Spotty wifi on the train atm but I think there are a few other fields that we flattened to character as well and that has worked quite well for simplification.

Regarding your 2nd point, I agree maybe removing all the "" element in the character vector with, say, setdiff could be the way to go. But then setdiff("", "") is character(0) which would then need to turn into NA_character_ before binding.

yjunechoe commented 1 year ago

Thanks for entertaining this on your train ride! 😆

Maybe more simply, I was thinking of swapping "" for NA when lineage info is missing. Could we just extend the existing empty_inst template to include "lineage" in the expected inst_cols:

https://github.com/ropensci/openalexR/blob/b450fe26a8e08e44f8af95643a4c0d9e53560a26/R/oa2df.R#L176-L177

And then wrap the paste() line in an if block to only apply if lineage data is present.

Going back to my example, what currently looks like this:

oa_fetch(identifier = "W1981492994")$author[[1]]$institution_lineage
#> [1] ""                               "https://openalex.org/I51556381"

We might want that to look like this:

oa_fetch(identifier = "W1981492994")$author[[1]]$institution_lineage
#> [1] NA                               "https://openalex.org/I51556381"

Of course we can also do this after the fact and replace() "" with NA, but thought we might as well extend the empty_inst template that we already use


After writing this and re-reading your comment, I realize that the author[[1]] part of my reprex might have been confusing. It's a dataframe of not one but two authors, where the first author has lineage missing but its present for the second. So what currently looks like this:

oa_fetch(identifier = "W1981492994")$author[[1]]
#>                              au_id au_display_name au_orcid author_position
#> 1 https://openalex.org/A5012422493 Arnold Beichman       NA           first
#> 2 https://openalex.org/A5015292781  Thomas Cushman       NA            last
#>                                     au_affiliation_raw                 institution_id
#> 1                                                                                <NA>
#> 2 Department of Sociology, University of Virginia, USA https://openalex.org/I51556381
#>   institution_display_name           institution_ror institution_country_code institution_type
#> 1                     <NA>                      <NA>                     <NA>             <NA>
#> 2   University of Virginia https://ror.org/0153tk833                       US        education
#>              institution_lineage
#> 1                               
#> 2 https://openalex.org/I51556381

We might want it to look like this (note the NA in first row of the institution_lineage column):

oa_fetch(identifier = "W1981492994")$author[[1]]
#>                              au_id au_display_name au_orcid author_position
#> 1 https://openalex.org/A5012422493 Arnold Beichman       NA           first
#> 2 https://openalex.org/A5015292781  Thomas Cushman       NA            last
#>                                     au_affiliation_raw                 institution_id
#> 1                                                                                <NA>
#> 2 Department of Sociology, University of Virginia, USA https://openalex.org/I51556381
#>   institution_display_name           institution_ror institution_country_code institution_type
#> 1                     <NA>                      <NA>                     <NA>             <NA>
#> 2   University of Virginia https://ror.org/0153tk833                       US        education
#>              institution_lineage
#> 1                           <NA>
#> 2 https://openalex.org/I51556381
trangdata commented 1 year ago

Thank you so much @yjunechoe for clarifying with the clear example. Yes, this makes sense, and I prefer NA. Utilizing the existing empty_inst template will also make it a lot more elegant! (I completely forgot we had this!) Could you submit a PR when you have a chance?