Closed trangdata closed 1 year ago
Wow you got this in quick! BTW After reading #156, just curious about two things:
1) Would it be overkill to preserve institution_lineage
to a list column (as opposed to flattening it to character)?
So given something like `oa_fetch(identifier = "W1981492994")`, instead of `institution_lineage` being a character vector in this PR...
```r
oa_fetch(identifier = "W1981492994")$author[[1]]$institution_lineage
#> [1] "" "https://openalex.org/I51556381"
```
It could be a list:
```r
list(NULL, c("https://openalex.org/I51556381"))
#> [[1]]
#> NULL
#>
#> [[2]]
#> [1] "https://openalex.org/I51556381"
```
2) When authors are missing the field entirely (like the first author in example above), the value is empty string - but do we want this to be NA
instead? It's a product of paste()
's funky behavior when input is NULL
paste(NULL, collapse = ",")
#> [1] ""
Thank you @yjunechoe for bringing up these points. I considered your first point for this implementation, but I think keeping institution_lineage as list right now is unnecessarily nested. Spotty wifi on the train atm but I think there are a few other fields that we flattened to character as well and that has worked quite well for simplification.
Regarding your 2nd point, I agree maybe removing all the ""
element in the character vector with, say, setdiff
could be the way to go. But then setdiff("", "")
is character(0)
which would then need to turn into NA_character_
before binding.
Thanks for entertaining this on your train ride! 😆
Maybe more simply, I was thinking of swapping ""
for NA
when lineage info is missing. Could we just extend the existing empty_inst
template to include "lineage"
in the expected inst_cols
:
And then wrap the paste()
line in an if
block to only apply if lineage data is present.
Going back to my example, what currently looks like this:
oa_fetch(identifier = "W1981492994")$author[[1]]$institution_lineage
#> [1] "" "https://openalex.org/I51556381"
We might want that to look like this:
oa_fetch(identifier = "W1981492994")$author[[1]]$institution_lineage
#> [1] NA "https://openalex.org/I51556381"
Of course we can also do this after the fact and replace()
""
with NA
, but thought we might as well extend the empty_inst
template that we already use
After writing this and re-reading your comment, I realize that the author[[1]]
part of my reprex might have been confusing. It's a dataframe of not one but two authors, where the first author has lineage missing but its present for the second. So what currently looks like this:
oa_fetch(identifier = "W1981492994")$author[[1]]
#> au_id au_display_name au_orcid author_position
#> 1 https://openalex.org/A5012422493 Arnold Beichman NA first
#> 2 https://openalex.org/A5015292781 Thomas Cushman NA last
#> au_affiliation_raw institution_id
#> 1 <NA>
#> 2 Department of Sociology, University of Virginia, USA https://openalex.org/I51556381
#> institution_display_name institution_ror institution_country_code institution_type
#> 1 <NA> <NA> <NA> <NA>
#> 2 University of Virginia https://ror.org/0153tk833 US education
#> institution_lineage
#> 1
#> 2 https://openalex.org/I51556381
We might want it to look like this (note the NA
in first row of the institution_lineage
column):
oa_fetch(identifier = "W1981492994")$author[[1]]
#> au_id au_display_name au_orcid author_position
#> 1 https://openalex.org/A5012422493 Arnold Beichman NA first
#> 2 https://openalex.org/A5015292781 Thomas Cushman NA last
#> au_affiliation_raw institution_id
#> 1 <NA>
#> 2 Department of Sociology, University of Virginia, USA https://openalex.org/I51556381
#> institution_display_name institution_ror institution_country_code institution_type
#> 1 <NA> <NA> <NA> <NA>
#> 2 University of Virginia https://ror.org/0153tk833 US education
#> institution_lineage
#> 1 <NA>
#> 2 https://openalex.org/I51556381
Thank you so much @yjunechoe for clarifying with the clear example. Yes, this makes sense, and I prefer NA. Utilizing the existing empty_inst
template will also make it a lot more elegant! (I completely forgot we had this!) Could you submit a PR when you have a chance?
Resolves #154