Open yhan818 opened 2 months ago
If the author has multiple institutions, we track only the first in $institution_id but still track all in a flat (comma-separated string) structure in $institution_lineage:
oa_fetch_test1$author[[1]][2,]$institution_id
#> [1] "https://openalex.org/I58286723"
oa_fetch_test1$author[[1]][2,]$institution_lineage
#> [1] "https://openalex.org/I1329765538, https://openalex.org/I58286723"
~So fetching those 2 institution IDs from $institution_lineage gets back what you observed:~
oa_fetch_test1$author[[1]][2,]$institution_lineage |>
strsplit(", ") |>
el(1) |>
oa_fetch(entity = "institutions") |>
subset(, c("id", "display_name"))
#> # A tibble: 2 × 2
#>   id                               display_name
#>   <chr>                            <chr>
#> 1 https://openalex.org/I1329765538 Universities Space Research Association
#> 2 https://openalex.org/I58286723   Lunar and Planetary Institute
Ref: https://github.com/ropensci/openalexR/pull/155
Actually sorry that's not quite right. I still don't see "University of Arizona". I'm not sure whether the data structure allowed multiple institutions back when we first implemented this - @trangdata do you recall?
The structure for this "Malhotra" author is:
#> 'data.frame': 1 obs. of 12 variables:
#> $ au_id : chr "https://openalex.org/A5003933592"
#> $ au_display_name : chr "Renu Malhotra"
#> $ au_orcid : chr "https://orcid.org/0000-0002-1226-3305"
#> $ author_position : chr "middle"
#> $ is_corresponding : logi FALSE
#> $ au_affiliation_raw : chr "Lunar and Planetary Laboratory, The University of Arizona, USA"
#> $ institution_id : chr "https://openalex.org/I58286723"
#> $ institution_display_name: chr "Lunar and Planetary Institute"
#> $ institution_ror : chr "https://ror.org/01r4eh644"
#> $ institution_country_code: chr "US"
#> $ institution_type : chr "facility"
#> $ institution_lineage : chr "https://openalex.org/I1329765538, https://openalex.org/I58286723"
Thank you. It would be nice to have all the institutions available, given the number of cases like the one above. In my data, about 10% of works are affected.
There are multiple ways to implement this, such as a list() column or additional fields.
Thank you for this conversation @yhan818 and @yjunechoe. I think OpenAlex used to provide only one affiliation per author, and when they introduced more affiliations/institutions, we stuck with exporting only the first one for simplicity. But you're right, we could make these list columns.
OK so currently, we have the following columns for author, where institution_* refers to the first institution reported by OpenAlex.
oa_fetch_test1 <- openalexR::oa_fetch(entity = "works", id = "https://openalex.org/W4401226694")
oa_fetch_test1$author[[1]] |>
dplyr::select(au_affiliation_raw, starts_with("institution"))
#> au_affiliation_raw
#> 1 Department of Astronomy & Astrophysics, University of Toronto, Canada
#> 2 Lunar and Planetary Laboratory, The University of Arizona, USA
#> 3 Dept. of Physics and Astronomy, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208 and Center for Interdisciplinary Exploration and Research in Astrophysics (CIERA), USA
#> institution_id institution_display_name
#> 1 https://openalex.org/I185261750 University of Toronto
#> 2 https://openalex.org/I58286723 Lunar and Planetary Institute
#> 3 https://openalex.org/I111979921 Northwestern University
#> institution_ror institution_country_code institution_type
#> 1 https://ror.org/03dbr7087 CA education
#> 2 https://ror.org/01r4eh644 US facility
#> 3 https://ror.org/000e0be47 US education
#> institution_lineage
#> 1 https://openalex.org/I185261750
#> 2 https://openalex.org/I1329765538, https://openalex.org/I58286723
#> 3 https://openalex.org/I111979921
Created on 2024-09-08 with reprex v2.0.2
The question is, do we want to include affiliations and/or institutions as a list column, such that:
oa_fetch_test1$author[[1]]$affiliations
# [[1]]
# [[1]]$raw_affiliation_string
# [1] "Department of Astronomy & Astrophysics, University of Toronto, Canada"
#
# [[1]]$institution_ids
# [[1]]$institution_ids[[1]]
# [1] "https://openalex.org/I185261750"
#
#
#
# [[2]]
# [[2]]$raw_affiliation_string
# [1] "Lunar and Planetary Laboratory, The University of Arizona, USA"
#
# [[2]]$institution_ids
# [[2]]$institution_ids[[1]]
# [1] "https://openalex.org/I58286723"
#
# [[2]]$institution_ids[[2]]
# [1] "https://openalex.org/I138006243"
#
#
#
# [[3]]
# [[3]]$raw_affiliation_string
# [1] "Dept. of Physics and Astronomy, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208 and Center for Interdisciplinary Exploration and Research in Astrophysics (CIERA), USA"
#
# [[3]]$institution_ids
# [[3]]$institution_ids[[1]]
# [1] "https://openalex.org/I111979921"
oa_fetch_test1$author[[1]]$institutions
#> [[1]]
#> # A tibble: 1 × 6
#> id display_name ror country_code type lineage
#> <chr> <chr> <chr> <chr> <chr> <named list>
#> 1 https://openalex.org/I185261750 University of Toronto https://ror.org/03dbr7087 CA education <list [1]>
#>
#> [[2]]
#> # A tibble: 2 × 6
#> id display_name ror country_code type lineage
#> <chr> <chr> <chr> <chr> <chr> <named list>
#> 1 https://openalex.org/I58286723 Lunar and Planetary Institute https://ror.org/01r4eh644 US facility <list [2]>
#> 2 https://openalex.org/I138006243 University of Arizona https://ror.org/03m2x1q45 US education <list [1]>
#>
#> [[3]]
#> # A tibble: 1 × 6
#> id display_name ror country_code type lineage
#> <chr> <chr> <chr> <chr> <chr> <named list>
#> 1 https://openalex.org/I111979921 Northwestern University https://ror.org/000e0be47 US education <list [1]>
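For illustration only, a list column like institutions above could be assembled roughly as follows. This is a minimal base R sketch, not the package's implementation: the authorship record is a hand-written stand-in for the API response (values copied from the output above, no network call), and institutions_to_df is a hypothetical helper name.

```r
# Hypothetical stand-in for one authorship record from the OpenAlex API
# (values copied from the thread; nothing is fetched here).
authorship <- list(
  author = list(display_name = "Renu Malhotra"),
  institutions = list(
    list(id = "https://openalex.org/I58286723",
         display_name = "Lunar and Planetary Institute",
         ror = "https://ror.org/01r4eh644"),
    list(id = "https://openalex.org/I138006243",
         display_name = "University of Arizona",
         ror = "https://ror.org/03m2x1q45")
  )
)

# Hypothetical helper: bind a list of institution records into one data frame.
institutions_to_df <- function(insts) {
  fields <- c("id", "display_name", "ror")
  rows <- lapply(insts, function(x) {
    # NULL (missing) fields become NA so every row has the same columns.
    vals <- lapply(fields, function(f) {
      if (is.null(x[[f]])) NA_character_ else x[[f]]
    })
    as.data.frame(stats::setNames(vals, fields), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

author_df <- data.frame(
  au_display_name = authorship$author$display_name,
  stringsAsFactors = FALSE
)
# Assigning a list of length nrow(author_df) creates a list column:
# one data frame of institutions per author row.
author_df$institutions <- list(institutions_to_df(authorship$institutions))

author_df$institutions[[1]]$display_name
```

With this shape, the second institution (University of Arizona) stays attached to the author row instead of being dropped.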
What do we think? @yjunechoe @yhan818 What do we want to keep for backward compatibility? (again, I think it's good to keep in mind this change from one institution to more was from OpenAlex, so maybe a breaking change is necessary). Also note that there may be a cost in performance to do all this concatenation when we include everything like the lineage list column above.
According to the documentation:
Each institutional affiliation that this author has claimed will be listed here: the raw affiliation string that we found, along with the OpenAlex Institution ID or IDs that we matched it to. [affiliations] is redundant with [institutions], but is useful if you need to know about what we used to match institutions.
OpenAlex changed some outputs quite heavily in 2024. It has a new data model and has added new entities (e.g. grants).
In general, maintaining backward compatibility is good practice. For example, it avoids breaking code developed against the current openalexR.
Shall we add a new field (e.g. the author's affiliations) and leave the old ones untouched?
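One way to picture that option is sketched below. It is only an assumption about how it might look, not a proposal for the actual package code: the existing institution_* columns keep holding the first institution, and a new list column carries all of them. The data are hand-copied from this thread, and first_or_na is a hypothetical helper.

```r
# Hypothetical per-author institution tables (values from the thread).
institutions <- list(
  data.frame(
    id = "https://openalex.org/I185261750",
    display_name = "University of Toronto",
    stringsAsFactors = FALSE
  ),
  data.frame(
    id = c("https://openalex.org/I58286723",
           "https://openalex.org/I138006243"),
    display_name = c("Lunar and Planetary Institute",
                     "University of Arizona"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: first value of a column, or NA when there are no rows.
first_or_na <- function(df, col) {
  if (is.null(df) || nrow(df) == 0) NA_character_ else df[[col]][1]
}

authors <- data.frame(
  # Existing columns keep their current meaning: the first institution only.
  institution_id = vapply(institutions, first_or_na, character(1), col = "id"),
  institution_display_name = vapply(institutions, first_or_na, character(1),
                                    col = "display_name"),
  stringsAsFactors = FALSE
)
# The new list column carries every institution for each author.
authors$institutions <- institutions
```

Code that reads authors$institution_id keeps working unchanged, while new code can look at authors$institutions[[2]] to find the second institution.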
@yhan818 I agree generally it's good practice to maintain backward compatibility, but we do have to balance that out with other factors like cost of maintenance, computation, complexity, etc. I have shared this view before. To sum up, as a third-party package, I think it's important we try to mirror how OpenAlex changes.
Keeping up with OpenAlex changes is a moving target, and openalexR will always be running behind. But one could do the following, to offer both:
1. list, as it is essentially the response coming from OpenAlex, or as an alternative raw JSON as returned per page saved to files (see #271, https://github.com/ropensci/openalexR/issues/271#issuecomment-2338482943). This would always be backwards compatible.
2. tibble conversion, which needs to be called separately. This makes it possible to have backward compatible functions as well as follow new approaches at a later stage. Also, using the JSON with e.g. DuckDB would, for example, not require any conversion.
The problem would be step one, i.e. changing a default value, which will break compatibility, but this could be introduced over a few versions with a deprecation warning.
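A rough sketch of that two-step idea, with a deprecation warning before the default changes, might look like the following. Everything here is hypothetical: fetch_works_stub, works_to_df and the toy records are illustrative names and data, not the openalexR API, and a base data.frame stands in for the tibble to keep the example dependency-free.

```r
# Hypothetical converter, called separately from fetching.
works_to_df <- function(raw) {
  data.frame(
    id = vapply(raw, function(x) x$id, character(1)),
    title = vapply(raw, function(x) x$title, character(1)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical fetch function; the "response" is hard-coded toy data.
fetch_works_stub <- function(output = NULL) {
  raw <- list(
    list(id = "W1", title = "A"),
    list(id = "W2", title = "B")
  )
  if (is.null(output)) {
    # Transition period: keep the old default but announce the change.
    warning(
      "The default `output` will change to \"list\" in a future version; ",
      "set `output` explicitly to silence this warning.",
      call. = FALSE
    )
    output <- "data.frame"
  }
  output <- match.arg(output, c("data.frame", "list"))
  if (output == "list") raw else works_to_df(raw)
}

# Step 1: get the raw list, which mirrors the response as-is.
raw <- fetch_works_stub(output = "list")
# Step 2: convert separately, only when a table is needed.
df <- works_to_df(raw)
```

Calling fetch_works_stub() without output still returns a data frame during the transition, but warns, so existing scripts keep running while users are nudged to choose explicitly.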
Agreed with both of you in principle. Given the ongoing changes in OpenAlex, it is not yet mature, so backward compatibility may not be that important. I am fine with either approach.
I am conducting institutional-level citation analysis.
In some cases an author has multiple affiliations. A parent organization may have multiple child organizations. For example, the University of Arizona (ROR: https://ror.org/03m2x1q45) has multiple units, including the Lunar and Planetary Institute (https://ror.org/01r4eh644).
For certain works, an author has multiple institutions/affiliations associated with the work's metadata in OpenAlex.
library(openalexR)
oa_fetch_test1 <- oa_fetch(entity = "works", id = "https://openalex.org/W4401226694")
View(oa_fetch_test1[[4]][[1]])  # 4th column holds the per-work author data frames
The result contains only the row "2 https://openalex.org/I58286723 Lunar and Planetary Institute https://ror.org/01r4eh644".
So is oa_fetch() for "works" missing the additional institutions from OpenAlex's API data?