ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

oa_fetch() missing additional author's institutions? #270

Open yhan818 opened 2 months ago

yhan818 commented 2 months ago

I am conducting institutional-level citation analysis.

In some cases an author has multiple affiliations. A parent organization may have multiple child organizations. For example, the University of Arizona (ROR: https://ror.org/03m2x1q45) has multiple units, including the Lunar and Planetary Institute (https://ror.org/01r4eh644).

For certain works, an author has multiple institutions/affiliations associated with the work's metadata in OpenAlex.

  1. If I fetch the work's data using openalexR:

oa_fetch_test1 <- oa_fetch(entity = "works", id = "https://openalex.org/W4401226694")
view(oa_fetch_test1[[4]][[1]])

Row 2 has "https://openalex.org/I58286723 Lunar and Planetary Institute https://ror.org/01r4eh644" only.

  2. If I go back to OpenAlex's API, https://api.openalex.org/works/W4401226694 has both "Lunar and Planetary Institute" and "University of Arizona".

So is oa_fetch() for "works" missing the additional institutions from OpenAlex's API data?


yjunechoe commented 2 months ago

If the author has multiple institutions, we track only the first in $institution_id but still track all in a flat (comma-separated string) structure in $institution_lineage:

oa_fetch_test1$author[[1]][2,]$institution_id
#> [1] "https://openalex.org/I58286723"

oa_fetch_test1$author[[1]][2,]$institution_lineage
#> [1] "https://openalex.org/I1329765538, https://openalex.org/I58286723"

~So fetching those 2 institution IDs from $institution_lineage gets back what you observed:~

oa_fetch_test1$author[[1]][2,]$institution_lineage |> 
  strsplit(", ") |> 
  el(1) |> 
  oa_fetch(entity = "institutions") |> 
  subset(, c("id", "display_name"))
#> # A tibble: 2 × 2
#>   id                               display_name                           
#>   <chr>                            <chr>                                  
#> 1 https://openalex.org/I1329765538 Universities Space Research Association
#> 2 https://openalex.org/I58286723   Lunar and Planetary Institute

Ref: https://github.com/ropensci/openalexR/pull/155


Actually sorry that's not quite right. I still don't see "University of Arizona". I'm not sure whether the data structure allowed multiple institutions back when we first implemented this - @trangdata do you recall?

The structure for this "Malhotra" author is:

#> 'data.frame':    1 obs. of  12 variables:
#>  $ au_id                   : chr "https://openalex.org/A5003933592"
#>  $ au_display_name         : chr "Renu Malhotra"
#>  $ au_orcid                : chr "https://orcid.org/0000-0002-1226-3305"
#>  $ author_position         : chr "middle"
#>  $ is_corresponding        : logi FALSE
#>  $ au_affiliation_raw      : chr "Lunar and Planetary Laboratory, The University of Arizona, USA"
#>  $ institution_id          : chr "https://openalex.org/I58286723"
#>  $ institution_display_name: chr "Lunar and Planetary Institute"
#>  $ institution_ror         : chr "https://ror.org/01r4eh644"
#>  $ institution_country_code: chr "US"
#>  $ institution_type        : chr "facility"
#>  $ institution_lineage     : chr "https://openalex.org/I1329765538, https://openalex.org/I58286723"

yhan818 commented 2 months ago

Thank you. It would be nice to have all the institutions available, given the number of cases like the one above. In my data, about 10% of works are affected.

There are multiple ways to implement this, such as a list() column or an additional field.

trangdata commented 2 months ago

Thank you for this conversation @yhan818 and @yjunechoe. I think OpenAlex used to provide only one affiliation per author, and when they introduced more affiliations/institutions, we stuck with exporting only the first one for simplicity. But you're right, we could make these list columns.

https://github.com/ropensci/openalexR/blob/774aff7c6160163bd7b28960e864425079b8d5a2/R/oa2df.R#L222-L236
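
To make the list-column idea concrete, here is a minimal sketch in base R. It works on a hand-built list that mimics one `authorships` entry of the API response (values copied from this thread, not fetched live). The helper `institutions_df()` is hypothetical and not part of openalexR; it also assumes every institution has `id`, `display_name`, and `ror` filled in.

```r
# Mock of one `authorships` entry, shaped like the OpenAlex API response.
# Values are taken from this thread; this is not a live API call.
authorship <- list(
  author = list(
    id = "https://openalex.org/A5003933592",
    display_name = "Renu Malhotra"
  ),
  institutions = list(
    list(
      id = "https://openalex.org/I58286723",
      display_name = "Lunar and Planetary Institute",
      ror = "https://ror.org/01r4eh644"
    ),
    list(
      id = "https://openalex.org/I138006243",
      display_name = "University of Arizona",
      ror = "https://ror.org/03m2x1q45"
    )
  )
)

# Hypothetical helper: bind all institutions of one author into a data frame.
# Assumes each institution has non-NULL id, display_name, and ror.
institutions_df <- function(institutions) {
  rows <- lapply(institutions, function(x) {
    data.frame(
      id = x$id,
      display_name = x$display_name,
      ror = x$ror,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# One row per author; `institutions` is a list column holding a data frame.
author <- data.frame(
  au_id = authorship$author$id,
  au_display_name = authorship$author$display_name,
  stringsAsFactors = FALSE
)
author$institutions <- list(institutions_df(authorship$institutions))

# Both institutions are now reachable from the author row.
author$institutions[[1]]$display_name
```

With this shape, the author table keeps one row per author while no institution is dropped.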

trangdata commented 2 months ago

OK so currently, we have the following columns for author, where institution_* refers to the first institution reported by OpenAlex.

oa_fetch_test1 <- openalexR::oa_fetch(entity = "works", id = "https://openalex.org/W4401226694")
oa_fetch_test1$author[[1]] |> 
  dplyr::select(au_affiliation_raw, starts_with("institution"))
#>                                                                                                                                                                      au_affiliation_raw
#> 1                                                                                                                 Department of Astronomy & Astrophysics, University of Toronto, Canada
#> 2                                                                                                                        Lunar and Planetary Laboratory, The University of Arizona, USA
#> 3 Dept. of Physics and Astronomy, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208 and Center for Interdisciplinary Exploration and Research in Astrophysics (CIERA), USA
#>                    institution_id      institution_display_name
#> 1 https://openalex.org/I185261750         University of Toronto
#> 2  https://openalex.org/I58286723 Lunar and Planetary Institute
#> 3 https://openalex.org/I111979921       Northwestern University
#>             institution_ror institution_country_code institution_type
#> 1 https://ror.org/03dbr7087                       CA        education
#> 2 https://ror.org/01r4eh644                       US         facility
#> 3 https://ror.org/000e0be47                       US        education
#>                                                institution_lineage
#> 1                                  https://openalex.org/I185261750
#> 2 https://openalex.org/I1329765538, https://openalex.org/I58286723
#> 3                                  https://openalex.org/I111979921

Created on 2024-09-08 with reprex v2.0.2

The question is, do we want to include affiliations and/or institutions as a list column, such that:

oa_fetch_test1$author[[1]]$affiliations
# [[1]]
# [[1]]$raw_affiliation_string
# [1] "Department of Astronomy & Astrophysics, University of Toronto, Canada"
# 
# [[1]]$institution_ids
# [[1]]$institution_ids[[1]]
# [1] "https://openalex.org/I185261750"
# 
# 
# 
# [[2]]
# [[2]]$raw_affiliation_string
# [1] "Lunar and Planetary Laboratory, The University of Arizona, USA"
# 
# [[2]]$institution_ids
# [[2]]$institution_ids[[1]]
# [1] "https://openalex.org/I58286723"
# 
# [[2]]$institution_ids[[2]]
# [1] "https://openalex.org/I138006243"
# 
# 
# 
# [[3]]
# [[3]]$raw_affiliation_string
# [1] "Dept. of Physics and Astronomy, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208 and Center for Interdisciplinary Exploration and Research in Astrophysics (CIERA), USA"
# 
# [[3]]$institution_ids
# [[3]]$institution_ids[[1]]
# [1] "https://openalex.org/I111979921"

oa_fetch_test1$author[[1]]$institutions
#> [[1]]
#> # A tibble: 1 × 6
#>   id                              display_name          ror                       country_code type      lineage
#>   <chr>                           <chr>                 <chr>                     <chr>        <chr>     <named list>
#> 1 https://openalex.org/I185261750 University of Toronto https://ror.org/03dbr7087 CA           education <list [1]>
#>
#> [[2]]
#> # A tibble: 2 × 6
#>   id                              display_name                  ror                       country_code type      lineage
#>   <chr>                           <chr>                         <chr>                     <chr>        <chr>     <named list>
#> 1 https://openalex.org/I58286723  Lunar and Planetary Institute https://ror.org/01r4eh644 US           facility  <list [2]>
#> 2 https://openalex.org/I138006243 University of Arizona         https://ror.org/03m2x1q45 US           education <list [1]>
#>
#> [[3]]
#> # A tibble: 1 × 6
#>   id                              display_name            ror                       country_code type      lineage
#>   <chr>                           <chr>                   <chr>                     <chr>        <chr>     <named list>
#> 1 https://openalex.org/I111979921 Northwestern University https://ror.org/000e0be47 US           education <list [1]>

What do we think? @yjunechoe @yhan818 What do we want to keep for backward compatibility? (again, I think it's good to keep in mind this change from one institution to more was from OpenAlex, so maybe a breaking change is necessary). Also note that there may be a cost in performance to do all this concatenation when we include everything like the lineage list column above.

According to the documentation:

Each institutional affiliation that this author has claimed will be listed here: the raw affiliation string that we found, along with the OpenAlex Institution ID or IDs that we matched it to. [affiliations] is redundant with [institutions], but is useful if you need to know about what we used to match institutions.
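
For institution-level analysis, the `affiliations` structure can be flattened into one row per (raw string, matched institution ID) pair. The sketch below uses base R and a hand-built list copied from the printed `affiliations` output in this thread; `flatten_affiliations()` and the `author_index` column are hypothetical names, not part of openalexR.

```r
# `affiliations` for the three authors, copied from the printed output
# earlier in this thread (not a live API call).
affiliations <- list(
  list(
    raw_affiliation_string = "Department of Astronomy & Astrophysics, University of Toronto, Canada",
    institution_ids = list("https://openalex.org/I185261750")
  ),
  list(
    raw_affiliation_string = "Lunar and Planetary Laboratory, The University of Arizona, USA",
    institution_ids = list(
      "https://openalex.org/I58286723",
      "https://openalex.org/I138006243"
    )
  ),
  list(
    raw_affiliation_string = "Dept. of Physics and Astronomy, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208 and Center for Interdisciplinary Exploration and Research in Astrophysics (CIERA), USA",
    institution_ids = list("https://openalex.org/I111979921")
  )
)

# Hypothetical helper: one row per (raw string, matched institution ID) pair.
flatten_affiliations <- function(affiliations) {
  rows <- lapply(seq_along(affiliations), function(i) {
    ids <- unlist(affiliations[[i]]$institution_ids)
    # Keep authors whose raw string matched no institution as an NA row.
    if (length(ids) == 0) ids <- NA_character_
    data.frame(
      author_index = i,
      raw_affiliation_string = affiliations[[i]]$raw_affiliation_string,
      institution_id = ids,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

long <- flatten_affiliations(affiliations)

# The second author contributes two rows, one per matched institution.
long[long$author_index == 2, "institution_id"]
```

The long format makes it straightforward to count works per institution without losing the parent organization.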

yhan818 commented 1 month ago

OpenAlex has changed some outputs quite heavily in 2024. It has a new data model and has added new entities (e.g. grants).

In general, maintaining backward compatibility is good practice. For example, it avoids breaking code written against the current openalexR.

Shall we add a new field (e.g. author's affiliations) and leave the old one untouched?

trangdata commented 1 month ago

@yhan818 I agree generally it's good practice to maintain backward compatibility, but we do have to balance that out with other factors like cost of maintenance, computation, complexity, etc. I have shared this view before. To sum up, as a third-party package, I think it's important we try to mirror how OpenAlex changes.

rkrug commented 1 month ago

Keeping up with OpenAlex changes is a moving target, and openalexR will always be running behind. But one could do the following to offer both:

  1. Make the default format of fetch a list, as it is essentially the response coming from OpenAlex, or alternatively raw JSON as returned per page, saved to files (see #271, https://github.com/ropensci/openalexR/issues/271#issuecomment-2338482943). This would always be backward compatible.
  2. Offer functions that convert the list or JSON into a tibble and that need to be called separately. This makes it possible to have backward-compatible functions as well as follow new approaches at a later stage. Also, using the JSON with e.g. DuckDB would not require any conversion.

The problem would be step one, i.e. changing a default value, which will break compatibility, but this could be introduced over a few versions with a deprecation warning.
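
A rough sketch of this two-step design in base R follows. The raw response is mocked by hand, and every function and argument name here (`authors_first_institution()`, `authors_all_institutions()`, `convert_authors()`, `all_institutions`) is hypothetical, chosen only for illustration; none of them exist in openalexR.

```r
# Step 1: the fetch step returns the response untouched (mocked here).
raw_work <- list(
  id = "https://openalex.org/W4401226694",
  authorships = list(
    list(
      author = list(display_name = "Renu Malhotra"),
      institutions = list(
        list(id = "https://openalex.org/I58286723"),
        list(id = "https://openalex.org/I138006243")
      )
    )
  )
)

# Step 2a: hypothetical converter keeping the current behaviour
# (first institution only). Assumes each author has >= 1 institution.
authors_first_institution <- function(work) {
  rows <- lapply(work$authorships, function(a) {
    data.frame(
      au_display_name = a$author$display_name,
      institution_id = a$institutions[[1]]$id,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Step 2b: hypothetical converter exposing all institutions as a list column.
authors_all_institutions <- function(work) {
  out <- data.frame(
    au_display_name = vapply(
      work$authorships, function(a) a$author$display_name, character(1)
    ),
    stringsAsFactors = FALSE
  )
  out$institution_id <- lapply(work$authorships, function(a) {
    vapply(a$institutions, function(x) x$id, character(1))
  })
  out
}

# Deprecation path: warn only when the caller relies on the old default.
convert_authors <- function(work, all_institutions = NULL) {
  if (is.null(all_institutions)) {
    warning(
      "The default will change to all_institutions = TRUE in a future version.",
      call. = FALSE
    )
    all_institutions <- FALSE
  }
  if (all_institutions) {
    authors_all_institutions(work)
  } else {
    authors_first_institution(work)
  }
}

old_style <- convert_authors(raw_work, all_institutions = FALSE)
new_style <- convert_authors(raw_work, all_institutions = TRUE)
```

Callers who pass `all_institutions` explicitly see no warning, so existing code can be pinned to the old shape before the default changes.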

yhan818 commented 1 month ago

I agree with both of you in principle. Given how much OpenAlex is changing, it is not yet mature, so backward compatibility may not be that important. I am fine with either approach.