ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

New metadata fields for work entity #210

Open massimoaria opened 7 months ago

massimoaria commented 7 months ago

@trangdata @yjunechoe OpenAlex has recently added a lot of new metadata to the work entity. In particular, the API now also reports keywords, topics, the grants that funded the research, APCs paid, etc.

At the moment the only way to access this information is to use the "list" format.

TO DO: Modify the works2df() function so that the data frame also includes this new metadata. This way even using the "tibble" or "data.frame" format will output this new metadata.

yjunechoe commented 7 months ago

Good point! And actually as a first step, I think it'd be helpful if we tracked somewhere what fields we already have covered vs. those that are new.

As a naive approach, this lists all fields from output="list" that are not present as columns in output="tibble":

library(openalexR)

tbl <- oa_fetch(id = "W2755950973")
lst <- oa_fetch(id = "W2755950973", output = "list")

sort(names(lst)[!names(lst) %in% colnames(tbl)])
#>  [1] "abstract_inverted_index"       "apc_list"                     
#>  [3] "apc_paid"                      "authorships"                  
#>  [5] "best_oa_location"              "biblio"                       
#>  [7] "cited_by_percentile_year"      "corresponding_author_ids"     
#>  [9] "corresponding_institution_ids" "countries_distinct_count"     
#> [11] "created_date"                  "fulltext_origin"              
#> [13] "has_fulltext"                  "indexed_in"                   
#> [15] "institutions_distinct_count"   "keywords"                     
#> [17] "locations"                     "locations_count"              
#> [19] "mesh"                          "ngrams_url"                   
#> [21] "open_access"                   "primary_location"             
#> [23] "primary_topic"                 "referenced_works_count"       
#> [25] "sustainable_development_goals" "title"                        
#> [27] "topics"                        "type_crossref"                
#> [29] "updated_date"

This of course doesn't mean we're missing coverage for these fields - some of them have been renamed in the df (e.g., authorships), intentionally dropped due to redundancy or low merit (e.g., title), or already covered via other means (e.g., we might not need ngrams_url given that we have the oa_ngrams() interface). But it's hard to distinguish those cases from fields like apc_list which is clearly new and not yet covered.

So as a preliminary, maybe it's worth introducing something to internally track covered fields, like:

#' @keywords internal
covered_fields <- c("title", "authorships", ...)

Then we (or at least I) can get a clearer picture of what we're missing and have a programmatic way to track the introduction of new fields.
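
As a rough sketch of how such a vector might be used: the contents of `covered_fields` and the mock record below are placeholders for illustration only, not the package's actual coverage list, and nothing here calls the API.

```r
# Hypothetical, incomplete list of fields that works2df() already accounts for.
covered_fields <- c("id", "title", "authorships", "ngrams_url")

# Mock stand-in for a record returned by oa_fetch(..., output = "list").
# Only the field names matter here; the values are placeholders.
lst <- list(
  id = "W2755950973",
  title = "placeholder",
  authorships = list(),
  apc_list = list(value = 3680, currency = "USD"),
  apc_paid = list(value = 3680, currency = "USD"),
  ngrams_url = "placeholder"
)

# Fields present in the API response but not yet tracked as covered.
uncovered <- sort(setdiff(names(lst), covered_fields))
uncovered
#> [1] "apc_list" "apc_paid"
```

A check like this could also run in a test, so that a newly introduced API field shows up as a failure rather than going unnoticed.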

I can take a stab at this, then reconvene here to decide how to deal with the new fields? For example, it immediately jumps out to me that apc_paid and apc_list share similar structures - I think it may be worth combining them into a single list column apc of data frames. Ex:

Original:

lst$apc_list
#> $value
#> [1] 3680
#> 
#> $currency
#> [1] "USD"
#> 
#> $value_usd
#> [1] 3680
#> 
#> $provenance
#> [1] "doaj"

lst$apc_paid
#> $value
#> [1] 3680
#> 
#> $currency
#> [1] "USD"
#> 
#> $value_usd
#> [1] 3680
#> 
#> $provenance
#> [1] "doaj"

Formatted:

rbind.data.frame(
  c(type = "list", lst$apc_list),
  c(type = "paid", lst$apc_paid)
)
#>   type value currency value_usd provenance
#> 1 list  3680      USD      3680       doaj
#> 2 paid  3680      USD      3680       doaj
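
To handle works where one or both entries are missing, the same idea could be wrapped in a small helper. This is only a sketch: `combine_apc()` is a hypothetical name, not an existing openalexR function, and the mock record simply mirrors the structure printed above.

```r
# Hypothetical helper (not part of openalexR): combine the two APC entries
# of a single work into one data frame, skipping whichever entry is missing.
combine_apc <- function(work) {
  parts <- list(list = work$apc_list, paid = work$apc_paid)
  parts <- Filter(Negate(is.null), parts)
  if (length(parts) == 0) {
    return(NULL)
  }
  rows <- Map(
    function(type, x) {
      # Replace NULL sub-fields with NA so every column has length one.
      x[vapply(x, is.null, logical(1))] <- NA
      data.frame(type = type, x, stringsAsFactors = FALSE)
    },
    names(parts), parts
  )
  out <- do.call(rbind, unname(rows))
  rownames(out) <- NULL
  out
}

# Mock record with only apc_list present.
work <- list(
  apc_list = list(value = 3680, currency = "USD", value_usd = 3680, provenance = "doaj")
)
apc <- combine_apc(work)
apc
```

Returning NULL when neither entry exists is one possible choice; a zero-row data frame with the same columns might be friendlier for a list column.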

massimoaria commented 7 months ago

I totally agree