ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Display warning for probably missing authors? #234

Closed mariusbommert closed 2 months ago

mariusbommert commented 2 months ago

Hi,

It would be nice to get some kind of warning if it is likely that not all authors of a work have been requested.

For example, the following code returns only 100 authors per work with oa_fetch, although both works have more than 100 authors:

library(openalexR)

ids100 <- c("https://openalex.org/W2789389013", "https://openalex.org/W2811921651")

# Fetch both works in one call: the author lists are capped at 100
fetch100 <- oa_fetch(
  identifier = ids100,
  output = "list",
  entity = "works"
)
unlist(lapply(fetch100, function(x) length(x$authorships)))
# [1] 100 100

# Request each work individually: the full author lists are returned
fetch100_works <- lapply(ids100, function(x) {
  oa_request(
    query_url = oa_query(
      identifier = x,
      entity = "works"
    ),
    count_only = FALSE,
    verbose = FALSE
  )
})
unlist(lapply(fetch100_works, function(x) length(x$authorships)))
# [1] 377 149

A warning could be displayed for works with exactly 100 authors, perhaps with a hint on how to request the remaining authors if all of them are needed, for example by pointing to https://docs.openalex.org/api-entities/authors/limitations or similar.
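Such a check could be sketched roughly as follows. This is only an illustration, not openalexR code: the helper name `flag_possibly_truncated` and the mock records are made up here, and it assumes each work is a list with `id` and `authorships` elements, as in the `output = "list"` results above. Having exactly 100 authors only hints at truncation, it does not prove it.

```r
# Hypothetical helper, not part of openalexR: return the ids of works whose
# author list has exactly `limit` entries, which suggests possible truncation.
flag_possibly_truncated <- function(works, limit = 100L) {
  n_authors <- vapply(works, function(w) length(w$authorships), integer(1))
  ids <- vapply(works, function(w) as.character(w$id), character(1))
  ids[n_authors == limit]
}

# Mock records (made-up ids) standing in for oa_fetch(..., output = "list").
mock_works <- list(
  list(id = "https://openalex.org/W1", authorships = vector("list", 100)),
  list(id = "https://openalex.org/W2", authorships = vector("list", 3))
)
flag_possibly_truncated(mock_works)
```

Only the first mock record is flagged, because it is the only one with exactly 100 authorships.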

Best regards, Marius

trangdata commented 2 months ago

Thanks so much @mariusbommert for raising this issue. I wasn't aware of this truncation. Do you know if it was added recently? It looks like the is_authors_truncated field is only present for certain works. For example, consider this query:

https://api.openalex.org/works?filter=openalex%3AW2741809807%7CW2811921651

Here, is_authors_truncated only appears for the second work, W2811921651.
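Because the field can be absent, any check on it has to treat a missing value as "not truncated". A minimal sketch, assuming the raw record is kept as a list exactly as the API returns it; the helper name `authors_truncated` and the mock records are hypothetical:

```r
# Hypothetical helper, not part of openalexR: TRUE only when a record carries
# is_authors_truncated = TRUE. An absent field (NULL) counts as FALSE.
authors_truncated <- function(works) {
  vapply(works, function(w) isTRUE(w$is_authors_truncated), logical(1))
}

# Mock records with made-up ids: the second one lacks the field entirely.
mock_works <- list(
  list(id = "W_a", is_authors_truncated = TRUE),
  list(id = "W_b")
)
authors_truncated(mock_works)
```

Using `isTRUE()` avoids an error or an `NA` when the field is missing, since `isTRUE(NULL)` is `FALSE`.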

Another concern is that the warnings might overwhelm other messages if a query returns many works with more than 100 authors.

I'll implement something for now, but it seems like the OA team is still working on solidifying this.

mariusbommert commented 2 months ago

That at most 100 authors per work are returned when filtering is not new in OpenAlex. https://docs.openalex.org/api-entities/authors/limitations was last updated 8 months ago. I had not noticed the is_authors_truncated field before, but it is mentioned in the OpenAlex documentation. To get the missing authors, I requested the specific works with exactly 100 authors individually.

You could check which works have truncated authors and display a single warning like "Works W2789389013 and W2811921651 have missing authors. Do ... to request them if you need all authors". A function to request these authors, similar to my code above, might also be helpful.
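One way this could look, as a rough sketch only: a single combined warning plus a helper that requests the affected works one by one. The names `warn_truncated` and `refetch_works` are hypothetical and the warning text is just a placeholder. The default fetcher reuses the `oa_request()` / `oa_query()` call from the first post and is only evaluated when no other fetcher is passed in.

```r
# Hypothetical helper: emit ONE warning listing all affected works,
# instead of one warning per work.
warn_truncated <- function(ids) {
  if (length(ids) == 0L) return(invisible(ids))
  warning(
    "Works with possibly truncated author lists: ",
    paste(ids, collapse = ", "),
    ". Request each work individually to get all authors.",
    call. = FALSE
  )
  invisible(ids)
}

# Hypothetical helper: request each work on its own. `fetch_one` is
# injectable so the logic can be tried without network access.
refetch_works <- function(ids, fetch_one = function(id) {
  openalexR::oa_request(
    query_url = openalexR::oa_query(identifier = id, entity = "works"),
    count_only = FALSE,
    verbose = FALSE
  )
}) {
  lapply(ids, fetch_one)
}

# Offline usage with a stub fetcher that just echoes the id.
stub <- function(id) list(id = id, authorships = list())
full <- refetch_works(c("W2789389013", "W2811921651"), fetch_one = stub)
length(full)
```

Calling `warn_truncated()` with the ids of the flagged works would then produce one message no matter how many works are affected, which addresses the concern about warnings drowning out other messages.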

Getting only 100 authors is not a big problem if you know that authors are missing and how to request them. Author data can get quite large if you download lots of publications with more than 1000 authors. The bigger problem is the following part of the OpenAlex documentation: "This affects filtering as well. So if you filter works using an author ID or ROR, you will not receive works where that author is listed further than 100 places down on the list of authors." For some authors, you might only find a few works because they are usually listed after position 100. I know that this can happen, for example, for physics publications with hundreds of alphabetically sorted authors. OpenAlex seems to plan to fix this part of the issue.