ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Obtaining count by year for more than 10 years? #107

Closed JeffreySmithA closed 1 year ago

JeffreySmithA commented 1 year ago

I'm trying to count citations by year for authors and publications over a period longer than 10 years. The count by year only goes back 10 years. Is there an easy way to do this? Currently, I'm thinking the easiest way may be to use the snowball option, but I wanted to ask here first.

trangdata commented 1 year ago

Hi @JeffreySmithA, one way to do this is simply to use oa_fetch to find all works by your group of authors, then summarise the cited_by_count column.

For example, say I want to find the citation counts over the years of Emmanuelle Charpentier and Jennifer Doudna, I can do:

library(openalexR)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

dat <- oa_fetch("works", author.id = c("A4358797308", "A2163159272")) |> 
  unnest(author) |> 
  filter(grepl("A4358797308|A2163159272", au_id)) |> 
  group_by(au_id, publication_year) |> 
  summarise(cited_by_count = sum(cited_by_count), .groups = "drop")
dat
#> # A tibble: 59 × 3
#>    au_id                            publication_year cited_by_count
#>    <chr>                                       <int>          <int>
#>  1 https://openalex.org/A2163159272             1993             92
#>  2 https://openalex.org/A2163159272             1995            145
#>  3 https://openalex.org/A2163159272             1998            111
#>  4 https://openalex.org/A2163159272             1999            420
#>  5 https://openalex.org/A2163159272             2000            364
#>  6 https://openalex.org/A2163159272             2002              2
#>  7 https://openalex.org/A2163159272             2003            101
#>  8 https://openalex.org/A2163159272             2004            438
#>  9 https://openalex.org/A2163159272             2007             12
#> 10 https://openalex.org/A2163159272             2008            296
#> # … with 49 more rows

Created on 2023-05-09 with reprex v2.0.2

JeffreySmithA commented 1 year ago

This is amazing!! Thank you so much, this saved me a lot of time.

Final question. Do you know how to do this at the paper level? I have a specific work with its work ID, e.g. W2010555999 (random number), and I'm still struggling to adapt your solution to get all of the citations to that paper.

Thanks in advance!

massimoaria commented 1 year ago

@trangdata's proposed solution provides, for each year of publication, the total citations received to date by papers published in that year. The information stored by OpenAlex in the counts_by_year field, on the other hand, counts how many citations an author received in each calendar year. The two measures are very different. For example, suppose author X publishes two articles (Y and Z):

Work Y, publication year 2020, counts_by_year: 2020: 2 citations; 2021: 20 citations; 2022: 35 citations. The cited_by_count field will be equal to the sum, 57.

Work Z, publication year 2021, counts_by_year: 2021: 3 citations; 2022: 8 citations. The cited_by_count field will be equal to the sum, 11.

Now if we calculate the total counts_by_year of author X (considering both works), the author will have been cited 2 times in 2020, 23 times in 2021, and 43 times in 2022.

This is the calculation that OpenAlex performs when it computes the counts_by_year field.

If we use Trang's solution instead, the result for author X will be: 2020: 57; 2021: 11.
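The difference between the two measures can be reproduced with the toy numbers above. This is only an illustrative sketch in base R: the data frame `citations` and its column names are made up for the example and are not OpenAlex or openalexR fields.

```r
# Toy data for works Y and Z: one row per (work, year in which citations were received).
# All names here are hypothetical and exist only for this illustration.
citations <- data.frame(
  work             = c("Y", "Y", "Y", "Z", "Z"),
  publication_year = c(2020, 2020, 2020, 2021, 2021),
  citation_year    = c(2020, 2021, 2022, 2021, 2022),
  n_citations      = c(2, 20, 35, 3, 8)
)

# Measure 1: citations received in each calendar year
# (the quantity that counts_by_year reports).
by_citation_year <- aggregate(n_citations ~ citation_year,
                              data = citations, FUN = sum)

# Measure 2: total citations to date, grouped by the publication year
# of the cited work (what summing cited_by_count per publication_year gives).
by_publication_year <- aggregate(n_citations ~ publication_year,
                                 data = citations, FUN = sum)

print(by_citation_year)     # 2020: 2, 2021: 23, 2022: 43
print(by_publication_year)  # 2020: 57, 2021: 11
```

Both tables sum to the same grand total (68), but they distribute it over different years, which is why one cannot be substituted for the other.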

Unfortunately, the counts_by_year vector cannot easily be obtained beyond ten years at present. The only possible, but extremely expensive, solution is to perform an author snowball search and count the works citing author X year by year.
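A rough sketch of that expensive approach is shown below, under several assumptions. The helper `count_citations_by_year` and its `fetch_citing` argument are hypothetical names introduced here, not part of openalexR. The sketch treats the publication year of each citing work as the year of the citation and counts every (citing work, cited work) pair once, which may not match exactly how OpenAlex computes its own field. The commented usage relies only on the `author.id` and `cites` filters that appear elsewhere in this thread.

```r
# Count citations per year for a set of works.
# `work_ids`     : character vector of work IDs (e.g. "W2160237763").
# `fetch_citing` : function(id) returning a data frame of citing works with a
#                  `publication_year` column, or NULL / zero rows if none.
count_citations_by_year <- function(work_ids, fetch_citing) {
  years <- unlist(lapply(work_ids, function(id) {
    citing <- fetch_citing(id)
    # NROW(NULL) is 0, so this covers both NULL and empty results.
    if (NROW(citing) == 0) integer(0) else citing$publication_year
  }))
  table(years)
}

# Usage with openalexR (needs network access and one query per work,
# so it can be very slow for prolific authors; not run here):
# library(openalexR)
# works <- oa_fetch(entity = "works", author.id = "A2163159272")
# ids <- sub("^https://openalex.org/", "", works$id)
# count_citations_by_year(
#   ids,
#   function(id) oa_fetch(entity = "works", cites = id)
# )
```

Passing the fetching step in as a function keeps the counting logic separate from the API calls, so it can be checked on small made-up inputs before running it against OpenAlex.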

trangdata commented 1 year ago

Hi @JeffreySmithA, please see @massimoaria's answer above for the difference between what OpenAlex reports as counts_by_year (the number of citations an author receives in a given year) and what my earlier code calculated (the total citations to an author's publications, grouped by publication year).

That said, if you want to find all works that cite a specific paper, say, W2160237763, you can use the cites filter then count the number of citations by year:

library(openalexR)
dat2 <- oa_fetch("works", cites = "W2160237763") |>
  dplyr::count(publication_year)
dat2
#> # A tibble: 18 × 2
#>    publication_year     n
#>               <int> <int>
#>  1             2005     5
#>  2             2006     5
#>  3             2007    11
#>  4             2008     9
#>  5             2009     6
#>  6             2010    13
#>  7             2011     8
#>  8             2012    12
#>  9             2013    15
#> 10             2014    13
#> 11             2015    11
#> 12             2016     5
#> 13             2017     6
#> 14             2018     8
#> 15             2019     4
#> 16             2020     9
#> 17             2021     6
#> 18             2022     5

Created on 2023-05-09 with reprex v2.0.2

JeffreySmithA commented 1 year ago

Thank you both very much. Incredibly helpful!