ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Number of edges from `oa_snowball` doesn't match `cited_by_count` #178

Open TimothyElder opened 1 year ago

TimothyElder commented 1 year ago

When oa_snowball returns all the works that cite a focal article and all the works it cites, the number of rows in the returned edges data frame that point to the focal article should match the focal article's cited_by_count, but they usually do not seem to.

I am trying to figure out whether this is an artifact in the data or whether I have misunderstood precisely what oa_snowball returns.

Here is an example of where I think the edges should match but they don't:

library(openalexR)

focal_article <- oa_fetch(
  entity = "works",
  doi = c("10.1056/nejmoa1000678"),
  verbose = TRUE
)

snowball_docs <- oa_snowball(
  identifier = focal_article$id,
  verbose = TRUE
)

edges <- snowball_docs$edges

id <- stringr::str_replace(focal_article$id, "https://openalex.org/", "")

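# keep only the edges that point to the focal work, i.e. drop the works it cites
# (dplyr::filter is called explicitly because dplyr is not attached above)
edges <- edges |>
  dplyr::filter(to == id)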

# Raise error if edges don't match focal_article citation count
tryCatch({
  if(nrow(edges) != focal_article$cited_by_count) {
      stop("Number of edges doesn't match cited by count of focal article!")
  }
}, error = function(e) {
  cat("An error occurred: ", e$message, "\n")
})

yjunechoe commented 1 year ago

Thanks for the report! Definitely not ideal, but it's likely the same situation also reported in https://github.com/ropensci/openalexR/issues/115

For what it's worth, in my experience with snowball searching, it's pretty common to see mismatches between the cited-by number in a paper's record and its discoverable connections (even within the same database). You can think of the number of citing articles that oa_snowball() returns as an absolute lower-bound estimate of cited-by (one that doesn't account for older papers, retracted papers, inaccessible papers, etc.).

TimothyElder commented 1 year ago

Fortunately, the discrepancy doesn't seem to be very large: fewer than 10 or so per article, by my estimate. Thanks!
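
For anyone who wants to measure the gap rather than raise an error, here is a minimal sketch. The helper citation_gap() is hypothetical (it is not part of openalexR), and it assumes, as the example at the top of the thread does, that the edges data frame stores short IDs without the https://openalex.org/ prefix. The toy edge list uses made-up IDs so the snippet runs without calling the API.

```r
# Hypothetical helper (not part of openalexR): compare the number of edges
# pointing at a focal work with the cited_by_count reported for that work.
citation_gap <- function(edges, focal_id, cited_by_count) {
  # Assumption: `edges$to` holds short IDs, so strip the URL prefix first
  short_id <- sub("https://openalex.org/", "", focal_id, fixed = TRUE)

  # Count the edges whose target is the focal work (works that cite it)
  n_citing <- sum(edges$to == short_id)

  data.frame(
    id = short_id,
    cited_by_count = cited_by_count,
    n_citing_edges = n_citing,
    gap = cited_by_count - n_citing
  )
}

# Toy edge list standing in for snowball_docs$edges (IDs are made up)
toy_edges <- data.frame(
  from = c("W2", "W3", "W4", "W1"),
  to   = c("W1", "W1", "W1", "W5")
)

gap <- citation_gap(toy_edges, "https://openalex.org/W1", cited_by_count = 5)
print(gap)
```

With real data, the equivalent call would be citation_gap(snowball_docs$edges, focal_article$id, focal_article$cited_by_count). A positive gap is consistent with the lower-bound reading described above.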