ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

`oa_snowball` returns `Error in if (is.na(so_info)) NA else so_info[[1]]` when snowballing large number of cites #95

Open TimothyElder opened 1 year ago

TimothyElder commented 1 year ago

When running oa_snowball on all the works that cite one highly cited article, a large number of works are returned and the script takes a long time to run. After returning about 100,000 works, the script fails with this error:

Error in if (is.na(so_info)) NA else so_info[[1]] : 
  argument is of length zero
Calls: oa_snowball -> do.call -> <Anonymous> -> oa2df -> works2df
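For context, this is a base-R gotcha rather than anything OpenAlex-specific: `if ()` raises "argument is of length zero" whenever its condition is zero-length. A plausible reading of the traceback (an assumption, not confirmed from the source) is that `works2df` hits a work whose source/venue info is empty, so `is.na(so_info)` evaluates to `logical(0)`. A minimal reproduction of the base-R behaviour:

```r
# A zero-length value, e.g. a work with no source (venue) information
so_info <- character(0)

is.na(so_info)  # logical(0), not TRUE or FALSE

# `if ()` cannot branch on a zero-length condition and errors out
tryCatch(
  if (is.na(so_info)) NA else so_info[[1]],
  error = function(e) conditionMessage(e)
)
#> "argument is of length zero"
```

A defensive pattern is `if (length(so_info) == 0 || is.na(so_info[[1]])) NA else so_info[[1]]`.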

Looking at the source code, I can't quite make sense of why this error is returned. And I can't think of a more efficient way of returning all the works. Here is how I do it now:

library(openalexR)
library(dplyr)  # for %>%, filter, select

# Returns the citing and cited entities from a focal set of entities
snowball_docs <- oa_snowball(
  identifier = "W2147016542",
  verbose = TRUE,
  is_retracted = FALSE
)

edges <- as.data.frame(snowball_docs$edges)

nodes <- as.data.frame(snowball_docs$nodes)

# Works that cite the focal study
citing_works <- edges %>%
    filter(from != "W2147016542") %>%
    select(from) %>%
    as.vector()

# Node attributes for works that cite focal article
citing_works_df <- nodes %>%
    filter(id %in% citing_works$from)

# return all the articles that cite the citing works
second_docs <- oa_snowball(
  identifier = citing_works_df$id,
  verbose = TRUE,
  cited_by_filter = list(cited_by_count = c(">1000", "<30000"))
)
trangdata commented 1 year ago

Hi @TimothyElder thanks so much for reporting this. 🌱

It is expected that the script takes a while to run, because oa_snowball retrieves all works that cite and are cited by the focal works. When your set of focal works contains over 5,000 works, this can take a very long time, especially if some of these focal works have many citations.

In this particular case, I think you ran out of memory in R. The following query, which finds works cited by a subset of your focal works, somehow results in over 7 GiB of memory used in the session. I'll keep investigating, but I suggest breaking your identifier = citing_works_df$id into small chunks, e.g., identifier = citing_works_df$id[1:10], writing out the results, then combining them all later. Let me know if that works.

library(openalexR)
ids <- c("W2119340816", "W4285719527", "W4211208840", "W4247785462", "W4211082352", "W2163351155", "W4210992155", "W2103903454", "W2549006299", "W2026141069", "W3126128017", "W2145354914", "W2086643853", "W2085458222", "W1988902102", "W2095880617", "W2139524347", "W2109565845", "W2112652525", "W2137200701", "W2144330816", "W2552595635", "W1996710573", "W2051676630", "W1875373156", "W2761242421", "W2134119471", "W2125665528", "W2111285159", "W2147485520", "W2121875608", "W2561425398", "W4238604577", "W2336794604", "W2106742300", "W4211174791", "W1958810146", "W2184779060", "W2169678441", "W1942996532", "W2165335733", "W2098206882", "W2073051214", "W2168197710", "W2017506719", "W2469676206", "W2094905849", "W2099192919", "W2124028388", "W4248178819")
oa_fetch(
  cited_by = ids,
  verbose = TRUE,
  cited_by_count = c(">1000", "<30000")
)
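The chunk-and-combine workflow suggested above could be sketched as follows. This is a sketch under assumptions, not a tested implementation: the chunk size of 10 and the file names are arbitrary choices, and `ids` stands in for your citing_works_df$id vector.

```r
library(openalexR)

# `ids` is your vector of focal identifiers, e.g. citing_works_df$id
# Split it into chunks of 10 works each
chunks <- split(ids, ceiling(seq_along(ids) / 10))

# Snowball each chunk separately and write each result to disk,
# so a crash or out-of-memory error only loses the current chunk
for (i in seq_along(chunks)) {
  res <- oa_snowball(identifier = chunks[[i]], verbose = TRUE)
  saveRDS(res, sprintf("snowball_chunk_%02d.rds", i))
}

# Later: read the chunks back and combine nodes/edges,
# de-duplicating works that appear in more than one chunk
pieces <- lapply(list.files(pattern = "^snowball_chunk_.*\\.rds$"), readRDS)
all_nodes <- unique(do.call(rbind, lapply(pieces, `[[`, "nodes")))
all_edges <- unique(do.call(rbind, lapply(pieces, `[[`, "edges")))
```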
TimothyElder commented 1 year ago

@trangdata Thanks!

I kept working on this and found a solution similar to the one you outlined. Instead of feeding in chunks of the data, I used the citing_filter AND the cited_by_filter (previously I was using just the latter). Like this:

oa_snowball(
  identifier = ids,
  verbose = TRUE,
  citing_filter = list(cited_by_count = c(">500", "<30000")),
  cited_by_filter = list(cited_by_count = c(">500", "<30000")),
  is_retracted = FALSE
)

I then plan to do a few more passes with oa_snowball, returning different combinations of those filters to get the complete network of snowball docs. Not super efficient, but workable.

trangdata commented 1 year ago

@TimothyElder one thing I noticed just now: did you mean for the conditions to be AND instead of OR? i.e., you're looking for works that are cited by more than 500 but fewer than 30,000 other articles? If so, I think the following is what you want (I know, the syntax is strange, with lists containing repeated element names). We need to add more documentation regarding these operators: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists

oa_snowball(
  identifier = ids,
  verbose = TRUE,
  citing_filter = list(cited_by_count = ">500", cited_by_count = "<30000"),
  cited_by_filter = list(cited_by_count = ">500", cited_by_count = "<30000"),
  is_retracted = FALSE
)
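For what it's worth, the "strange syntax" above works because R lists can carry duplicate names. Each repeated cited_by_count entry becomes its own filter clause, and per the OpenAlex filter documentation linked above, separate clauses are combined with AND, while multiple values supplied inside a single clause (the c(">500", "<30000") form) are ORed. A quick illustration of the list structure itself:

```r
# Repeated names are preserved in an R list; each entry
# is sent to OpenAlex as a separate (ANDed) filter clause
f <- list(cited_by_count = ">500", cited_by_count = "<30000")
names(f)
#> [1] "cited_by_count" "cited_by_count"

# By contrast, one entry holding two values is a single clause
g <- list(cited_by_count = c(">500", "<30000"))
length(g)
#> [1] 1
```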
TimothyElder commented 1 year ago

@trangdata Yes!! Very good catch. This was my way of chunking out the process, though now that I look at the code I wrote, I see there are some mistakes. But yes, I meant for the snowball to return only articles that are cited by more than 500 but fewer than 30,000 other articles. I also added the citing_filter with the same parameters, but I now realize that my usage doesn't make sense if I understand the filter correctly.

For my own clarification: the cited_by_filter controls which articles are returned based on the number of times each article is cited by other articles. The citing_filter, on the other hand, controls which articles are returned based on the number of articles the focal article cites (that is, the length of its own bibliography). If that is the case, then my original use of the citing_filter doesn't really make sense, since there are likely no articles that cite more than 500 but fewer than 30,000 other articles.

Sorry in advance if that is confusing; even the OpenAlex documentation is a little confusing about the logical expressions.

trangdata commented 1 year ago

For my own clarification the cited_by_filter is used to control articles that are returned by the number of times that article is cited by other articles. The citing_filter, on the other hand, is used to control the number of articles the focal article cites (that is the length of its own bibliography).

Yes, you're correct @TimothyElder. 💯 Also, we're open to new PRs if you would like to improve the documentation! 🙏🏽 🪴