Open TimothyElder opened 1 year ago
Hi @TimothyElder thanks so much for reporting this. 🌱
It is expected that the script takes a while to run, because oa_snowball retrieves all works that cite and are cited by the focal works. When your set of focal works contains over 5,000 works, this can take a very long time, especially if some of those focal works have many citations.
In this particular case, I think you ran out of memory in R. The following query, which finds works cited by a subset of your focal works, somehow results in over 7 GiB of memory used in the session. I'll keep investigating, but I suggest breaking your identifier = citing_works_df$id into small chunks, e.g., identifier = citing_works_df$id[1:10], writing out the results, then combining them all later. Let me know if that works.
library(openalexR)
ids <- c("W2119340816", "W4285719527", "W4211208840", "W4247785462", "W4211082352", "W2163351155", "W4210992155", "W2103903454", "W2549006299", "W2026141069", "W3126128017", "W2145354914", "W2086643853", "W2085458222", "W1988902102", "W2095880617", "W2139524347", "W2109565845", "W2112652525", "W2137200701", "W2144330816", "W2552595635", "W1996710573", "W2051676630", "W1875373156", "W2761242421", "W2134119471", "W2125665528", "W2111285159", "W2147485520", "W2121875608", "W2561425398", "W4238604577", "W2336794604", "W2106742300", "W4211174791", "W1958810146", "W2184779060", "W2169678441", "W1942996532", "W2165335733", "W2098206882", "W2073051214", "W2168197710", "W2017506719", "W2469676206", "W2094905849", "W2099192919", "W2124028388", "W4248178819")
oa_fetch(
  cited_by = ids,
  verbose = TRUE,
  cited_by_count = c(">1000", "<30000")
)
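The chunking workaround suggested above can be sketched as follows. This is a hypothetical, untested sketch: it assumes citing_works_df holds your focal-work IDs in your session, and that oa_snowball returns a list with nodes and edges data frames (as documented for openalexR).

```r
library(openalexR)

# Hypothetical sketch: split the focal IDs into batches of 10,
# snowball each batch separately, write out intermediate results,
# then combine everything at the end.
ids <- citing_works_df$id  # assumed to exist in your session
chunks <- split(ids, ceiling(seq_along(ids) / 10))

results <- lapply(seq_along(chunks), function(i) {
  res <- oa_snowball(identifier = chunks[[i]], verbose = TRUE)
  saveRDS(res, sprintf("snowball_chunk_%03d.rds", i))  # checkpoint each chunk
  res
})

# Combine the per-chunk networks, keeping each work and edge once
nodes <- do.call(rbind, lapply(results, `[[`, "nodes"))
edges <- do.call(rbind, lapply(results, `[[`, "edges"))
nodes <- nodes[!duplicated(nodes$id), ]
edges <- edges[!duplicated(edges), ]
```

Writing each chunk to disk with saveRDS means a crash or out-of-memory error part-way through only costs you the current chunk, not the whole run.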
@trangdata Thanks!
I kept working on this and found a solution similar to the one you outlined. Instead of breaking it up by feeding in chunks of the data, I used the citing_filter AND the cited_by_filter (previously I was just using the latter). Like this:
oa_snowball(
  identifier = ids,
  verbose = TRUE,
  citing_filter = list(cited_by_count = c(">500", "<30000")),
  cited_by_filter = list(cited_by_count = c(">500", "<30000")),
  is_retracted = FALSE
)
Then I plan on doing a few more passes with oa_snowball, where I vary the combinations of those filters to get the complete network of snowballed documents. Not super efficient, but workable.
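The multi-pass idea might look roughly like this. A sketch only: the thresholds and the two complementary passes are illustrative, and it assumes each oa_snowball call returns a nodes/edges list that can be row-bound.

```r
library(openalexR)

# Hypothetical sketch: two passes with complementary citation-count
# ranges, merged into one network. Threshold values are illustrative.
pass_high <- oa_snowball(
  identifier = ids,
  citing_filter = list(cited_by_count = ">500"),
  cited_by_filter = list(cited_by_count = ">500")
)
pass_low <- oa_snowball(
  identifier = ids,
  citing_filter = list(cited_by_count = "<500"),
  cited_by_filter = list(cited_by_count = "<500")
)

# Merge the two networks, keeping each work once
nodes <- rbind(pass_high$nodes, pass_low$nodes)
nodes <- nodes[!duplicated(nodes$id), ]
edges <- unique(rbind(pass_high$edges, pass_low$edges))
```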
@TimothyElder one thing I noticed just now: did you mean for the conditions to be AND instead of OR? i.e., you're looking for works that are cited by more than 500 but fewer than 30,000 other articles? If so, I think the following is what you want (I know, the syntax is strange, with lists containing same-named elements). We need to add more documentation regarding these operators: https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists
oa_snowball(
  identifier = ids,
  verbose = TRUE,
  citing_filter = list(cited_by_count = ">500", cited_by_count = "<30000"),
  cited_by_filter = list(cited_by_count = ">500", cited_by_count = "<30000"),
  is_retracted = FALSE
)
@trangdata Yes!! Very good catch. This was my way of chunking out the process, though now that I look at the code I wrote, I see that there are some mistakes. But yes, I meant for the snowball to return only articles that are cited by more than 500 but fewer than 30,000 other articles. I also added the citing_filter with the same parameters, but now realize that my use doesn't make any sense, if I understand the filter correctly.
For my own clarification: the cited_by_filter controls which articles are returned based on the number of times each is cited by other articles. The citing_filter, on the other hand, controls the number of articles the focal article cites (that is, the length of its own bibliography). If that is the case, then my original use of the citing_filter doesn't really make sense, since there are likely no articles that cite more than 500 but fewer than 30,000 other articles.
Sorry in advance if that is confusing; the documentation, even on OpenAlex, is a little confusing about the logical expressions.
For my own clarification: the cited_by_filter controls which articles are returned based on the number of times each is cited by other articles. The citing_filter, on the other hand, controls the number of articles the focal article cites (that is, the length of its own bibliography).
Yes, you're correct @TimothyElder. 💯 Also, we're open to new PRs if you would like to improve the documentation! 🙏🏽 🪴
When running oa_snowball on all the works that cite one highly cited article, a large number of works are returned and the script takes a long time to run. After returning about 100,000 works, the script returns an error. Looking at the source code, I can't quite make sense of why this error is returned, and I can't think of a more efficient way of returning all the works. Here is how I do it now: