trangdata closed this issue 2 years ago
In bibliometrics, a snowball search means identifying a target work and, starting from it, downloading all items citing it and all items cited by it. So are we thinking about a function that automatically creates this kind of query (all citing works and all cited works from an ID) and downloads the metadata? Yes, I think it could be interesting.
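As a rough illustration of the kind of query such a function would build, here is a minimal base R sketch. It assumes the OpenAlex works endpoint accepts the filters cites and cited_by (both come up later in this thread); snowball_query_urls is a hypothetical helper, not part of openalexR, and it only constructs the URLs without sending any request.

```r
# Hypothetical helper: build the two query URLs for one target work.
# Assumption: `cites:<id>` returns works citing <id>, and
# `cited_by:<id>` returns works that <id> cites.
snowball_query_urls <- function(id, endpoint = "https://api.openalex.org/") {
  stopifnot(is.character(id), length(id) == 1)
  c(
    citing = paste0(endpoint, "works?filter=cites:", id),
    cited  = paste0(endpoint, "works?filter=cited_by:", id)
  )
}

urls <- snowball_query_urls("W2741809807")
print(urls)
```

Fetching and paging through the results is left out on purpose; the point is only that a snowball search from one ID reduces to two filtered queries.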
But I also have another idea. The most popular bibliographic databases, Scopus and Web of Science, allow downloading full metadata about the references of each work. At the moment, OA returns only a list of identifiers as cited-reference metadata. It could be interesting to allow users to download more info about cited references, for example a record with: Authors, Title, Journal, Publication Year, and DOI. In this way, the data could be used to perform co-citation analysis, historiograph analysis, etc. as implemented in bibliometrix, VOSviewer, etc. We could add an argument (e.g. bibliography = c(TRUE, FALSE)) to the functions oa_fetch and oa_request.
@trangdata I have just added a commit that introduces the function oa_snowball(). oa_snowball fetches all documents citing and cited by one or more target identifiers. It returns a tibble including all documents. I added a column "role" that reports the role of each document included in the tibble ("citing", "cited", "target").
@massimoaria Amazing work, thank you! I have been swamped this week, but I'm excited to try it out soon! Please feel free to close the issue.
I would like to work also on the "other ideas". So for the moment, I prefer to leave the issue open.
@trangdata Have you tried the oa_snowball function? What do you think about it? Remarks? Ideas?
@massimoaria I will try to do this in the next week. All these concepts are still very new to me so it will take a while, but I'll get there!
After exploring oa_snowball a bit, here's what I'm thinking:
Should we return a list of 2 elements, the first being a data frame with columns from, to (a row A, B means A cites B), where each element is an OpenAlex ID? We should be cautious of works that cite multiple input identifiers at the same time (this should be captured in the output list's first element). This would also make it easier to perform network analyses on the result (or perhaps co-citation analysis, historiograph analysis, etc. as implemented in bibliometrix, VOSviewer, etc. as you mentioned).
What do you think? @massimoaria @yjunechoe
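To make the proposed edge list concrete, here is a small base R sketch with made-up IDs (W1 and W2 stand in for input identifiers; nothing here is looked up from OpenAlex). It shows how a work that cites several inputs appears in more than one edge but only once in a table of unique works.

```r
# Toy edge list: a row (from, to) means `from` cites `to`
edges <- data.frame(
  from = c("W10", "W10", "W11", "W1"),
  to   = c("W1",  "W2",  "W1",  "W20"),
  stringsAsFactors = FALSE
)
input_ids <- c("W1", "W2")  # assumed input identifiers

# Works that cite at least one input, and how many inputs each one cites
citing_edges <- edges[edges$to %in% input_ids, ]
n_inputs_cited <- table(citing_edges$from)
multi_citing <- names(n_inputs_cited)[n_inputs_cited > 1]

# One row per work, however many edges it takes part in
works <- data.frame(
  id = unique(c(edges$from, edges$to)),
  stringsAsFactors = FALSE
)

print(multi_citing)
print(works)
```

Keeping the multiplicity in the edge list and the uniqueness in the works table is what lets the two elements be joined later without duplicating metadata rows.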
Also, I'm not seeing that we use "cited_by" as an OA filter, only "cites". Should we check on this?
@trangdata following your remark, I have just modified the function, which now returns a list of two elements.
The function code has also been modified to be more efficient.
@trangdata I think that's the way to go as well! When I implemented snowball output on Snowglobe, all information from the snowball search was returned as a single dataframe, where it was a cbind of paper metadata, and columns for connection information, kind of like this:
id | title | ... | n_cites | cites | n_cited_by | cited_by |
---|---|---|---|---|---|---|
100 | foo | ... | 2 | 98, 99 | 1 | 101 |
200 | bar | ... | 2 | 198, 199 | 2 | 201, 202 |
300 | wug | ... | 4 | 296, 297, 298, 299 | 2 | 301, 302 |
This was an ugly format, though I will say that this was actually the preferred format for the paper screening team (who'd work together on google sheets, highlighting rows to indicate paper inclusion/exclusion). With this 1-table format they could prioritize screening papers with high citations/references by sorting on the n_* columns, and they could quickly see what other papers a paper has connections to with a simple Ctrl+F on the comma-separated cites and cited_by columns. Perhaps something like this could also be another "flattened" format that's offered?
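A flattened table of this shape can be derived from an edge list in base R. The sketch below uses toy data and hypothetical helper names (count_ids, collapse_ids); it is not how Snowglobe or openalexR builds its output, only one way to get the n_cites, cites, n_cited_by and cited_by columns described above.

```r
# Toy inputs: a row (from, to) means `from` cites `to`
edges <- data.frame(
  from = c("A", "B", "B", "D"),
  to   = c("B", "C", "E", "B"),
  stringsAsFactors = FALSE
)
nodes <- data.frame(id = c("A", "B", "C", "D", "E"), stringsAsFactors = FALSE)

# Number of edges in which each id appears in `keys`
count_ids <- function(keys, ids) {
  vapply(ids, function(i) sum(keys == i), integer(1), USE.NAMES = FALSE)
}

# Comma-separated partner ids of each id ("" when there are none)
collapse_ids <- function(keys, values, ids) {
  vapply(
    ids,
    function(i) paste(values[keys == i], collapse = ", "),
    character(1),
    USE.NAMES = FALSE
  )
}

flat <- nodes
flat$n_cites    <- count_ids(edges$from, nodes$id)
flat$cites      <- collapse_ids(edges$from, edges$to, nodes$id)
flat$n_cited_by <- count_ids(edges$to, nodes$id)
flat$cited_by   <- collapse_ids(edges$to, edges$from, nodes$id)

print(flat)
```

Sorting flat by n_cites or n_cited_by then gives the screening order mentioned above, and the string columns stay searchable in a spreadsheet.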
And thanks @massimoaria for implementing this! I just tested it out and have three thoughts on the current output format:
1) If the idea is to have $data store only the metadata of the papers involved in the search, should it contain unique entries for papers? For example, the snowball search from the docs has a duplicate entry for a paper in $data which only varies in the value of the role column, and this information can be recovered from the $relationships dataframe:
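library(dplyr)
# Keep every row whose id occurs more than once (%in% also works when several ids are duplicated)
dups <- snowball_docs$data %>%
  filter(id %in% id[duplicated(id)])
unique(dups$id)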
#> [1] "https://openalex.org/W2785823074"
waldo::compare(dups[1,], dups[2,])
#> `old$role`: "citing"
#> `new$role`: "cited"
2) Currently, the IDs in from/to and id are not exact matches because of the gsub(). I agree that it's redundant if we keep paper ids as the full url in $relationships, but this does make it difficult to do relational data operations between relationships and data later. No clean solution either way but this was the first thing that came to mind:
3) Kind of related to (2), one such relational data operation that someone like me might want to do is to convert it to a <tbl_graph> object from {tidygraph} for more serious network analysis/visualization. I think it'd be nice if the elements of the output of oa_snowball() followed the standardized names "edges" and "nodes", so it can be directly fed into tidygraph::as_tbl_graph() (another alternative is "vertices" and "links") - hopefully this is a trivial request!
For example, here's a short network analysis workflow I'm imagining with the output of oa_snowball(), addressing points 1-3 explicitly in the data wrangling here:
library(openalexR)
library(dplyr)
library(tidygraph)
options(pillar.print_min = 3)
# Example from the docs
snowball_docs <- oa_snowball(
  identifier = c("W2741809807", "W2755950973"),
  endpoint = "https://api.openalex.org/",
  verbose = TRUE
)
snowball_docs
#> $relationships
#> # A tibble: 2,133 × 2
#> from to
#> <chr> <chr>
#> 1 W3160856016 W2755950973
#> 2 W3036495543 W2755950973
#> 3 W3123854369 W2741809807
#> # … with 2,130 more rows
#>
#> $data
#> # A tibble: 2,132 × 27
#> id displ…¹ author ab publi…² relev…³ so so_id publi…⁴ issn url first…⁵ last_…⁶ volume issue is_oa cited…⁷
#> <chr> <chr> <list> <chr> <chr> <lgl> <chr> <chr> <chr> <lis> <chr> <chr> <chr> <chr> <chr> <lgl> <int>
#> 1 https://… How to… <df> Bibl… 2021-0… NA Jour… http… Elsevi… <chr> http… 285 296 133 <NA> TRUE 436
#> 2 https://… Impact… <df> The … 2020-0… NA Anna… http… Spring… <chr> http… <NA> <NA> <NA> <NA> TRUE 267
#> 3 https://… Data‐D… <df> Data… 2019-0… NA Adva… http… Wiley <chr> http… 1900808 1900808 6 21 TRUE 178
#> # … with 2,129 more rows, 10 more variables: counts_by_year <list>, publication_year <int>, cited_by_api_url <chr>,
#> # ids <list>, doi <chr>, type <chr>, referenced_works <list>, related_works <list>, concepts <list>, role <chr>, and
#> # abbreviated variable names ¹display_name, ²publication_date, ³relevance_score, ⁴publisher, ⁵first_page, ⁶last_page,
#> # ⁷cited_by_count
snowball_docs_formatted <- snowball_docs
# Point #1) Remove duplicated paper id in metadata dataframe
snowball_docs_formatted$data <- snowball_docs_formatted$data %>%
  filter(!duplicated(id))
# Point #2) Turn `to` and `from` columns into workable keys
snowball_docs_formatted$relationships <- snowball_docs_formatted$relationships %>%
  mutate(across(c(from, to), \(x) paste0("https://openalex.org/", x)))
# Point #3) Use standardized names
names(snowball_docs_formatted) <- c("edges", "nodes")
# Graph conversion with {tidygraph}
( snowball_graph <- tidygraph::as_tbl_graph(snowball_docs_formatted) )
#> # A tbl_graph: 2131 nodes and 2133 edges
#> #
#> # A bipartite simple graph with 1 component
#> #
#> # Node Data: 2,131 × 27 (active)
#> id displa… author ab public… releva… so so_id publis… issn url first_… last_p… volume issue is_oa cited_…
#> <chr> <chr> <list> <chr> <chr> <lgl> <chr> <chr> <chr> <lis> <chr> <chr> <chr> <chr> <chr> <lgl> <int>
#> 1 http… How to… <df> Bibl… 2021-0… NA Jour… http… Elsevi… <chr> http… 285 296 133 <NA> TRUE 436
#> 2 http… Impact… <df> The … 2020-0… NA Anna… http… Spring… <chr> http… <NA> <NA> <NA> <NA> TRUE 267
#> 3 http… Data‐D… <df> Data… 2019-0… NA Adva… http… Wiley <chr> http… 1900808 1900808 6 21 TRUE 178
#> 4 http… Assess… <df> Asse… 2018-0… NA PLOS… http… Public… <chr> http… e20040… e20040… 16 3 TRUE 174
#> 5 http… Conduc… <df> Lite… 2020-0… NA Aust… http… SAGE <chr> http… 175 194 45 2 TRUE 161
#> 6 http… Artifi… <df> Abst… 2020-1… NA Jour… http… Elsevi… <chr> http… 283 314 121 <NA> FALSE 128
#> # … with 2,125 more rows, and 10 more variables: counts_by_year <list>, publication_year <int>, cited_by_api_url <chr>,
#> # ids <list>, doi <chr>, type <chr>, referenced_works <list>, related_works <list>, concepts <list>, role <chr>
#> #
#> # Edge Data: 2,133 × 2
#> from to
#> <int> <int>
#> 1 1 2130
#> 2 2 2130
#> 3 3 2131
#> # … with 2,130 more rows
# Example workflow: subsetting the graph to only include information about mutual citations
snowball_graph %>%
  activate("edges") %>%
  filter(edge_is_mutual()) %>%
  activate("nodes") %>%
  filter(!node_is_isolated()) %>%
  select(id, display_name, publication_year)
#> # A tbl_graph: 2 nodes and 2 edges
#> #
#> # A directed simple graph with 1 component
#> #
#> # Node Data: 2 × 3 (active)
#> id display_name publication_…
#> <chr> <chr> <int>
#> 1 https://openalex.org/W2785823074 Sci-Hub provides access to nearly all scholarly literature 2018
#> 2 https://openalex.org/W2741809807 The state of OA: a large-scale analysis of the prevalence and impact of O… 2018
#> #
#> # Edge Data: 2 × 2
#> from to
#> <int> <int>
#> 1 1 2
#> 2 2 1
Again, thank you so much to both of you for working on the snowball search!
Thank you so much @yjunechoe for this detailed feedback! I've submitted #22 to address these points, but happy to further brainstorm if the solutions are not optimal.
- A helper, to_disk, that takes the output of oa_snowball and flattens it. It basically joins the nodes element with a cited_by column for the input identifiers (yielding NA for other documents). I wanted to keep it as a separate helper function to avoid a direct dependency on dplyr. What do you think?
- role can be inferred from edges, but I also agree with @massimoaria and think it's nice to have that column, especially to indicate "target", our input document. So, for articles with both a citing and a cited role, I put down "both". What do you both think? And is "target" a good name?
- id_type = c("short", "original") as an argument for oa_snowball, so the IDs between nodes and edges always match.
Thanks for the quick turnaround! Addressing points in order of least to most complex:
RE: 0; Actually I misremembered some key details of the flattened output (sorry!). Let me think and try again from scratch - I'll move this out to a new issue/PR since it's a separate post-processing step. I'll also see if I can implement everything in base R.
RE: 2 & 3; they look great! I like the "short" default. Just one thought: when id_type = "short", should the id column in the returned dataframe be renamed to short_id? I'm leaning towards "keep the output consistent" which is how it is now, but just wondering whether people might get confused b/c we say short_id in the output of show_works()
Rest is also good!
RE: 1; I just realized I've actually been totally misinterpreting the edges dataframe!!! I thought from and to were citation directions, not search directions. I was under the impression that something like below says W1 cites W2 (citation from W1 to W2), when instead it just means W2 was searched from W1, without committing to whether W1 cites W2 or W1 is cited by W2.
from | to |
---|---|
W1 | W2 |
Hmm I'm not sure if I understand your point here. What is a "search" direction? I think your original interpretation of edges was correct. This would mean W1 cites W2. All of these edges would have either from or to be one of the input identifiers. Am I also misinterpreting this?
> RE: 2 & 3; they look great! I like the "short" default. Just one thought: when id_type = "short", should id column in the returned dataframe be renamed to short_id? I'm leaning towards "keep the output consistent" which is how it is now, but just wondering whether people might get confused b/c we say short_id in the output of show_works()
Great point. I went ahead and changed short_id to id in show_works(), because technically these are still valid openalex ids. They're just short-form. I explained this in the documentation for show_works() and show_authors().
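Since the short and full-URL forms refer to the same work, converting between them is cheap in either direction. A minimal base R sketch, assuming the full form is always the short ID prefixed with https://openalex.org/ (as in the output shown earlier); to_short_id and to_long_id are hypothetical names, not openalexR functions.

```r
oa_prefix <- "https://openalex.org/"

# Strip the prefix when present; leave already-short ids untouched
to_short_id <- function(x) {
  ifelse(startsWith(x, oa_prefix), substring(x, nchar(oa_prefix) + 1), x)
}

# Add the prefix when missing; leave full URLs untouched
to_long_id <- function(x) {
  ifelse(startsWith(x, oa_prefix), x, paste0(oa_prefix, x))
}

# Toy data: nodes keyed by full URL, edges keyed by short id
nodes <- data.frame(
  id = c("https://openalex.org/W1", "https://openalex.org/W2"),
  title = c("foo", "bar"),
  stringsAsFactors = FALSE
)
edges <- data.frame(from = "W1", to = "W2", stringsAsFactors = FALSE)

# After normalising one side, the keys line up for joins
nodes$id <- to_short_id(nodes$id)
keys_match <- all(c(edges$from, edges$to) %in% nodes$id)
print(keys_match)
```

Both helpers are idempotent, so they can be applied defensively without checking which form a column currently holds.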
> Hmm I'm not sure if I understand your point here. What is a "search" direction? I think your original interpretation of edges was correct. This would mean W1 cites W2. All of these edges would have either from or to be one of the input identifiers. Am I also misinterpreting this?
Ah you're right I mixed that up 🤦♂️ - but yay that means less things to worry about!
We can come back to the distinction between "search" vs. "citation" direction, but they're just different representations of the same data. So if the input to snowball was paper B and our edges data looks like this:
from | to |
---|---|
A | B |
B | C |
Then edges representing citation directions would look like this, where arrow = direction of citation:
graph LR;
A-->B
B-->C
While edges representing search directions would look like this, where arrow = direction of discovery from input paper, and you'd separately encode cites/cited_by (e.g. like in linetype here):
graph LR;
B-. forward .->A
B-- backward -->C
If you're only doing a one-off snowball search, I don't think there's much difference, but if you want to diagnose snowball searches that run multiple iterations, then search directions can be useful!
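The two representations can be converted mechanically. Here is a base R sketch using the same toy example (input paper B, citation edges A -> B and B -> C); to_search_edges is a hypothetical helper and assumes every edge has at least one endpoint among the input identifiers.

```r
# Citation edges: a row (from, to) means `from` cites `to`
citation_edges <- data.frame(
  from = c("A", "B"),
  to   = c("B", "C"),
  stringsAsFactors = FALSE
)
input_ids <- "B"

# Re-orient each edge so it starts at the input paper, and record whether
# the other paper was found among its references ("backward") or among
# the works citing it ("forward").
# Note: an edge between two inputs is treated as "backward" from the citing one.
to_search_edges <- function(edges, input_ids) {
  from_is_input <- edges$from %in% input_ids
  data.frame(
    from = ifelse(from_is_input, edges$from, edges$to),
    to = ifelse(from_is_input, edges$to, edges$from),
    direction = ifelse(from_is_input, "backward", "forward"),
    stringsAsFactors = FALSE
  )
}

search_edges <- to_search_edges(citation_edges, input_ids)
print(search_edges)
```

No information is lost in either direction: flipping the "forward" rows back recovers the citation edges.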
Thank you @yjunechoe for the explanation! 💯 🤩 And I didn't know about mermaid!!! 🤯
So, couple last things I'd like some input on: @massimoaria @yjunechoe
- Right now, nodes$role are citing, cited, both, and target. Are these the best levels? Or should we make it clearer: "cites target", "cited by target", "both", "target"? Any other ideas?
- Is "from A to B means A cites B" a good convention? Or should it be the other way around?
Sorry for my absence but this semester is really busy for me. The "relationships" object is an edge matrix in which the directional link A (from) -> B (to) means that A cites B. This is the only concept that makes sense if you want to use this information to create a graph of direct citations (e.g., Historiograph as implemented in bibliometrix).
Regarding the roles stored in nodes$role, I think that the "both" level is unnecessary because a scientific publication (e.g., A) can only cite previously published papers (e.g., B and C). B and C will not be able to cite A because the latter did not exist on the date they were published.
> - Right now, nodes$role are citing, cited, both, and target. Are these the best levels? Or should we make it clearer: "cites target", "cited by target", "both", "target"? Any other ideas?
This also has me kind of stumped, especially if there are multiple input papers. But as long as it's a factor, the docs can clarify what each level means.
> - Is "from A to B means A cites B" a good convention? Or should it be the other way around?
I think this makes more sense than the other way around! I suppose the documentation could emphasize this point more but it's not difficult to me
> Regarding the roles stored in nodes$role, I think that the "both" level is unnecessary because a scientific publication (e.g., A) can only cite previously published papers (e.g., B and C). B and C will not be able to cite A because the latter did not exist on the date they were published.
I think we need to be careful referencing "publication_date", since there may be an online early release date (or similar) and there may be a final date later (usually months after). If this paper is cited by another article that gets published just before this final date, it's going to be difficult to infer the citation direction. And by "both" here, I only mean that, in a particular query, this one article A is cited by (at least) one of the identifiers (input articles) and A also cites another identifier. Does that make sense? Or should we use a completely different term to indicate "not target"? target and neighbor? old and new? source and peripheral? hub and other? anything else?
I'm also okay with removing this column altogether, although I still think "role" is helpful in flagging the original identifiers. Another idea is to make a boolean column named "input" or "provided", TRUE if the work is in the original list of identifiers, else FALSE.
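For reference, both the role levels and the boolean alternative can be derived from the edge list plus the input identifiers. The base R sketch below uses toy IDs (T1 and T2 are the assumed inputs) and is only an illustration of the definitions discussed here, not the package's implementation.

```r
# Toy edge list: a row (from, to) means `from` cites `to`
edges <- data.frame(
  from = c("X",  "T2", "Y",  "T1"),
  to   = c("T1", "X",  "T2", "Z"),
  stringsAsFactors = FALSE
)
input_ids <- c("T1", "T2")  # assumed input identifiers

ids <- unique(c(edges$from, edges$to))

# A work "cites a target" if it is the source of an edge into an input,
# and is "cited by a target" if it is the destination of an edge from one
cites_target    <- ids %in% edges$from[edges$to %in% input_ids]
cited_by_target <- ids %in% edges$to[edges$from %in% input_ids]

role <- ifelse(
  cites_target & cited_by_target, "both",
  ifelse(cites_target, "citing", "cited")
)
role[ids %in% input_ids] <- "target"  # inputs override any other role

nodes <- data.frame(
  id = ids,
  role = role,
  oa_input = ids %in% input_ids,  # boolean alternative to the role column
  stringsAsFactors = FALSE
)
print(nodes)
```

Here X ends up as "both" because it cites T1 and is cited by T2, which is the multi-input situation described above and does not depend on publication dates.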
> Regarding the roles stored in nodes$role, I think that the "both" level is unnecessary because a scientific publication (e.g., A) can only cite previously published papers (e.g., B and C). B and C will not be able to cite A because the latter did not exist on the date they were published.
> I think we need to be careful referencing "publication_date", since there may be an online early release date (or similar) and there may be a final date later (usually months after). If this paper is cited by another article that gets published just before this final date, it's going to be difficult to infer the citation direction. And by "both" here, I only mean that, in a particular query, this one article A is cited by (at least) one of the identifiers (input articles) and A also cites another identifier. Does that make sense? Or should we use a completely different term to indicate "not target"? target and neighbor? old and new? source and peripheral? hub and other? anything else?
Yes, you are right. This is a good point.
> I'm also okay with removing this column altogether, although I still think "role" is helpful in flagging the original identifiers.
> I think this makes more sense than the other way around! I suppose the documentation could emphasize this point more but it's not difficult to me
Could you explain why this is? @yjunechoe In my head, A -> B means A points up to B, so B is the ancestor/A cites B.
Yes, A -> B means A cites B, so B is the ancestor. So, in bibliometrics, the "citation action" goes from A to B. That is the information we report in the relationships matrix.
> Could you explain why this is? @yjunechoe In my head, A -> B means A points up to B, so B is the ancestor/A cites B.
Oh I was agreeing with you that "from A to B" means "A cites B" to me! Like "there is a citation from A to B"
> Oh I was agreeing with you that "from A to B" means "A cites B" to me! Like "there is a citation from A to B"
Oops my bad, I read your comment without "than", which reverses its meaning. 🙃
Back to the role column: Instead, what do you both think about a boolean column named "oa_input" or ..?, TRUE if the work is in the original list of identifiers, else FALSE.
Resolved in #22. Remaining discussion moved to #23.
After a discussion on openalexR and other packages (microdemic and fulltext), @yjunechoe suggested we implement a function to perform snowball searches. Specifically, given an identifier (say, DOI), find the papers it cites and the papers that cite it.
Reference: https://github.com/yjunechoe/Snowglobe