Snowball search and some other ideas

trangdata commented 2 years ago

After a discussion on openalexR and other packages (microdemic and fulltext), @yjunechoe suggested we implement a function to perform snowball searches. Specifically, given an identifier (say, DOI), find the papers it cites and the papers that cite it.

Reference: https://github.com/yjunechoe/Snowglobe

massimoaria commented 2 years ago

In bibliometrics, a snowball search means identifying a target work and, starting from it, downloading all items citing it and all items cited by it. So are we thinking about a function that automatically creates this kind of query (all citing works and all cited works from an ID) and downloads the metadata? Yes, I think it could be interesting.

But I also have another idea. The most popular bibliographic databases, Scopus and Web of Science, allow downloading full metadata about the references of each work. At the moment, OA returns only a list of identifiers as cited-reference metadata. It could be interesting to let users download more information about cited references, for example a record with Authors, Title, Journal, Publication Year, and DOI. The data could then be used to perform co-citation analysis, historiograph analysis, etc. as implemented in bibliometrix, VOSviewer, etc. We could add an argument (e.g. bibliography = c(TRUE, FALSE)) to the functions oa_fetch and oa_request.
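To make the co-citation use case concrete, here is a minimal base R sketch on made-up data. It assumes the references are already available as a two-column table (citing work, cited work); the `refs` table and its IDs are hypothetical and are not what OpenAlex or openalexR returns.

```r
# Toy reference table: each row means `citing` lists `cited` in its bibliography.
# All IDs are made up for illustration.
refs <- data.frame(
  citing = c("A", "A", "A", "B", "B", "C"),
  cited  = c("X", "Y", "Z", "X", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Incidence matrix: rows = citing works, columns = cited works (0/1 entries).
incidence <- unclass(table(refs$citing, refs$cited))

# Co-citation matrix: entry [i, j] counts the works that cite both i and j.
cocitation <- crossprod(incidence)
diag(cocitation) <- 0  # drop self-pairs

cocitation["X", "Y"]
#> [1] 2
```

X and Y are co-cited twice because both A and B cite the pair. Richer reference metadata (authors, title, journal) would only add columns to `refs`; the counting step stays the same.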

massimoaria commented 2 years ago

@trangdata I have just added a commit that introduces the function oa_snowball(). It fetches all documents citing and cited by one or more target identifiers and returns a tibble including all of them. I added a column "role" that reports the role of each document in the tibble ("citing", "cited", "target").

trangdata commented 2 years ago

@massimoaria Amazing work, thank you! I have been swamped this week, but I'm excited to try it out soon! Please feel free to close the issue.

massimoaria commented 2 years ago

I would like to work also on the "other ideas". So for the moment, I prefer to leave the issue open.

massimoaria commented 2 years ago

> @massimoaria Amazing work, thank you! I have been swamped this week, but I'm excited to try it out soon! Please feel free to close the issue.

@trangdata Have you tried the oa_snowball function? What do you think about it? Remarks? Ideas?

trangdata commented 2 years ago

@massimoaria I will try to do this in the next week. All these concepts are still very new to me so it will take a while, but I'll get there!

trangdata commented 2 years ago

After exploring oa_snowball a bit, here's what I'm thinking:

Should we return a list of 2 elements:

1) Relationship table of 2 columns: from, to (a row A, B means A cites B); each element is an OpenAlex ID
2) Metadata of all of the OpenAlex IDs in (1).

We should be cautious of works that cite multiple input identifiers at the same time (this should be captured in the output list's first element). This would also make it easier to perform network analyses on the result (or perhaps co-citation analysis, historiograph analysis, etc. as implemented in bibliometrix, VOSviewer, etc. as you mentioned).
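A minimal sketch of the proposed two-element list, using made-up IDs and titles (nothing here comes from a real query). It shows why a work that cites several inputs needs several rows in the relationship table but only one row of metadata.

```r
# Hypothetical snowball result for two input works, W1 and W2.
# A row (from, to) means "from cites to".
snowball <- list(
  relationships = data.frame(
    from = c("W10", "W10", "W11", "W1",  "W2"),
    to   = c("W1",  "W2",  "W1",  "W20", "W20"),
    stringsAsFactors = FALSE
  ),
  data = data.frame(
    id    = c("W1", "W2", "W10", "W11", "W20"),
    title = c("Input one", "Input two", "Cites both", "Cites one", "Cited by both"),
    stringsAsFactors = FALSE
  )
)
inputs <- c("W1", "W2")

# Count how many inputs each citing work points to.
citing_inputs  <- snowball$relationships[snowball$relationships$to %in% inputs, ]
n_inputs_cited <- table(citing_inputs$from)

# Works that cite more than one input.
names(n_inputs_cited)[n_inputs_cited > 1]
#> [1] "W10"
```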

What do you think? @massimoaria @yjunechoe

Also, I'm not seeing that we use "cited_by" as an OA filter, only "cites". Should we check on this?
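For reference, the two filters correspond to the two halves of a snowball search. The sketch below only builds the query URLs and does not call the API; `snowball_urls` is a hypothetical helper, not part of openalexR, and it assumes the `cites` and `cited_by` filters on the OpenAlex works endpoint behave as their names suggest.

```r
# Build the two OpenAlex query URLs for one work.
# `snowball_urls` is a hypothetical helper for illustration only.
snowball_urls <- function(id, endpoint = "https://api.openalex.org/") {
  c(
    # works that cite `id` (incoming citations)
    citing = paste0(endpoint, "works?filter=cites:", id),
    # works that `id` cites (its reference list)
    cited  = paste0(endpoint, "works?filter=cited_by:", id)
  )
}

urls <- snowball_urls("W2741809807")
urls
```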

massimoaria commented 2 years ago

@trangdata following your remark, I have just modified the function that now returns a list of two elements:

1) Relationship table of 2 columns: from, to (a row A, B means A cites B); each element is an OpenAlex ID
2) Metadata of all of the OpenAlex IDs in (1).

The function code has also been modified to be more efficient.

yjunechoe commented 2 years ago

@trangdata I think that's the way to go as well! When I implemented snowball output on Snowglobe, all information from the snowball search was returned as a single dataframe, where it was a cbind of paper metadata, and columns for connection information, kind of like this:

id	title	...	n_cites	cites	n_cited_by	cited_by
100	foo	...	2	98, 99	1	101
200	bar	...	2	198, 199	2	201, 202
300	wug	...	4	296, 297, 298, 299	2	301, 302

This was an ugly format, though I will say that this was actually the preferred format for the paper screening team (who'd work together on google sheets, highlighting rows to indicate paper inclusion/exclusion). With this 1-table format they could prioritize screening papers with high citations/references by sorting on the n_* columns, and they could quickly see what other papers a paper has connections to with a simple Ctrl+F on the comma-separated cites and cited_by columns. Perhaps something like this could also be another "flattened" format that's offered?
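One way such a flattened table could be built in base R from an edge list and a node table is sketched below. The data are made up, and `collapse_ids` and `flat` are hypothetical names; this is not the Snowglobe implementation.

```r
# Hypothetical edges (a row means `from` cites `to`) and node metadata.
edges <- data.frame(
  from = c("100", "100", "101", "200"),
  to   = c("98",  "99",  "100", "100"),
  stringsAsFactors = FALSE
)
nodes <- data.frame(
  id    = c("98", "99", "100", "101", "200"),
  title = c("a", "b", "foo", "c", "bar"),
  stringsAsFactors = FALSE
)

# For each id, paste the matching values into one comma-separated string.
# Ids with no match get an empty string.
collapse_ids <- function(key, value, ids) {
  grouped <- split(value, key)
  out <- vapply(ids, function(i) paste(grouped[[i]], collapse = ", "), character(1))
  unname(out)
}

flat <- nodes
flat$n_cites    <- as.integer(table(factor(edges$from, levels = nodes$id)))
flat$cites      <- collapse_ids(edges$from, edges$to, nodes$id)
flat$n_cited_by <- as.integer(table(factor(edges$to, levels = nodes$id)))
flat$cited_by   <- collapse_ids(edges$to, edges$from, nodes$id)
flat
```

Sorting `flat` on the n_* columns then gives the screening order described above.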

And thanks @massimoaria for implementing this! I just tested it out and have three thoughts on the current output format:

1) If the idea is to have $data store only the metadata of the papers involved in the search, should it contain unique entries for papers? For example, the snowball search from the docs has a duplicate entry for a paper in $data which only varies in the value of the role column, and this information can be recovered from the $relationships dataframe:

library(dplyr)

# Keep every row whose id occurs more than once in $data
dups <- snowball_docs$data %>%
  filter(id %in% id[duplicated(id)])
unique(dups$id)
#> [1] "https://openalex.org/W2785823074"
waldo::compare(dups[1,], dups[2,])
#> `old$role`: "citing"
#> `new$role`: "cited"

2) Currently, the IDs in from/to and id are not exact matches because of the gsub(). I agree that it's redundant if we keep paper ids as the full url in $relationships, but this does make it difficult to do relational data operations between relationships and data later. No clean solution either way but this was the first thing that came to mind:

https://github.com/massimoaria/openalexR/blob/b3507d547da9ef1603a686949dda73db9de6ec3d/R/oa_snowball.R#L72-L74

3) Kind of related to (2), one such relational data operation that someone might want to do is to convert it to a <tbl_graph> object from {tidygraph} for more serious network analysis/visualization. I think it'd be nice if the elements of the output of oa_snowball() followed the standardized names "edges" and "nodes", so it can be fed directly into tidygraph::as_tbl_graph() (another alternative is "vertices" and "links") - hopefully this is a trivial request!

For example, here's a short network analysis workflow I'm imagining with the output of oa_snowball(), addressing points 1-3 explicitly in the data wrangling here:

library(dplyr)
library(tidygraph)
library(openalexR)
options(pillar.print_min = 3)

# Example from the docs
snowball_docs <- oa_snowball(
  identifier = c("W2741809807", "W2755950973"),
  endpoint = "https://api.openalex.org/",
  verbose = TRUE
)

snowball_docs
#> $relationships
#> # A tibble: 2,133 × 2
#>   from        to         
#>   <chr>       <chr>      
#> 1 W3160856016 W2755950973
#> 2 W3036495543 W2755950973
#> 3 W3123854369 W2741809807
#> # … with 2,130 more rows
#> 
#> $data
#> # A tibble: 2,132 × 27
#>   id        displ…¹ author ab    publi…² relev…³ so    so_id publi…⁴ issn  url   first…⁵ last_…⁶ volume issue is_oa cited…⁷
#>   <chr>     <chr>   <list> <chr> <chr>   <lgl>   <chr> <chr> <chr>   <lis> <chr> <chr>   <chr>   <chr>  <chr> <lgl>   <int>
#> 1 https://… How to… <df>   Bibl… 2021-0… NA      Jour… http… Elsevi… <chr> http… 285     296     133    <NA>  TRUE      436
#> 2 https://… Impact… <df>   The … 2020-0… NA      Anna… http… Spring… <chr> http… <NA>    <NA>    <NA>   <NA>  TRUE      267
#> 3 https://… Data‐D… <df>   Data… 2019-0… NA      Adva… http… Wiley   <chr> http… 1900808 1900808 6      21    TRUE      178
#> # … with 2,129 more rows, 10 more variables: counts_by_year <list>, publication_year <int>, cited_by_api_url <chr>,
#> #   ids <list>, doi <chr>, type <chr>, referenced_works <list>, related_works <list>, concepts <list>, role <chr>, and
#> #   abbreviated variable names ¹display_name, ²publication_date, ³relevance_score, ⁴publisher, ⁵first_page, ⁶last_page,
#> #   ⁷cited_by_count

snowball_docs_formatted <- snowball_docs

# Point #1) Remove duplicated paper id in metadata dataframe
snowball_docs_formatted$data <- snowball_docs_formatted$data %>% 
  filter(!duplicated(id))

# Point #2) Turn `to` and `from` columns into workable keys
snowball_docs_formatted$relationships <- snowball_docs_formatted$relationships %>% 
  mutate(across(c(from, to), \(x) paste0("https://openalex.org/", x)))

# Point #3) Use standardized names
names(snowball_docs_formatted) <- c("edges", "nodes")

# Graph conversion with {tidygraph}
( snowball_graph <- tidygraph::as_tbl_graph(snowball_docs_formatted) )
#> # A tbl_graph: 2131 nodes and 2133 edges
#> #
#> # A bipartite simple graph with 1 component
#> #
#> # Node Data: 2,131 × 27 (active)
#>   id    displa… author ab    public… releva… so    so_id publis… issn  url   first_… last_p… volume issue is_oa cited_…
#>   <chr> <chr>   <list> <chr> <chr>   <lgl>   <chr> <chr> <chr>   <lis> <chr> <chr>   <chr>   <chr>  <chr> <lgl>   <int>
#> 1 http… How to… <df>   Bibl… 2021-0… NA      Jour… http… Elsevi… <chr> http… 285     296     133    <NA>  TRUE      436
#> 2 http… Impact… <df>   The … 2020-0… NA      Anna… http… Spring… <chr> http… <NA>    <NA>    <NA>   <NA>  TRUE      267
#> 3 http… Data‐D… <df>   Data… 2019-0… NA      Adva… http… Wiley   <chr> http… 1900808 1900808 6      21    TRUE      178
#> 4 http… Assess… <df>   Asse… 2018-0… NA      PLOS… http… Public… <chr> http… e20040… e20040… 16     3     TRUE      174
#> 5 http… Conduc… <df>   Lite… 2020-0… NA      Aust… http… SAGE    <chr> http… 175     194     45     2     TRUE      161
#> 6 http… Artifi… <df>   Abst… 2020-1… NA      Jour… http… Elsevi… <chr> http… 283     314     121    <NA>  FALSE     128
#> # … with 2,125 more rows, and 10 more variables: counts_by_year <list>, publication_year <int>, cited_by_api_url <chr>,
#> #   ids <list>, doi <chr>, type <chr>, referenced_works <list>, related_works <list>, concepts <list>, role <chr>
#> #
#> # Edge Data: 2,133 × 2
#>    from    to
#>   <int> <int>
#> 1     1  2130
#> 2     2  2130
#> 3     3  2131
#> # … with 2,130 more rows

# Example workflow: subsetting the graph to only include information about mutual citations
snowball_graph %>% 
  activate("edges") %>% 
  filter(edge_is_mutual()) %>% 
  activate("nodes") %>% 
  filter(!node_is_isolated()) %>% 
  select(id, display_name, publication_year)
#> # A tbl_graph: 2 nodes and 2 edges
#> #
#> # A directed simple graph with 1 component
#> #
#> # Node Data: 2 × 3 (active)
#>   id                               display_name                                                               publication_…
#>   <chr>                            <chr>                                                                              <int>
#> 1 https://openalex.org/W2785823074 Sci-Hub provides access to nearly all scholarly literature                          2018
#> 2 https://openalex.org/W2741809807 The state of OA: a large-scale analysis of the prevalence and impact of O…          2018
#> #
#> # Edge Data: 2 × 2
#>    from    to
#>   <int> <int>
#> 1     1     2
#> 2     2     1

Again, thank you so much to both of you for working on the snowball search!

trangdata commented 2 years ago

Thank you so much @yjunechoe for this detailed feedback! I've submitted #22 to address these points, but happy to further brainstorm if the solutions are not optimal.

0) On your first point re the "flattened" format: I added a function, to_disk, that takes the output of oa_snowball and flattens it. It basically joins the nodes element with a cited_by column for the input identifiers (and yields NA for other documents). I wanted to keep it as a separate helper function to avoid a direct dependency on dplyr. What do you think?
1) I agree that role can be inferred from edges, but I also agree with @massimoaria and think it's nice to have that column, especially to indicate "target", our input document. So, for articles with both the citing and cited roles, I put down "both". What do you both think? And is "target" a good name?
2) I added id_type = c("short", "original") as an argument for oa_snowball, so the IDs between nodes and edges always match.
3) I made this straightforward change.
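To illustrate how a role column with a "both" level could be inferred from the edges, here is a minimal base R sketch on made-up IDs. It is not the code in #22; `targets`, `edges`, and `nodes` are hypothetical.

```r
# Hypothetical inputs and edges; a row means `from` cites `to`.
targets <- c("T1", "T2")
edges <- data.frame(
  from = c("A",  "T1", "B",  "T2"),
  to   = c("T1", "B",  "T2", "C"),
  stringsAsFactors = FALSE
)

ids <- unique(c(edges$from, edges$to))

# Does the work cite at least one target? Is it cited by at least one target?
cites_target    <- ids %in% edges$from[edges$to %in% targets]
cited_by_target <- ids %in% edges$to[edges$from %in% targets]

role <- ifelse(ids %in% targets, "target",
        ifelse(cites_target & cited_by_target, "both",
        ifelse(cites_target, "citing", "cited")))

nodes <- data.frame(id = ids, role = role, stringsAsFactors = FALSE)
nodes
```

Here B is cited by T1 and cites T2, so it gets "both"; with a single input, that level can only arise from a mutual citation.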

yjunechoe commented 2 years ago

Thanks for the quick turnaround! Addressing points in order of least to most complex:

RE: 0; Actually I misremembered some key details of the flattened output (sorry!). Let me think and try again from scratch - I'll move this out to a new issue/PR since it's a separate post-processing step. I'll also see if I can implement everything in base R.

RE: 2 & 3; they look great! I like the "short" default. Just one thought: when id_type = "short", should id column in the returned dataframe be renamed to short_id? I'm leaning towards "keep the output consistent" which is how it is now, but just wondering whether people might get confused b/c we say short_id in the output of show_works()

Rest is also good!

(misunderstood point 1 initially)

trangdata commented 2 years ago

> RE: 1; I just realized I've actually been totally misinterpreting the edges dataframe!!! I thought from and to were citation directions, not search directions. I was under the impression that something like below says W1 cites W2 (citation from W1 to W2), when instead it just means W2 was searched from W1, without committing to whether W1 cites W2 or W1 is cited by W2.
>
> from	to
> W1	W2

Hmm I'm not sure if I understand your point here. What is a "search" direction? I think your original interpretation of edges was correct. This would mean W1 cites W2. All of these edges would have either from or to be one of the input identifiers. Am I also misinterpreting this?

trangdata commented 2 years ago

> RE: 2 & 3; they look great! I like the "short" default. Just one thought: when id_type = "short", should id column in the returned dataframe be renamed to short_id? I'm leaning towards "keep the output consistent" which is how it is now, but just wondering whether people might get confused b/c we say short_id in the output of show_works()

Great point. I went ahead and changed short_id to id in show_works(), because technically these are still valid OpenAlex IDs. They're just short-form. I explained this in the documentation for show_works() and show_authors().

yjunechoe commented 2 years ago

> Hmm I'm not sure if I understand your point here. What is a "search" direction? I think your original interpretation of edges was correct. This would mean W1 cites W2. All of these edges would have either from or to be one of the input identifiers. Am I also misinterpreting this?

Ah you're right I mixed that up 🤦‍♂️ - but yay that means less things to worry about!

We can come back to the distinction between "search" vs. "citation" direction, but they're just different representations of the same data. So if the input to snowball was paper B and our edges data looks like this:

from	to
A	B
B	C

Then edges representing citation directions would look like this, where arrow = direction of citation:

graph LR;
  A-->B
  B-->C

While edges representing search directions would look like this, where arrow = direction of discovery from input paper, and you'd separately encode cites/cited_by (e.g. like in linetype here):

graph LR;
  B-. forward .->A
  B-- backward -->C

If you're only doing a one-off snowball search, I don't think there's much difference, but if you want to diagnose snowball searches that run multiple iterations, then search directions can be useful!
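The two representations can be converted into each other mechanically. The base R sketch below re-expresses the A/B/C example as search-direction edges; it assumes every edge touches the input paper, and the `direction` column is a hypothetical way to encode cites/cited_by.

```r
# Citation-direction edges from the example: A cites B, B cites C; the input is B.
citation_edges <- data.frame(
  from = c("A", "B"),
  to   = c("B", "C"),
  stringsAsFactors = FALSE
)
input <- "B"

# An edge that starts at the input was found in its reference list ("backward");
# an edge that ends at the input was found among the works citing it ("forward").
from_input <- citation_edges$from %in% input

search_edges <- data.frame(
  from      = ifelse(from_input, citation_edges$from, citation_edges$to),
  to        = ifelse(from_input, citation_edges$to, citation_edges$from),
  direction = ifelse(from_input, "backward", "forward"),
  stringsAsFactors = FALSE
)
search_edges
```

This gives B to A marked "forward" and B to C marked "backward", matching the second diagram. An edge between two input papers would need an extra rule, which this sketch does not cover.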

trangdata commented 2 years ago

Thank you @yjunechoe for the explanation! 💯 🤩 And I didn't know about mermaid!!! 🤯

So, couple last things I'd like some input on: @massimoaria @yjunechoe

1) Right now, nodes$role are citing, cited, both, and target. Are these the best levels? Or should we make it clearer: "cites target", "cited by target", "both", "target"? Any other ideas?
2) Is "from A to B means A cites B" a good convention? Or should it be the other way around?

massimoaria commented 2 years ago

> Thank you @yjunechoe for the explanation! 💯 🤩 And I didn't know about mermaid!!! 🤯
>
> So, couple last things I'd like some input on: @massimoaria @yjunechoe
>
> 1) Right now, nodes$role are citing, cited, both, and target. Are these the best levels? Or should we make it clearer: "cites target", "cited by target", "both", "target"? Any other ideas?
>
> 2) Is "from A to B means A cites B" a good convention? Or should it be the other way around?

Sorry for my absence but this semester is really busy for me. The "relationships" object is an edge matrix in which the directional link A (from) -> B (to) means that A cites B. This is the only concept that makes sense if you want to use this information to create a graph of direct citations (e.g., Historiograph as implemented in bibliometrix).

Regarding the roles stored in nodes$role, I think that the "both" level is unnecessary because a scientific publication (e.g., A) can only cite previously published papers (e.g., B and C). B and C will not be able to cite A because the latter did not exist on the date they were published.

yjunechoe commented 2 years ago

> Right now, nodes$role are citing, cited, both, and target. Are these the best levels? Or should we make it clearer: "cites target", "cited by target", "both", "target"? Any other ideas?

This also has me kind of stumped, especially if there are multiple input papers. But as long as it's a factor, the docs can clarify what each level means.

> Is "from A to B means A cites B" a good convention? Or should it be the other way around?

I think this makes more sense than the other way around! I suppose the documentation could emphasize this point more but it's not difficult to me

trangdata commented 2 years ago

> Regarding the roles stored in nodes$role, I think that the "both" level is unnecessary because a scientific publication (e.g., A) can only cite previously published papers (e.g., B and C). B and C will not be able to cite A because the latter did not exist on the date they were published.

I think we need to be careful referencing "publication_date", since there may be an online early release date (or similar) and there may be a final date later (usually months after). If this paper is cited by another article that gets published just before this final date, it's going to be difficult to infer the citation direction. And by "both" here, I only mean that, in a particular query, this one article A is cited by (at least) one of the identifiers (input articles) and A also cites another identifier. Does that make sense? Or should we use a completely different term to indicate "not target"? target and neighbor? old and new? source and peripheral? hub and other? anything else?

I'm also okay with removing this column altogether, although I still think "role" is helpful in flagging the original identifiers. Another idea is to make a boolean column named "input" or "provided", TRUE if the work is in the original list of identifiers, else FALSE.

(Misread June's comment initially)

> I think this makes more sense than the other way around! I suppose the documentation could emphasize this point more but it's not difficult to me

Could you explain why this is? @yjunechoe In my head, A -> B means A points up to B, so B is the ancestor/A cites B.

massimoaria commented 2 years ago

> Regarding the roles stored in nodes$role, I think that the "both" level is unnecessary because a scientific publication (e.g., A) can only cite previously published papers (e.g., B and C). B and C will not be able to cite A because the latter did not exist on the date they were published.
>
> I think we need to be careful referencing "publication_date", since there may be an online early release date (or similar) and there may be a final date later (usually months after). If this paper is cited by another article that gets published just before this final date, it's going to be difficult to infer the citation direction. And by "both" here, I only mean that, in a particular query, this one article A is cited by (at least) one of the identifiers (input articles) and A also cites another identifier. Does that make sense? Or should we use a completely different term to indicate "not target"? target and neighbor? old and new? source and peripheral? hub and other? anything else?

Yes, you are right. This is a good point.

> I'm also okay with removing this column altogether, although I still think "role" is helpful in flagging the original identifiers.

> I think this makes more sense than the other way around! I suppose the documentation could emphasize this point more but it's not difficult to me

> Could you explain why this is? @yjunechoe In my head, A -> B means A points up to B, so B is the ancestor/A cites B.

Yes, A -> B means A cites B, so B is the ancestor. In bibliometrics, the "citation action" goes from A to B. That is the information we report in the relationships matrix.

yjunechoe commented 2 years ago

> Could you explain why this is? @yjunechoe In my head, A -> B means A points up to B, so B is the ancestor/A cites B.

Oh I was agreeing with you that "from A to B" means "A cites B" to me! Like "there is a citation from A to B"

trangdata commented 2 years ago

> Oh I was agreeing with you that "from A to B" means "A cites B" to me! Like "there is a citation from A to B"

Oops my bad, I read your comment without "than" and its meaning was reversed. 🙃

trangdata commented 2 years ago

Back to the role column: Instead, what do you both think about a boolean column named "oa_input" or ..?, TRUE if the work is in the original list of identifiers, else FALSE.
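A boolean flag like this takes one line to compute. The sketch below uses a made-up node table; "oa_input" is only the name floated above, not a settled column name.

```r
# Hypothetical node table and input identifiers.
nodes <- data.frame(id = c("W1", "W2", "W3"), stringsAsFactors = FALSE)
identifier <- c("W2")

# TRUE only for works that were in the original list of identifiers.
nodes$oa_input <- nodes$id %in% identifier
nodes
```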

trangdata commented 2 years ago

Resolved in #22. Remaining discussion moved to #23.

ropensci / openalexR

Snowball search and some other ideas #9