ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Error when trying to plot graph - is it `openalexR`, #126

Closed rkrug closed 1 year ago

rkrug commented 1 year ago

Hi, I am trying to use oa_snowball and then graph the result with tidygraph and ggraph on a set of (at the moment) 3 works (ids in the example below), but I get the error shown. If I exclude the first work, it works, so I assume the results returned from OpenAlex are missing some information. Any ideas? Thanks,

Rainer

I am unsure, if it is

library(openalexR)
library(tidygraph)

ids <- c("W1896013598", "W312683970", "W2084630927")

ilk_snowball <- oa_snowball(
    identifier = ids,
    verbose = FALSE
)

as_tbl_graph(ilk_snowball)
Error in (function (edges, n = max(edges), directed = TRUE)  : 
  At core/constructors/basic_constructors.c:72 : Invalid (non-finite or NaN) vertex index when creating graph. Invalid value

yjunechoe commented 1 year ago

Thanks - I can replicate this error. This happens because one of the edges connects to a paper that is not present in nodes:

library(openalexR)
library(tidygraph)

ids <- c("W1896013598", "W312683970", "W2084630927")

ilk_snowball <- oa_snowball(
  identifier = ids,
  verbose = FALSE
)

edge_to_matches <- match(ilk_snowball$edges$to, ilk_snowball$nodes$id)
unmatched <- ilk_snowball$edges[which(is.na(edge_to_matches)), ]$to
unmatched
#> [1] "W1488199547"
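
To see why a single dangling edge is enough to break the conversion, here is a minimal sketch on toy data. The ids and the `toy` object are invented for illustration, and it is only an assumption that the conversion resolves edge ids to node row positions in roughly this way. An id that is absent from nodes has no row position, so the lookup yields NA, which cannot serve as a vertex index:

```r
# Toy stand-in for a snowball result; the ids are invented for illustration.
toy <- list(
  nodes = data.frame(id = c("W1", "W2", "W3"), stringsAsFactors = FALSE),
  edges = data.frame(
    from = c("W2", "W3", "W3"),
    to = c("W1", "W1", "W9"),
    stringsAsFactors = FALSE
  )
)

# Translate edge endpoints into row positions of the node table.
from_idx <- match(toy$edges$from, toy$nodes$id)
to_idx <- match(toy$edges$to, toy$nodes$id)

# "W9" is not a node, so the third edge gets an NA target index.
to_idx

# Check both endpoints, not only `to`.
dangling <- is.na(from_idx) | is.na(to_idx)
toy$edges[dangling, ]
```

Checking `from` as well as `to` costs nothing and would catch a missing citing work too, although in this report only `to` was affected.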

For now, the tbl_graph conversion works if you remove the edge that points to the missing node:

# %in% also works if more than one id is unmatched
ilk_snowball$edges <- ilk_snowball$edges[!ilk_snowball$edges$to %in% unmatched, ]
as_tbl_graph(ilk_snowball)
#> # A tbl_graph: 301 nodes and 298 edges
#> #
#> # A rooted forest with 3 trees
#> #
#> # A tibble: 301 × 31
#>   id    display_name author ab    publication_date so    so_id host_organization
#>   <chr> <chr>        <list> <chr> <chr>            <chr> <chr> <chr>            
#> 1 W189… Reforesting… <df>   ""    2009-09-01       AMBI… http… Royal Swedish Ac…
#> 2 W312… Does outmig… <df>   "In … 2015-08-01       Appl… http… Elsevier BV      
#> 3 W208… Lake victor… <df>   "The… 2004-05-01       Limn… http… Elsevier BV      
#> 4 W210… Classical b… <df>   "Of … 2010-08-01       Biol… http… Elsevier BV      
#> 5 W198… Payments fo… <df>   "Rec… 2012-05-01       Geof… http… Elsevier BV      
#> 6 W288… The role of… <df>   "Inv… 2019-01-01       Jour… http… Elsevier BV      
#> # ℹ 295 more rows
#> # ℹ 23 more variables: issn_l <chr>, url <chr>, pdf_url <chr>, license <chr>,
#> #   version <chr>, first_page <chr>, last_page <chr>, volume <chr>,
#> #   issue <chr>, is_oa <lgl>, cited_by_count <int>, counts_by_year <list>,
#> #   publication_year <int>, cited_by_api_url <chr>, ids <list>, doi <chr>,
#> #   type <chr>, referenced_works <list>, related_works <list>,
#> #   is_paratext <lgl>, is_retracted <lgl>, concepts <list>, oa_input <lgl>
#> #
#> # A tibble: 298 × 2
#>    from    to
#>   <int> <int>
#> 1     4     3
#> 2     5     1
#> 3     6     3
#> # ℹ 295 more rows

It turns out that the missing node W1488199547 is a Deleted Work, so it may have been filtered out early:

openalexR::oa_fetch(identifier = "W1488199547")[,1:2]
#> # A tibble: 1 × 2
#>   id                               display_name
#>   <chr>                            <chr>       
#> 1 https://openalex.org/W4285719527 Deleted Work

All in all, this is confusing, so maybe oa_snowball should at least check that nodes and edges are consistent with each other. Anyway, thanks for the report!
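
As a rough sketch of what such a consistency check could look like on the user side: `drop_dangling_edges()` is a hypothetical helper, not part of openalexR. It only assumes a list with a `nodes` data frame holding an `id` column and an `edges` data frame holding `from` and `to` columns, as in the snowball result above.

```r
# Hypothetical helper (not an openalexR function): keep only edges whose
# endpoints both exist in nodes, and warn about anything that was dropped.
drop_dangling_edges <- function(snowball) {
  known <- snowball$nodes$id
  ok <- snowball$edges$from %in% known & snowball$edges$to %in% known
  if (!all(ok)) {
    missing_ids <- setdiff(
      unique(c(snowball$edges$from, snowball$edges$to)),
      known
    )
    warning(
      "Dropping ", sum(!ok), " edge(s) pointing to missing node(s): ",
      paste(missing_ids, collapse = ", "),
      call. = FALSE
    )
  }
  snowball$edges <- snowball$edges[ok, , drop = FALSE]
  snowball
}

# Toy data with invented ids; "W9" has no matching node.
toy <- list(
  nodes = data.frame(id = c("W1", "W2", "W3"), stringsAsFactors = FALSE),
  edges = data.frame(
    from = c("W2", "W3", "W3"),
    to = c("W1", "W1", "W9"),
    stringsAsFactors = FALSE
  )
)
cleaned <- suppressWarnings(drop_dangling_edges(toy))
nrow(cleaned$edges)
```

With the data from this issue, the call would presumably be as_tbl_graph(drop_dangling_edges(ilk_snowball)). Warning instead of dropping silently keeps the missing ids visible, which matters here because the cause was a Deleted Work.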

rkrug commented 1 year ago

I agree. Thanks for looking into this.