ropensci / jstor

Import journal data from DfR (JSTOR)
https://docs.ropensci.org/jstor

Rewrite extractions for speed #74

Open tklebel opened 5 years ago

tklebel commented 5 years ago

Jim gave a good hint on how some of the extraction functions could be made faster (https://github.com/r-lib/xml2/issues/237). This should apply to authors and possibly to book chapters too. For references it does not currently work, for several reasons, chiefly that we sometimes want to extract more than one item per reference.

tklebel commented 5 years ago

Solution by Jim: store multi-valued fields in a list column and process all the values at once downstream. That lets us use vectorized functions, which should speed up the extraction quite a bit.

library(xml2)
library(magrittr)

xml <- read_xml('<ref-list>
  <ref id="ref1">
      <citation type="book">
        <title>Some title</title>
        <publisher>Some publisher</publisher>
      </citation>
   </ref>
   <ref id="ref2">
      <citation type="journal">
        <title>Another title</title>
        <volume>1</volume>
        <author>Author 1/1</author>
        <author>Author 1/2</author>
      </citation>
   </ref>
</ref-list>')

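# Each <ref> node becomes one row of the result.
refs <- xml_find_all(xml, ".//ref")

res <- tibble::tibble(
  # xml_find_first() is vectorised over the nodeset; missing fields become NA.
  title = refs %>% xml_find_first("./citation/title") %>% xml_text(),
  volume = refs %>% xml_find_first("./citation/volume") %>% xml_text(),
  publisher = refs %>% xml_find_first("./citation/publisher") %>% xml_text(),
  # A reference can have several authors, so keep them in a list column.
  author = refs %>% purrr::map(~ xml_find_all(.x, "./citation/author") %>% xml_text())
)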

res
#> # A tibble: 2 x 4
#>   title         volume publisher      author   
#>   <chr>         <chr>  <chr>          <list>   
#> 1 Some title    <NA>   Some publisher <chr [0]>
#> 2 Another title 1      <NA>           <chr [2]>

res$author
#> [[1]]
#> character(0)
#> 
#> [[2]]
#> [1] "Author 1/1" "Author 1/2"
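
To illustrate what "treat the values all at once downstream" could look like, here is a minimal sketch using only xml2 and base R. The names `authors`, `n_authors` and `authors_collapsed` and the "; " separator are assumptions made for this example, not part of the package, and the XML is a cut-down version of the one above.

```r
library(xml2)

# Cut-down version of the example document above.
xml <- read_xml('<ref-list>
  <ref id="ref1">
    <citation type="book"><title>Some title</title></citation>
  </ref>
  <ref id="ref2">
    <citation type="journal">
      <title>Another title</title>
      <author>Author 1/1</author>
      <author>Author 1/2</author>
    </citation>
  </ref>
</ref-list>')

refs <- xml_find_all(xml, ".//ref")

# One character vector of authors per reference (the list-column content).
authors <- lapply(refs, function(ref) {
  xml_text(xml_find_all(ref, "./citation/author"))
})

# Downstream steps operate on the whole list in one call.
n_authors <- lengths(authors)

# Collapse each author vector to a single string; NA when there are none.
# The "; " separator is an arbitrary choice for this sketch.
authors_collapsed <- vapply(
  authors,
  function(x) if (length(x) == 0) NA_character_ else paste(x, collapse = "; "),
  character(1)
)

n_authors
authors_collapsed
```

The XML is only queried once per reference; counting and collapsing then work on plain character vectors, which is where the speed-up should come from. Whether this beats the current implementation on real DfR files would still need benchmarking.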