ropensci / jstor

Import journal data from DfR (JSTOR)
https://docs.ropensci.org/jstor

find_references fails silently #48

Closed tklebel closed 6 years ago

tklebel commented 6 years ago

References for articles from "Gènese" are currently not being extracted. Example file: journal-article-10.2307_26197863.xml

This is the responsible function:

extract_ref_content <- function(x) {
  if (identical(xml2::xml_attr(x, "content-type"), "parsed-citations")) {
    x %>%
      xml_find_all("title|ref/mixed-citation") %>%
      map_chr(collapse_text)

  } else if (is.na(xml2::xml_attr(x, "content-type"))) {
    x %>%
      xml_find_all("title|ref/mixed-citation/node()[not(self::*)]") %>%
      xml_text() %>%
      purrr::keep(str_detect, "[a-z]") %>%
      str_replace("^\\\n", "") # remove "\n" at beginning of strings

  } else if (identical(xml2::xml_attr(x, "content-type"), "unparsed")) {
    x %>%
      xml_find_all("title|ref/mixed-citation") %>%
      xml_text()
  }
}
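
Note that the `if`/`else if` chain above has no final `else`. In R, such a chain evaluates to `NULL` (invisibly) when no condition matches, with no warning or error. A minimal sketch of that behaviour in base R follows; `classify_content_type` is a hypothetical stand-in that takes the content-type string directly rather than an XML node, and is not part of the package:

```r
# Hypothetical stand-in for extract_ref_content(): same branch structure,
# but it receives the content-type string instead of an xml2 node.
classify_content_type <- function(content_type) {
  if (identical(content_type, "parsed-citations")) {
    "parsed"
  } else if (is.na(content_type)) {
    "missing"
  } else if (identical(content_type, "unparsed")) {
    "unparsed"
  }
  # No final `else`: any other value falls through all three branches.
}

# A content-type that none of the branches handle.
res <- classify_content_type("unparsed-citations")

# The call succeeds without any message, but the result is NULL.
is.null(res)
```

Because the `NULL` is returned invisibly, nothing is printed at the console either, which is why the missing references are easy to overlook.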

The content-type of these references is "unparsed-citations", which matches none of the three branches, so the function returns NULL without any warning or error. Solutions: