jkeirstead / scholar

Analyse citation data from Google Scholar

Years with zero citations cause get_article_cite_history() to fail #102

Open joelmcg opened 3 years ago

joelmcg commented 3 years ago

When using get_article_cite_history(), an article that has one or more years with zero citations triggers one of two failures. First, the call may stop with an error indicating that years and vals have different lengths:

get_article_cite_history("wSXViPYAAAAJ", "KlAtU1dfN6UC")

Error in data.frame(year = years, cites = vals) : 
  arguments imply differing number of rows: 17, 16

Second, years that should show zero citations may be filled in with incorrect values:

get_article_cite_history("wSXViPYAAAAJ", "9ZlFYXVOiuMC")

   year cites        pubid
1  2005     1 9ZlFYXVOiuMC
2  2006     1 9ZlFYXVOiuMC
3  2007     1 9ZlFYXVOiuMC
4  2008     1 9ZlFYXVOiuMC
5  2009     1 9ZlFYXVOiuMC
6  2010     1 9ZlFYXVOiuMC
7  2011     1 9ZlFYXVOiuMC
8  2012     1 9ZlFYXVOiuMC
9  2013     1 9ZlFYXVOiuMC
10 2014     1 9ZlFYXVOiuMC
11 2015     1 9ZlFYXVOiuMC
12 2016     1 9ZlFYXVOiuMC

The correct citation history for this article contains many zeros:

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=wSXViPYAAAAJ&cstart=20&pagesize=80&citation_for_view=wSXViPYAAAAJ:9ZlFYXVOiuMC
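
Both symptoms are consistent with the scraped vals vector being shorter than years. The sketch below is not taken from the package; it uses made-up inputs to show how base R's data.frame() behaves in each case. The assumption that only one bar value was scraped for the second article is a guess that happens to match the output above.

```r
# Hypothetical inputs: 12 axis years but only a single scraped bar value.
years <- 2005:2016
vals <- 1

# 12 is an exact multiple of 1, so data.frame() silently recycles vals
# and every year ends up with the same citation count.
recycled <- data.frame(year = years, cites = vals)

# 17 is not a multiple of 16, so data.frame() stops with an error instead.
msg <- tryCatch(
  {
    data.frame(year = 2000:2016, cites = rep(1, 16))
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(recycled)
print(msg)
```

If this is what happens, the silent case is the more dangerous of the two, since it returns a plausible-looking data frame with wrong counts.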

Thanks for looking into this!

Cheers, Joel

TerrestrialEcologyLabSERC commented 3 years ago

I am having the same issue!

get_article_cite_history("QtuhiVMAAAAJ", "IjCSPb-OGe4C")

Error in data.frame(year = years, cites = vals) : 
  arguments imply differing number of rows: 16, 15

I have confirmed by checking the Google Scholar page that the problem occurs for articles that have years with no citations.

Thank you so much!

rmwaterhouse commented 2 years ago

Also having the same issue ... took a while to figure it out - any year with zero citations causes get_article_cite_history to die.

mkiang commented 2 years ago

I'm pretty sure the issue has to do with a dependency and/or conflict upstream. If I modify get_article_cite_history() such that the only thing I change is to make the rvest namespace explicit for related functions, everything works as intended.

For example, here is the original get_article_cite_history() function:

get_article_cite_history <- function(id, article) {
    site <- getOption("scholar_site")
    id <- tidy_id(id)
    url_base <- paste0(site, "/citations?", "view_op=view_citation&hl=en&citation_for_view=")
    url_tail <- paste(id, article, sep = ":")
    url <- paste0(url_base, url_tail)
    res <- get_scholar_resp(url)
    if (is.null(res)) 
        return(NA)
    httr::stop_for_status(res, "get user id / article information")
    doc <- read_html(res)
    years <- doc %>% html_nodes(".gsc_oci_g_t") %>% html_text() %>% 
        as.numeric()
    vals <- doc %>% html_nodes(".gsc_oci_g_al") %>% html_text() %>% 
        as.numeric()
    df <- data.frame(year = years, cites = vals)
    if (nrow(df) > 0) {
        df <- merge(data.frame(year = min(years):max(years)), 
            df, all.x = TRUE)
        df[is.na(df)] <- 0
        df$pubid <- article
    }
    else {
        df$pubid <- vector(mode = mode(article))
    }
    return(df)
}

Here is my modified function (called get_article_cite_history_2()):

get_article_cite_history_2 <- function (id, article) {

    site <- getOption("scholar_site")
    id <- tidy_id(id)
    url_base <- paste0(site, "/citations?",
                       "view_op=view_citation&hl=en&citation_for_view=")
    url_tail <- paste(id, article, sep=":")
    url <- paste0(url_base, url_tail)

    res <- get_scholar_resp(url)
    httr::stop_for_status(res, "get user id / article information")
    doc <- rvest::read_html(res)

    ## Inspect the bar chart to retrieve the citation values and years
    years <- doc %>%
        rvest::html_nodes(".gsc_oci_g_a") %>% 
        rvest::html_attr('href') %>% 
        stringr::str_match("as_ylo=(\\d{4})&") %>% 
        "["(,2) %>% 
        as.numeric()
    vals <- doc %>%
        rvest::html_nodes(".gsc_oci_g_al") %>% 
        rvest::html_text() %>% 
        as.numeric()

    df <- data.frame(year = years, cites = vals)
    if(nrow(df)>0) {
        ## There may be undefined years in the sequence so fill in these gaps
        df <- merge(data.frame(year=min(years):max(years)),
                    df, all.x=TRUE)
        df[is.na(df)] <- 0
        df$pubid <- article
    } else {
        # complete the 0 row data.frame to be consistent with normal results
        df$pubid <- vector(mode = mode(article))
    }
    return(df)
}
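
Besides the explicit rvest:: prefixes, the modified function above also reads each year from the href of a bar (.gsc_oci_g_a) instead of from the axis labels (.gsc_oci_g_t). That may be what keeps years and vals the same length, since a year with no bar contributes neither a year nor a value. The sketch below illustrates that idea offline with base R only; the href strings and counts are made up for illustration and are not real Google Scholar markup.

```r
# Hypothetical hrefs, one per bar actually drawn (no bar for 2018).
hrefs <- c(
  "/scholar?as_ylo=2016&as_yhi=2016",
  "/scholar?as_ylo=2017&as_yhi=2017",
  "/scholar?as_ylo=2019&as_yhi=2019",
  "/scholar?as_ylo=2020&as_yhi=2020",
  "/scholar?as_ylo=2021&as_yhi=2021"
)
# Hypothetical bar labels, in the same order as the hrefs.
vals <- c(3, 1, 1, 1, 5)

# Pull the four-digit year out of each href (base R stand-in for str_match).
years <- as.numeric(sub(".*as_ylo=([0-9]{4})&.*", "\\1", hrefs))

# years and vals now have one entry per bar, so they always line up.
df <- data.frame(year = years, cites = vals)

# Fill the missing years with zero, as the package code does.
full <- merge(data.frame(year = min(years):max(years)), df, all.x = TRUE)
full[is.na(full)] <- 0
print(full)
```

The merge() with all.x = TRUE keeps every year in the full range, and the years without a bar come back as NA before being set to 0.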

The output from running each of these:

> scholar::get_article_cite_history("eD9_J3wAAAAJ", "_FxGoFyzp5QC")
Error in data.frame(year = years, cites = vals) : 
  arguments imply differing number of rows: 6, 5
> get_article_cite_history_2("eD9_J3wAAAAJ", "_FxGoFyzp5QC")
  year cites        pubid
1 2016     3 _FxGoFyzp5QC
2 2017     1 _FxGoFyzp5QC
3 2018     0 _FxGoFyzp5QC
4 2019     1 _FxGoFyzp5QC
5 2020     1 _FxGoFyzp5QC
6 2021     5 _FxGoFyzp5QC

A suboptimal workaround for now is to replace get_article_cite_history() with the function above after calling library(scholar), but this seems like something a dev can patch quickly.