ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

entrez_summary(), entrez_fetch() parse errors #74

Closed gadepallivs closed 7 years ago

gadepallivs commented 8 years ago

I get the following error from code that has been working perfectly for a while now: Error in as.vector: cannot coerce type 'externalptr' to vector of type 'character'. We discussed this error earlier and the cause then was empty records; I fixed it with an if condition to handle blank records. This time, however, it is unpredictable. As of now, when I repeatedly run the code with a new search query each time (and with more than 100 hits), it breaks at the entrez_fetch() line below. What I noticed is that when I go back and re-run just that part of the code, it runs successfully (it also seems to depend on the number of hits, the time entrez_fetch() takes to run, etc.). I am not sure if you can reproduce this or have come across this error before. Could too many hits for a search query overload entrez_fetch()? Please find the code below.

query = "BRAF[Title/Abstract] OR  melanoma[Title/Abstract] AND Cancer[Title]
AND (2000[PDAT] :2015[PDAT])"
pubmed_search <- entrez_search(db = "pubmed", term = query,
                               use_history = TRUE)
pubmed_search

# At first run, I got 6976 hits. Then, after a 1-2 min pause, `entrez_fetch()`
# threw an error: `Error in as.vector: cannot coerce type 'externalptr'
# to vector of type 'character'`
fetch.pubmed <- entrez_fetch(
  db = "pubmed", web_history = pubmed_search$web_history,
  rettype = "xml", parsed = T
)
# When I went back and re-ran only the above function, it worked fine.
# Next, I ran the code below, which now throws
# `Error: parse error: premature EOF                 (right here) ------^`
pub.summary <- entrez_summary(
  db = "pubmed", version = "2.0",
  web_history = pubmed_search$web_history,
  always_return_list = TRUE
)
pubrecord.extract <-
  extract_from_esummary(
    pub.summary,
    elements = c(
      "uid","title",
      "fulljournalname",
      "pubtype", "volume",
      "issue", "pages","sortfirstauthor",
      "lastauthor",
      "pmcrefcount",
      "issn", "pubdate"
    ),
    simplify = T
  )

Note: if you change the OR to AND in the query, it works fine; the number of hits is then 117. I have also read your solution in #70 on parsing large queries and am working to implement it. If I have to go that way, I guess I need to parse the XML individually for all the elements that extract_from_esummary would otherwise extract for me in a single function.

dwinter commented 8 years ago

Hi Monty,

I'll have a look and see if there is anything rentrez can do in cases like this. In the meantime, yes, try to chunk very large requests into several smaller ones. retstart and retmax are the arguments to use; there is an example in the vignette.
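
A minimal sketch of that chunking approach (the chunk size of 500 is illustrative, and pubmed_search is the search object from your snippet):

chunk_size <- 500
starts <- seq(0, pubmed_search$count - 1, by = chunk_size)
raw_chunks <- lapply(starts, function(start) {
  entrez_fetch(db = "pubmed", web_history = pubmed_search$web_history,
               retstart = start, retmax = chunk_size, rettype = "xml")
})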

gadepallivs commented 8 years ago

Hi David, I passed my query to the function you created in #70. It still throws the same error: Error in as.vector: cannot coerce type 'externalptr' to vector of type 'character'.

dwinter commented 8 years ago

Hi Monty, I'm afraid I can't reproduce the error. Can you tell me what you get when you don't set parsed=TRUE?

gadepallivs commented 8 years ago

Hi David, I tried that; I get the same error after repeated attempts, irrespective of parsed = TRUE or FALSE. Were you able to reproduce the second error, from entrez_summary(), in the question above? Error: parse error: premature EOF (right here) ------^. It used to work fine for me as well; I am not sure what is wrong with my query now. I have updated R to 3.2.3, though I am not sure that matters: my code works for smaller queries but breaks for large ones.

query = "BRAF[Title/Abstract] OR  melanoma[Title/Abstract] AND Cancer[Title]
AND (2000[PDAT] :2015[PDAT])"

fetch_and_parse <- function(start) {
  cat(start, "\r") # let the user know where we are up to
  pubmed_records <- entrez_fetch(
    db = "pubmed", web_history = pubmed_search$web_history,
    retstart = start, retmax = 1000, rettype = "xml"
  )
  parse_pubmed_xml(pubmed_records)
}
pubmed_search <-
  entrez_search(db = "pubmed", term = query, use_history = TRUE)
pubmed_parsed <- lapply(pubmed_search, fetch_and_parse)
dput(pubmed_search)

dwinter commented 8 years ago

Hi Monty,

The error message for entrez_fetch is from the XML package, so I don't think you should be getting it if parsed is set to FALSE.

If you can reliably get this error, we might be able to get to the bottom of it. Can you run the following code? If everything goes fine, er will be NULL. If it goes wrong, er will be the raw file that is messing everything up.

query = "BRAF[Title/Abstract] OR  melanoma[Title/Abstract] AND Cancer[Title]
AND (2000[PDAT] :2015[PDAT])"
pubmed_search <- entrez_search(db = "pubmed", term = query,
                               use_history = TRUE)

did_it_parse <- function(recs){
  flag <- tryCatch(
    XML::xmlTreeParse(recs, useInternalNodes=TRUE),
    error = function(e) "FAIL"
  )
  if(typeof(flag) == "character"){
    return(FALSE)
  }
  TRUE
}

trap_error <- function(){
  res <- rentrez:::make_entrez_query(
    "efetch", config = NULL,
    WebEnv = pubmed_search$web_history$WebEnv,
    query_key = pubmed_search$web_history$QueryKey,
    rettype = "xml",
    db = "pubmed", retmax = 1000)
  cat("res is a '", typeof(res), "'\n")
  if(did_it_parse(res)){
    return(invisible())
  }
  res
}

er <- trap_error()

For making progress on your own work, you just need to use fewer records at a time (change retmax to suit) so these large files don't cause these errors.

gadepallivs commented 8 years ago

Hi David, thank you for your time. er is NULL. I am posting a traceback() of the errors I get in the question; I will work on troubleshooting and see if I can give you a reliably reproducible error. Below is the traceback, just in case it helps pinpoint the problem. Is the stop("HTTP failure: ...") at frame 12 expected? Could it be that the time it takes to fetch PubMed records for a large list of PMIDs is timing out the connection and thus throwing the error?

Error in as.vector(x, "character") : 
  cannot coerce type 'externalptr' to vector of type 'character' 
16 as.character.default(X[[i]], ...) 
15 FUN(X[[i]], ...) 
14 lapply(list(...), as.character) 
13 .makeMessage(..., domain = domain) 
12 stop("HTTP failure: ", req$status_code, "\n", message, call. = FALSE) 
11 entrez_check(response) 
10 (function (util, config, interface = ".fcgi?", by_id = FALSE, 
    ...) 
{
    uri <- paste0("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/",  ... 
9 do.call(make_entrez_query, args) 
8 entrez_fetch(db = "pubmed", web_history = pubmed_search$web_history, 
    retstart = start, retmax = 1000, rettype = "xml") at .active-rstudio-document#6
7 FUN(X[[i]], ...) at .active-rstudio-document#4
6 lapply(pubmed_search, fetch_and_parse) at .active-rstudio-document#16
5 eval(expr, envir, enclos) 
4 eval(ei, envir) 
3 withVisible(eval(ei, envir)) 
2 source("~/.active-rstudio-document") 
1 source("~/.active-rstudio-document") 
dwinter commented 8 years ago

Thanks Monty, this is helpful.

I'm not sure why you are doing lapply(pubmed_search, fetch_and_parse)? fetch_and_parse takes a starting number as its argument, not a pubmed search object.
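
If I follow what you are after, the call would look something like this (a sketch of my guess at your intent; the step of 1000 matches your retmax):

starts <- seq(0, pubmed_search$count - 1, by = 1000)
pubmed_parsed <- lapply(starts, fetch_and_parse)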

Also, it looks like the error occurs when rentrez tries to deal with the error message. Could you install a tweaked version:

devtools::install_github("rentrez", "ropensci", ref="monty")

Presuming these errors persist, let me know what error messages you now get from entrez_fetch.

gadepallivs commented 8 years ago

Hi David,

5:stop("HTTP failure: ", req$status_code, call. = FALSE)
4:entrez_check(response)
3:(function (util, config, interface = ".fcgi?", by_id = FALSE,  ...)
{
  uri <- paste0("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/",
                util, interface)
  args <- list(..., email = entrez_email(), tool = entrez_tool())
  if (by_id) {
    ids_string <- paste0("&id=", args$id, collapse = "")
    args$id <- NULL
    uri <- paste0(uri, ids_string)
  }
  else {
    if ("id" %in% names(args)) {
      args$id <- paste(args$id, collapse = ",")
    }
  }
  response <- httr::GET(uri, query = args, config = config)
  entrez_check(response)
  return(httr::content(response, as = "text"))
})(
  "esummary", db = "pubmed", config = NULL, retmode = "json",
  version = "2.0",
  WebEnv =
    "NCID_1_4959508_130.14.22.215_9001_1453991749_1393371199_0MetA0_S_MegaStore_F_1",
  query_key = "1"
)
2:do.call(make_entrez_query, args)
1:entrez_summary(
  db = "pubmed", version = "2.0", web_history = pubmed_search$web_history,
  always_return_list = TRUE
)
gadepallivs commented 8 years ago

Hi David, this morning when I tried running my code, it did not throw errors when there were 100-150 hits; earlier, anything more than 50 hits used to throw errors. However, anything more than 200 hits still consistently breaks at entrez_summary() with Error: HTTP failure: 502.

dwinter commented 8 years ago

Hi @Monty9,

Sorry, I don't have much time to work on this at present. All I can tell you is that "502" is a server-side error (computers on the NCBI side not talking to each other as expected).

In general, it's a good idea to "chunk" large requests into smaller subsets. The NCBI does seem to get flaky at times (they suggest only doing large jobs at "off peak" (USA) times), and there's not much that rentrez can do about that. But I'll see if we can capture the errors or provide useful documentation about these problems. You might consider something similar for your web app (i.e. providing users with informative messages when you run into these errors).
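
One hypothetical way to soften these transient failures (this helper is not part of rentrez; the function name, retry count, and pause are made up for illustration):

retry <- function(f, tries = 3, pause = 5) {
  res <- NULL
  for (i in seq_len(tries)) {
    # tryCatch returns either the result or the error condition
    res <- tryCatch(f(), error = function(e) e)
    if (!inherits(res, "error")) return(res)
    message("Attempt ", i, " failed: ", conditionMessage(res))
    Sys.sleep(pause)
  }
  stop(res)
}

# e.g. wrap one chunked request:
recs <- retry(function() entrez_fetch(
  db = "pubmed", web_history = pubmed_search$web_history,
  retstart = 0, retmax = 200, rettype = "xml"
))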

gadepallivs commented 8 years ago

Hi David, thank you for the response. I will incorporate your suggestion. Do I need to install the tweaked version every time? My understanding is that after you suggested I install the tweaked version, I started seeing these specific 502 and 400 errors. How long is the tweaked version valid?

dwinter commented 8 years ago

Hi Monty,

I will probably improve the error handling in the main version of rentrez in the next week or so, at least to always return text errors. I'll let you know when it's done.

arnome commented 8 years ago

Hi,

First, many thanks for this package; it's very useful. I've tried the dev version of rentrez (1.0.1) with R version 3.2.3 (2015-12-10), and I want to use entrez_summary() with the web_history option. In my session, only version "1.0" works fine.

es <- entrez_search(db = "pubmed", query, use_history = TRUE)
> esum_1 <- entrez_summary(db="pubmed", web_history = es$web_history,version="1.0")
> esum_1
List of  5161 esummary records. First record:

The version "2.0" leads to a parse error message

> esum_2 <- entrez_summary(db="pubmed", web_history = es$web_history,version="2.0")
Erreur : parse error: premature EOF

                     (right here) ------^

or to an HTTP failure 500:

Erreur : HTTP failure: 500
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
<head>
<title>NCBI/eutils211 - WWW Error 500 Diagnostic</title>
<style type="text/css"><![CDATA[ 
h1.error {color: red; font-size: 40pt}
div.diags {text-indent: 0.5in }
]]></style> 
</head>
<body>
<h1>Server Error</h1>

<p>Your request could not be processed due to a problem on
our Web server.  This could be a transient problem, please
try the query again.  If it doesn't clear up within a
reasonable period of time, e-mail a short description of your
query and the diagnostic information shown below to:</p>

<p>
pubmed@nlm.nih.gov - for problems with PubMed<br/>
webadmin@ncbi.nlm.nih.gov - for problems with other services<br/>
</p>

<p>Thank you for your assistance.  We will try to fix the
problem as soon as possible.
</p>
<hr/>
<p>
Diagnostic Information:</p>

<div class="diags">Error: 500</div>

I don't know how to deal with this error. Any suggestions?

dwinter commented 8 years ago

Bonjour @arnome, and thanks for this report.

These "transient" errors are most likely to happen with large (> a few hundred) requests. It's probably a good idea to "chunk" these into smaller requests. With the web_history approach that means using retstart and retmax, someting similar to the last chunk in this vignette example

arnome commented 8 years ago

Hi David, thanks for your reply, it works great now. Here is a function I wrote, directly inspired by the vignette:

# Function: full_entrez_summary(db, es, step)
# in:  db: name of the db to search, es: an entrez_search object, step: chunk size
# out: an esummary_list of esummary records
# ex:  summaries <- full_entrez_summary("pubmed", es, 50)
full_entrez_summary <- function(db, es, step)
{
  # get partial summaries, step by step
  for(i in seq(0, es$count, step)){
    esum <- entrez_summary(db = db, web_history = es$web_history, version = "2.0",
                           always_return_list = TRUE, retstart = i, retmax = step)
    # if i is not the first step, append
    if (i != 0){
      esum_t <- append(esum_t, esum)
    # if i is the first, just memorize
    }else{
      esum_t <- esum
    }
  }
  # reattribute the right class (esummary_list), lost with append()
  class(esum_t) <- c("esummary_list", "list")
  return(esum_t)
}

be seeing you, arnome.

dwinter commented 8 years ago

Cheers @arnome,

Good idea with re-adding the class. I will have to include a note about this in the vignette!
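
For anyone following along, append() builds its result with c(), which returns a plain list and drops custom classes; hence the fix above. A tiny illustration:

x <- structure(list(a = 1), class = c("esummary_list", "list"))
class(append(x, list(b = 2)))  # "list" -- the esummary_list class is gone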

dwinter commented 8 years ago

Hey @Monty9, the new master branch should handle HTTP error codes smoothly. Do you want to check it out?

gadepallivs commented 8 years ago

Hi David, sure, I would like to try that. Do I need to run any update?

dwinter commented 8 years ago

Hi @Monty9,

Yeah, the new release is on CRAN now, so any of a git pull and local install, devtools::install_github(), or install.packages() should work :)