ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

`entrez_summary()` fails silently at > 500 results #106

Closed npjc closed 7 years ago

npjc commented 7 years ago
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
library(rentrez)
db <- "assembly"
term <- "Saccharomyces[ORGN]"
r <- entrez_search(db, term = term, retmax = 1) # just to get the count
r <- entrez_search(db, term = term, retmax = r$count) # return all ids
length(r$ids) == r$count # sanity check
#> [1] TRUE

getting summary of all results fails (silently?)

s <- entrez_summary(db, id = r$ids)
length(s) == length(r$ids) #uh oh 
#> [1] FALSE

getting summaries of the first 500 works as expected

s_first_500 <- entrez_summary(db, id = r$ids[1:500])
length(s_first_500) == 500 # so retrieving 500 summaries at once works...
#> [1] TRUE
head(s_first_500, 2) %>% str() # looks good
#> List of 2
#>  $ 1087661:List of 49
#>   ..$ uid                        : chr "1087661"
#>   ..$ rsuid                      : chr ""
#>   ..$ gbuid                      : chr "4436668"
#>   ..$ assemblyaccession          : chr "GCA_900178065.1"
#>   ..$ lastmajorreleaseaccession  : chr "GCA_900178065.1"
#>   ..$ chainid                    : chr "900178065"
#>   ..$ assemblyname               : chr "L711"
#>   ..$ ucscname                   : chr ""
#>   ..$ ensemblname                : chr ""
#>   ..$ taxid                      : chr "4932"
#>   ..$ organism                   : chr "Saccharomyces cerevisiae (baker's yeast)"
#>   ..$ speciestaxid               : chr "4932"
#>   ..$ speciesname                : chr "Saccharomyces cerevisiae"
#>   ..$ assemblytype               : chr "haploid"
#>   ..$ assemblyclass              : chr "haploid"
#>   ..$ assemblystatus             : chr "Scaffold"
#>   ..$ wgs                        : chr "FXLH01"
#>   ..$ gb_bioprojects             :'data.frame':  1 obs. of  2 variables:
#>   .. ..$ bioprojectaccn: chr "PRJEB8455"
#>   .. ..$ bioprojectid  : int 308667
#>   ..$ gb_projects                : chr "308667"
#>   ..$ rs_bioprojects             : list()
#>   ..$ rs_projects                : list()
#>   ..$ biosampleaccn              : chr "SAMEA3249812"
#>   ..$ biosampleid                : chr "4395280"
#>   ..$ biosource                  :List of 3
#>   .. ..$ infraspecieslist: list()
#>   .. ..$ sex             : chr ""
#>   .. ..$ isolate         : chr ""
#>   ..$ coverage                   : chr "50"
#>   ..$ partialgenomerepresentation: chr "false"
#>   ..$ primary                    : chr "4436658"
#>   ..$ assemblydescription        : chr ""
#>   ..$ releaselevel               : chr "Major"
#>   ..$ asmreleasedate_genbank     : chr "2017/05/03 00:00"
#>   ..$ asmreleasedate_refseq      : chr "1/01/01 00:00"
#>   ..$ seqreleasedate             : chr "2017/04/27 00:00"
#>   ..$ asmupdatedate              : chr "2017/05/03 00:00"
#>   ..$ submissiondate             : chr "2017/04/27 00:00"
#>   ..$ lastupdatedate             : chr "2017/05/03 00:00"
#>   ..$ submitterorganization      : chr "INRA"
#>   ..$ refseq_category            : chr "na"
#>   ..$ anomalouslist              : list()
#>   ..$ exclfromrefseq             : list()
#>   ..$ propertylist               : chr [1:4] "full-genome-representation" "latest" "latest_genbank" "wgs"
#>   ..$ fromtype                   : chr ""
#>   ..$ synonym                    :List of 3
#>   .. ..$ genbank   : chr "GCA_900178065.1"
#>   .. ..$ refseq    : chr ""
#>   .. ..$ similarity: chr ""
#>   ..$ ftppath_genbank            : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/178/065/GCA_900178065.1_L711"
#>   ..$ ftppath_refseq             : chr ""
#>   ..$ ftppath_assembly_rpt       : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/178/065/GCA_900178065.1_L711/GCA_900178065.1_L711_assembly_report.txt"
#>   ..$ ftppath_stats_rpt          : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/178/065/GCA_900178065.1_L711/GCA_900178065.1_L711_assembly_stats.txt"
#>   ..$ ftppath_regions_rpt        : chr ""
#>   ..$ sortorder                  : chr "5C99001780659899"
#>   ..$ meta                       : chr " &lt;Stats&gt; &lt;Stat category=\"alt_loci_count\" sequence_tag=\"all\"&gt;0&lt;/Stat&gt; &lt;Stat category=\""| __truncated__
#>   ..- attr(*, "class")= chr [1:2] "esummary" "list"
#>  $ 1082101:List of 49
#>   ..$ uid                        : chr "1082101"
#>   ..$ rsuid                      : chr ""
#>   ..$ gbuid                      : chr "4417938"
#>   ..$ assemblyaccession          : chr "GCA_900177905.1"
#>   ..$ lastmajorreleaseaccession  : chr "GCA_900177905.1"
#>   ..$ chainid                    : chr "900177905"
#>   ..$ assemblyname               : chr "L564"
#>   ..$ ucscname                   : chr ""
#>   ..$ ensemblname                : chr ""
#>   ..$ taxid                      : chr "4932"
#>   ..$ organism                   : chr "Saccharomyces cerevisiae (baker's yeast)"
#>   ..$ speciestaxid               : chr "4932"
#>   ..$ speciesname                : chr "Saccharomyces cerevisiae"
#>   ..$ assemblytype               : chr "haploid"
#>   ..$ assemblyclass              : chr "haploid"
#>   ..$ assemblystatus             : chr "Scaffold"
#>   ..$ wgs                        : chr "FXEF01"
#>   ..$ gb_bioprojects             :'data.frame':  1 obs. of  2 variables:
#>   .. ..$ bioprojectaccn: chr "PRJEB8455"
#>   .. ..$ bioprojectid  : int 308667
#>   ..$ gb_projects                : chr "308667"
#>   ..$ rs_bioprojects             : list()
#>   ..$ rs_projects                : list()
#>   ..$ biosampleaccn              : chr "SAMEA3249808"
#>   ..$ biosampleid                : chr "4395276"
#>   ..$ biosource                  :List of 3
#>   .. ..$ infraspecieslist: list()
#>   .. ..$ sex             : chr ""
#>   .. ..$ isolate         : chr ""
#>   ..$ coverage                   : chr "50"
#>   ..$ partialgenomerepresentation: chr "false"
#>   ..$ primary                    : chr "4417928"
#>   ..$ assemblydescription        : chr ""
#>   ..$ releaselevel               : chr "Major"
#>   ..$ asmreleasedate_genbank     : chr "2017/04/26 00:00"
#>   ..$ asmreleasedate_refseq      : chr "1/01/01 00:00"
#>   ..$ seqreleasedate             : chr "2017/04/25 00:00"
#>   ..$ asmupdatedate              : chr "2017/04/26 00:00"
#>   ..$ submissiondate             : chr "2017/04/25 00:00"
#>   ..$ lastupdatedate             : chr "2017/04/26 00:00"
#>   ..$ submitterorganization      : chr "INRA"
#>   ..$ refseq_category            : chr "na"
#>   ..$ anomalouslist              : list()
#>   ..$ exclfromrefseq             : list()
#>   ..$ propertylist               : chr [1:4] "full-genome-representation" "latest" "latest_genbank" "wgs"
#>   ..$ fromtype                   : chr ""
#>   ..$ synonym                    :List of 3
#>   .. ..$ genbank   : chr "GCA_900177905.1"
#>   .. ..$ refseq    : chr ""
#>   .. ..$ similarity: chr ""
#>   ..$ ftppath_genbank            : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/177/905/GCA_900177905.1_L564"
#>   ..$ ftppath_refseq             : chr ""
#>   ..$ ftppath_assembly_rpt       : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/177/905/GCA_900177905.1_L564/GCA_900177905.1_L564_assembly_report.txt"
#>   ..$ ftppath_stats_rpt          : chr "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/177/905/GCA_900177905.1_L564/GCA_900177905.1_L564_assembly_stats.txt"
#>   ..$ ftppath_regions_rpt        : chr ""
#>   ..$ sortorder                  : chr "5C99001779059899"
#>   ..$ meta                       : chr " &lt;Stats&gt; &lt;Stat category=\"alt_loci_count\" sequence_tag=\"all\"&gt;0&lt;/Stat&gt; &lt;Stat category=\""| __truncated__
#>   ..- attr(*, "class")= chr [1:2] "esummary" "list"

but try and get the first 501 results and then 💥

s_first_501 <- entrez_summary(db, id = r$ids[1:501])
length(s_first_501) == 501 # uh oh... so 500 limit somewhere?
#> [1] FALSE
head(s_first_501) %>% str() # empty list
#>  list()

# session info:
sessioninfo::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.4.0 (2017-04-21)
#>  os       macOS Sierra 10.12.4        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_CA.UTF-8                 
#>  tz       America/Vancouver           
#>  date     2017-05-14                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       source                             
#>  assertthat    0.2.0      2017-04-11 CRAN (R 3.4.0)                     
#>  backports     1.0.5      2017-01-18 CRAN (R 3.4.0)                     
#>  broom         0.4.2      2017-02-13 CRAN (R 3.4.0)                     
#>  cellranger    1.1.0      2016-07-27 CRAN (R 3.4.0)                     
#>  clisymbols    1.1.0      2017-01-27 cran (@1.1.0)                      
#>  colorspace    1.3-2      2016-12-14 CRAN (R 3.4.0)                     
#>  curl          2.6        2017-04-27 CRAN (R 3.4.0)                     
#>  DBI           0.6-1      2017-04-01 CRAN (R 3.4.0)                     
#>  digest        0.6.12     2017-01-27 CRAN (R 3.4.0)                     
#>  dplyr       * 0.5.0      2016-06-24 CRAN (R 3.4.0)                     
#>  emo           0.0.0.9000 2017-05-14 Github (hadley/emo@4be1aa3)        
#>  evaluate      0.10       2016-10-11 CRAN (R 3.4.0)                     
#>  forcats       0.2.0      2017-01-23 CRAN (R 3.4.0)                     
#>  foreign       0.8-68     2017-04-24 CRAN (R 3.4.0)                     
#>  ggplot2     * 2.2.1      2016-12-30 CRAN (R 3.4.0)                     
#>  gtable        0.2.0      2016-02-26 CRAN (R 3.4.0)                     
#>  haven         1.0.0      2016-09-23 CRAN (R 3.4.0)                     
#>  hms           0.3        2016-11-22 CRAN (R 3.4.0)                     
#>  htmltools     0.3.6      2017-04-28 CRAN (R 3.4.0)                     
#>  httr          1.2.1      2016-07-03 CRAN (R 3.4.0)                     
#>  jsonlite      1.4        2017-04-08 CRAN (R 3.4.0)                     
#>  knitr         1.15.20    2017-05-02 Github (yihui/knitr@f3a490b)       
#>  lattice       0.20-35    2017-03-25 CRAN (R 3.4.0)                     
#>  lazyeval      0.2.0      2016-06-12 CRAN (R 3.4.0)                     
#>  lubridate     1.6.0      2016-09-13 CRAN (R 3.4.0)                     
#>  magrittr      1.5        2014-11-22 CRAN (R 3.4.0)                     
#>  mnormt        1.5-5      2016-10-15 CRAN (R 3.4.0)                     
#>  modelr        0.1.0      2016-08-31 CRAN (R 3.4.0)                     
#>  munsell       0.4.3      2016-02-13 CRAN (R 3.4.0)                     
#>  nlme          3.1-131    2017-02-06 CRAN (R 3.4.0)                     
#>  plyr          1.8.4      2016-06-08 CRAN (R 3.4.0)                     
#>  psych         1.7.3.21   2017-03-22 CRAN (R 3.4.0)                     
#>  purrr       * 0.2.2      2016-06-18 CRAN (R 3.4.0)                     
#>  R6            2.2.0      2016-10-05 CRAN (R 3.4.0)                     
#>  Rcpp          0.12.10    2017-03-19 CRAN (R 3.4.0)                     
#>  readr       * 1.1.0      2017-03-22 CRAN (R 3.4.0)                     
#>  readxl        1.0.0      2017-04-18 CRAN (R 3.4.0)                     
#>  rentrez     * 1.0.4      2016-10-26 CRAN (R 3.4.0)                     
#>  reshape2      1.4.2      2016-10-22 CRAN (R 3.4.0)                     
#>  rmarkdown     1.5        2017-04-26 CRAN (R 3.4.0)                     
#>  rprojroot     1.2        2017-01-16 CRAN (R 3.4.0)                     
#>  rvest         0.3.2      2016-06-17 CRAN (R 3.4.0)                     
#>  scales        0.4.1      2016-11-09 CRAN (R 3.4.0)                     
#>  sessioninfo   0.0.0.9000 2017-04-26 Github (r-pkgs/sessioninfo@0a5b58f)
#>  stringi       1.1.5      2017-04-07 CRAN (R 3.4.0)                     
#>  stringr       1.2.0      2017-02-18 CRAN (R 3.4.0)                     
#>  tibble      * 1.3.0      2017-04-01 CRAN (R 3.4.0)                     
#>  tidyr       * 0.6.1      2017-01-10 CRAN (R 3.4.0)                     
#>  tidyverse   * 1.1.1      2017-01-27 CRAN (R 3.4.0)                     
#>  withr         1.0.2      2016-06-20 CRAN (R 3.4.0)                     
#>  XML           3.98-1.6   2017-03-30 CRAN (R 3.4.0)                     
#>  xml2          1.1.1      2017-01-24 CRAN (R 3.4.0)                     
#>  yaml          2.1.14     2016-11-12 CRAN (R 3.4.0)
npjc commented 7 years ago

not sure but perhaps related to #105

dwinter commented 7 years ago

Hi @npjc, thanks for your detailed bug report. It's really helpful to have all this information.

Looks like this is indeed the same problem as #105: NCBI is giving an error for JSON requests > 500 records and rentrez is failing to pass it on. If you want to fetch more than 500 in one go, check out the records you get with version = "1.0" (note they will be slightly different from the version 2.0/JSON records).

Failing that, you will need to batch up the IDs into lots of 500, then stitch the results you are interested in back together.
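The batching approach could be sketched roughly as follows. This is a minimal sketch, not code from rentrez: `chunk_ids()` and `summary_in_batches()` are hypothetical helper names, and the `fetch` argument is an assumption added only so the batching logic can be tried without touching NCBI. It relies on `entrez_summary()` accepting `always_return_list = TRUE`, so that a final batch containing a single ID still comes back as a list.

```r
# Hypothetical helper: split a vector of IDs into chunks of at most `size`.
chunk_ids <- function(ids, size = 500) {
  if (length(ids) == 0) return(list())
  split(ids, ceiling(seq_along(ids) / size))
}

# Hypothetical helper: fetch summaries one chunk at a time and stitch the
# per-chunk results back into one list named by ID.
# `fetch` defaults to rentrez::entrez_summary; it is a parameter only so the
# logic can be tested with a stand-in function (an assumption of this sketch).
summary_in_batches <- function(db, ids, size = 500,
                               fetch = rentrez::entrez_summary) {
  chunks <- chunk_ids(ids, size)
  if (length(chunks) == 0) return(list())
  parts <- lapply(chunks, function(chunk) {
    fetch(db = db, id = chunk, always_return_list = TRUE)
  })
  # Drop the chunk labels so the combined list keeps only the record IDs as names.
  do.call(c, unname(parts))
}
```

With the objects from the report above, `s <- summary_in_batches(db, r$ids)` should then give one entry per ID. The result is a plain list rather than an `esummary_list`, so rentrez helpers that dispatch on that class may need the class restored.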

I will leave this issue open until I have a useful error message in cases like this.
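Until such an error message exists, a check on the caller's side can at least make the failure loud. The following is only a sketch of that idea: `check_summary_count()` is a hypothetical helper, not part of rentrez, and it assumes the summaries were requested for more than one ID (or with `always_return_list = TRUE`), so that the length of the result counts records rather than fields.

```r
# Hypothetical guard: stop with an informative error when fewer summaries
# come back than IDs were requested, instead of carrying on with an empty list.
check_summary_count <- function(summaries, ids) {
  n_got <- length(summaries)
  n_wanted <- length(ids)
  if (n_got != n_wanted) {
    stop(
      sprintf(
        "Requested %d summaries but received %d; try batches of 500 IDs or fewer.",
        n_wanted, n_got
      ),
      call. = FALSE
    )
  }
  # Return the input invisibly so the check can sit at the end of a pipeline.
  invisible(summaries)
}
```

Used as `s <- check_summary_count(entrez_summary(db, id = r$ids), r$ids)`, the silent empty list from the report would surface as an error at the call site.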

sckott commented 7 years ago

Was about to open an issue - getting the same problem using entrez_summary - looking forward to the HTTP error message

dwinter commented 7 years ago

Sorry guys, on the TODO list but the TODO list is long at the moment!

dwinter commented 7 years ago

OK, just pushed some changes to the develop branch that should take care of these problems. Will merge to master and get it on CRAN in the next few days.