ropensci / elastic

R client for the Elasticsearch HTTP API
https://docs.ropensci.org/elastic

Wrong result with example code for 'scroll' #255

Closed Aeilert closed 5 years ago

Aeilert commented 5 years ago

I don't think the example code for collecting all documents with scroll returns the correct result.

library(elastic)
es.con <- elastic::connect()
shakespeare <- system.file("examples", "shakespeare_data_.json", package = "elastic")
invisible(docs_bulk(es.con, shakespeare))
scroll_clear(es.con, all = TRUE)

Compare the current example code:

# Get all results - one approach is to use a while loop
scroll_clear(es.con, all = TRUE)
res <- Search(es.con, index = 'shakespeare', q="a*", time_scroll="5m",
              body = '{"sort": ["_doc"]}')
out <- res$hits$hits
hits <- 1
while(hits != 0){
  res <- scroll(es.con, res$`_scroll_id`, time_scroll="5m")
  hits <- length(res$hits$hits)
  if(hits > 0)
    out <- c(out, res$hits$hits)
}
res$hits$total
$value
[1] 2747

$relation
[1] "eq"

with:

# Clear previous scroll
scroll_clear(es.con, all = TRUE)

# Initiate scroll and keep the first page of hits
res <- Search(conn = es.con, index = 'shakespeare', time_scroll="5m")
out <- res$hits$hits
l <- length(out)

# Total number of documents 
n <- res$hits$total$value

# Loop through all results using the scroll method
while(l != n){
  tmp <- scroll(conn = es.con, res$`_scroll_id`)
  out <- append(out, tmp$hits$hits)
  l <- length(out)
}
res$hits$total
$value
[1] 5000

$relation
[1] "eq"

which matches the total number of documents:

count(es.con, 'shakespeare')
5000

Running R 3.5.3 and Elastic 7.0.0 w/ Docker.

Session Info

```r
> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] base64enc_0.1-3    elastic_1.0.0.9100

loaded via a namespace (and not attached):
 [1] httr_1.4.0      compiler_3.5.3  R6_2.4.0        tools_3.5.3     httpcode_0.2.0  curl_3.3        Rcpp_1.0.1      urltools_1.7.3  triebeard_0.3.0
[10] crul_0.7.4      jsonlite_1.6
```

sckott commented 5 years ago

thanks @Aeilert

I may be wrong here, but I think you are missing the fact that the example has a query statement, q="a*", which leads to fewer matches than the total number of documents in that index. Thoughts?

Aeilert commented 5 years ago

Yes, of course. You are completely right.
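
For anyone landing here later, a quick way to check the explanation above is to compare the count for the query with the count for the whole index. This is only a minimal sketch: it assumes a local Elasticsearch instance with the shakespeare index loaded as in the setup code at the top of this issue, and it assumes that `count()` forwards `q` as a query-string parameter the same way `Search()` does.

```r
library(elastic)

# Assumption: Elasticsearch is reachable with the default connection
# settings and the 'shakespeare' index has already been loaded.
es.con <- connect()

# Number of documents matching the query string used in the example code
n_query <- count(es.con, index = "shakespeare", q = "a*")

# Number of documents in the index when no query is given
n_all <- count(es.con, index = "shakespeare")

n_query
n_all
```

If the explanation holds, `n_query` should line up with the `hits$total$value` of the first scroll example and `n_all` with that of the second, so the two loops differ because of the query, not because of scroll.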