ropensci / elastic

R client for the Elasticsearch HTTP API
https://docs.ropensci.org/elastic

Wrong results when trying to apply elastic to different versions of Elasticsearch #267

Closed regisoc closed 4 years ago

regisoc commented 4 years ago

Hi,

Before starting, you should know that I'm new to R and its ecosystem. Still, I'd rather take the risk of saying something stupid. Correct me if I'm wrong.

I am not sure that elastic is backward compatible as you claim. I have built 2 examples to show my point.

The first case is direct: I don't have the same results between ES versions, following the (reference).

Here, ES 5.6.16 is wrong

```r
> connexion <- connect(host = "elasticsearch", user = "elastic", pwd = "changeme")
> elastic::index_get(connexion)$version$number
[1] "5.6.16"
> elastic::index_create(connexion, index = "wow")
$acknowledged
[1] TRUE

$shards_acknowledged
[1] TRUE

$index
[1] "wow"

> elastic::index_exists(connexion, index = "wow")
[1] FALSE # <------- ????
> elastic::index_delete(connexion, index = "wow")
http://elasticsearch:9200/wow
$acknowledged
[1] TRUE
```

Seems right in ES 6.8.3

```r
> connexion <- connect(host = "elasticsearch", user = "elastic", pwd = "changeme")
> elastic::index_get(connexion)$version$number
[1] "6.8.3"
> elastic::index_create(connexion, index = "wow")
$acknowledged
[1] TRUE

$shards_acknowledged
[1] TRUE

$index
[1] "wow"

> elastic::index_exists(connexion, index = "wow")
[1] TRUE # <------- ok
> elastic::index_delete(connexion, index = "wow")
http://elasticsearch:9200/wow
$acknowledged
[1] TRUE
```

The last version (ES 7.4.0) seems also right

```r
> connexion <- connect(host = "elasticsearch", user = "elastic", pwd = "changeme")
> elastic::index_get(connexion)$version$number
[1] "7.4.0"
> elastic::index_create(connexion, index = "wow")
$acknowledged
[1] TRUE

$shards_acknowledged
[1] TRUE

$index
[1] "wow"

> elastic::index_exists(connexion, index = "wow")
[1] TRUE # <------- ok
> elastic::index_delete(connexion, index = "wow")
http://elasticsearch:9200/wow
$acknowledged
[1] TRUE
```

I also double checked with curl.


The second case is a bit more problematic.

Again, it seems to work with the latest version (here, ES 7.4.0), but I have bigger issues with ES 5.6.16 and ES 6.8.3: not all of the data were indexed by the docs_bulk method, meaning some data were lost.

I tried to apply the following script to test that.

Application with the `mpg` dataset (included in `tidyverse` lib = static 234 lines, 11 cols)

```r
library(tidyverse)
library(elastic)

# init
connexion <- connect(host = "elasticsearch", user = "elastic", pwd = "changeme")
print(elastic::index_get(connexion)$version$number)
index_name = "mpg"
data <- mpg
max_test <- 20
res <- list()

# nb of records/observations given to ES that we want to retrieve
expected_obs <- dim(data)[1]

# progress bar
pb <- txtProgressBar(min = 0, max = max_test, initial = 0, style = 3)
pbsum <- 0

# delete
# elastic::index_delete(connexion, index_name, verbose = F, wait_for_completion = T)

# push method
exe <- function(){
  # init
  elastic::index_delete(connexion, index_name, verbose = F, wait_for_completion = T)
  # push
  invisible(elastic::docs_bulk(connexion, data, index = index_name, quiet = T, wait_for_completion = T))
  # **Near** real time
  Sys.sleep(1)
  # get count
  # elastic::Search(connexion, index = index_name, size = 1)$hits$total
  elastic::Search(connexion, index = index_name, size = 1)$hits$total == expected_obs
}

# test
for(i in 1:max_test){
  res[i] <- exe()
  pbsum <- pbsum + 1
  setTxtProgressBar(pb, pbsum)
}
close(pb)
```

Sending the same data (mpg) over and over again, we should get the same result (here, should end with 20 TRUE values inside res).

For ES 7.4.0, the result is as expected:

> unlist(res, use.names = FALSE) %>% summary() %>% print()
   Mode   FALSE    TRUE 
logical      0       20 

For ES 6.8.3, the result is not as expected:

> unlist(res, use.names = FALSE) %>% summary() %>% print()
   Mode   FALSE    TRUE 
logical      16       4 

For ES 5.6.16, the result is not as expected:

> unlist(res, use.names = FALSE) %>% summary() %>% print()
   Mode   FALSE    TRUE 
logical      17       3 

Can you reproduce? Did I miss something?

Limitation: I tested only index_exists() and docs_bulk(), not other functions.

Session Info

```r
> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       Debian GNU/Linux 9 (stretch)
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Etc/UTC
 date     2019-11-04

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date       lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.1)
 backports     1.1.4   2019-04-10 [1] CRAN (R 3.6.1)
 broom         0.5.2   2019-04-07 [1] CRAN (R 3.6.1)
 callr         3.3.2   2019-09-22 [1] CRAN (R 3.6.1)
 cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.6.1)
 cli           1.1.0   2019-03-19 [1] CRAN (R 3.6.1)
 colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.1)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.1)
 crul          0.8.4   2019-08-02 [1] CRAN (R 3.6.1)
 curl          4.2     2019-09-24 [1] CRAN (R 3.6.1)
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.1)
 devtools      2.2.1   2019-09-24 [1] CRAN (R 3.6.1)
 digest        0.6.21  2019-09-20 [1] CRAN (R 3.6.1)
 dplyr       * 0.8.3   2019-07-04 [1] CRAN (R 3.6.1)
 elastic     * 1.0.0   2019-04-11 [1] CRAN (R 3.6.1)
 ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.1)
 forcats     * 0.4.0   2019-02-17 [1] CRAN (R 3.6.1)
 fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.1)
 generics      0.0.2   2018-11-29 [1] CRAN (R 3.6.1)
 ggplot2     * 3.2.1   2019-08-10 [1] CRAN (R 3.6.1)
 glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.1)
 gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.1)
 haven         2.1.1   2019-07-04 [1] CRAN (R 3.6.1)
 hms           0.5.1   2019-08-23 [1] CRAN (R 3.6.1)
 httpcode      0.2.0   2016-11-14 [1] CRAN (R 3.6.1)
 httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.1)
 jsonlite      1.6     2018-12-07 [1] CRAN (R 3.6.1)
 lattice       0.20-38 2018-11-04 [2] CRAN (R 3.6.1)
 lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.6.1)
 lifecycle     0.1.0   2019-08-01 [1] CRAN (R 3.6.1)
 lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.6.1)
 magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.1)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.1)
 modelr        0.1.5   2019-08-08 [1] CRAN (R 3.6.1)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.1)
 nlme          3.1-140 2019-05-12 [2] CRAN (R 3.6.1)
 pillar        1.4.2   2019-06-29 [1] CRAN (R 3.6.1)
 pkgbuild      1.0.5   2019-08-26 [1] CRAN (R 3.6.1)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.1)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.1)
 prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.6.1)
 processx      3.4.1   2019-07-18 [1] CRAN (R 3.6.1)
 ps            1.3.0   2018-12-21 [1] CRAN (R 3.6.1)
 purrr       * 0.3.2   2019-03-15 [1] CRAN (R 3.6.1)
 R6            2.4.0   2019-02-14 [1] CRAN (R 3.6.1)
 Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.6.1)
 readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.6.1)
 readxl        1.3.1   2019-03-13 [1] CRAN (R 3.6.1)
 remotes       2.1.0   2019-06-24 [1] CRAN (R 3.6.1)
 rlang         0.4.0   2019-06-25 [1] CRAN (R 3.6.1)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.1)
 rstudioapi    0.10    2019-03-19 [1] CRAN (R 3.6.1)
 rvest         0.3.4   2019-05-15 [1] CRAN (R 3.6.1)
 scales        1.0.0   2018-08-09 [1] CRAN (R 3.6.1)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.1)
 stringi       1.4.3   2019-03-12 [1] CRAN (R 3.6.1)
 stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.6.1)
 testthat      2.2.1   2019-07-25 [1] CRAN (R 3.6.1)
 tibble      * 2.1.3   2019-06-06 [1] CRAN (R 3.6.1)
 tidyr       * 1.0.0   2019-09-11 [1] CRAN (R 3.6.1)
 tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.6.1)
 tidyverse   * 1.2.1   2017-11-14 [1] CRAN (R 3.6.1)
 triebeard     0.3.0   2016-08-04 [1] CRAN (R 3.6.1)
 urltools      1.7.3   2019-04-14 [1] CRAN (R 3.6.1)
 usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.1)
 vctrs         0.2.0   2019-07-05 [1] CRAN (R 3.6.1)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.1)
 xml2          1.2.2   2019-08-09 [1] CRAN (R 3.6.1)
 zeallot       0.1.0   2018-01-28 [1] CRAN (R 3.6.1)

[1] /usr/local/lib/R/site-library
[2] /usr/local/lib/R/library
```

regisoc commented 4 years ago

To complete: here is the docker-compose.yml to switch between all versions.

version: '3.7'
services:
  elasticsearch:
    container_name: elasticsearch
    # choose one
    # image: docker.elastic.co/elasticsearch/elasticsearch:5.6.16
    image: docker.elastic.co/elasticsearch/elasticsearch:6.8.3
    # image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - 9200:9200
      - 9300:9300
    volumes:
      - type: bind
        source: ./elasticsearch.yml
        target: /usr/share/elasticsearch/config/elasticsearch.yml
        read_only: true
    networks:
      - esr

  rstudio:
    container_name: rstudio
    image: roncar/rstudio-elastic:1.0.0
    environment:
      - PASSWORD=rstudiopwd
      - USERID=1000
    ports:
      - 8787:8787
    networks:
      - esr

networks:
  esr:
sckott commented 4 years ago

Thanks for opening the issue @regisoc

For the first issue about index_create/index_exists, I could not replicate the problem with a local version of Elasticsearch running on my Mac, but I COULD replicate it using Docker with a compose file similar to yours. It looks like it's coming from the underlying HTTP client, crul: the HEAD request in index_exists isn't passing along the credentials. Fixing that now ...

will address the 2nd one after the first one is fixed
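
To illustrate why a HEAD request without credentials can show up as `FALSE`, here is a minimal sketch. It assumes (not confirmed in this thread) that the secured cluster answers the unauthenticated HEAD with 401 and that any non-200 status was being reported as "does not exist". `index_exists_from_status` is a hypothetical helper, not part of elastic or crul.

```r
# Sketch: map the HTTP status of `HEAD /<index>` to an existence answer.
# `index_exists_from_status` is a hypothetical helper for illustration only.
index_exists_from_status <- function(status) {
  if (status == 200L) return(TRUE)   # index exists
  if (status == 404L) return(FALSE)  # index is missing
  if (status %in% c(401L, 403L)) {
    # rejected request: we cannot tell whether the index exists
    stop("HEAD request was rejected (HTTP ", status, "): check credentials",
         call. = FALSE)
  }
  stop("unexpected HTTP status: ", status, call. = FALSE)
}

index_exists_from_status(200L)
index_exists_from_status(404L)
# A 401 raises an error instead of being silently reported as FALSE
tryCatch(index_exists_from_status(401L),
         error = function(e) conditionMessage(e))
```

Separating "rejected" from "missing" this way would have surfaced the authentication failure directly instead of a misleading `FALSE`.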

sckott commented 4 years ago

For the 2nd problem with docs bulk, I was able to replicate the problem, both with local ES and in docker, and with the same versions that you had a problem with.

However, I was also able to replicate the problem using curl on the command line, completely outside of R. So it's probably not a problem with this package, but more likely an issue with the older versions of Elasticsearch, OR possibly a problem with the way we're constructing the nd-json files.

Unfortunately, Elasticsearch doesn't give us the failed lines of the nd-json that didn't get created, so it's hard to track down why this is happening.
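
One possible way to narrow this down is sketched below. It assumes the parsed `_bulk` response follows the documented shape: a top-level `errors` flag plus an `items` list, where each entry is keyed by its action name and carries a `status` and, on failure, an `error` object. I have not verified this against the affected versions. `bulk_failures` is a hypothetical helper and the mocked response uses illustrative values only.

```r
# Sketch: list the failed items of one parsed _bulk response.
# `bulk_failures` is a hypothetical helper; `resp` is assumed to be a list
# with `errors` (logical) and `items` (one entry per submitted document).
bulk_failures <- function(resp) {
  empty <- data.frame(position = integer(0), status = integer(0),
                      type = character(0), reason = character(0),
                      stringsAsFactors = FALSE)
  if (!isTRUE(resp$errors)) return(empty)
  or_na <- function(x, na) if (is.null(x)) na else x
  rows <- lapply(seq_along(resp$items), function(i) {
    # each item is keyed by its action name ("index", "create", ...)
    item <- resp$items[[i]][[1]]
    err <- item$error
    if (is.null(err)) return(NULL)
    # tolerate an error given as a plain string rather than an object
    if (!is.list(err)) err <- list(reason = as.character(err))
    data.frame(position = i,
               status = as.integer(or_na(item$status, NA_integer_)),
               type = as.character(or_na(err$type, NA_character_)),
               reason = as.character(or_na(err$reason, NA_character_)),
               stringsAsFactors = FALSE)
  })
  rows <- Filter(Negate(is.null), rows)
  if (length(rows) == 0) return(empty)
  do.call(rbind, rows)
}

# Mocked response with one success and one failure (illustrative values only)
resp <- list(
  errors = TRUE,
  items = list(
    list(index = list(`_id` = "1", status = 201L)),
    list(index = list(`_id` = "2", status = 400L,
                      error = list(type = "mapper_parsing_exception",
                                   reason = "failed to parse [displ]")))
  )
)
bulk_failures(resp)
```

If `docs_bulk` returns one parsed response per chunk, something like `lapply(out, bulk_failures)` would list the failures chunk by chunk, with `position` counted within each chunk.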

sckott commented 4 years ago

The auth problem with index_exists has been fixed. Install the dev version with `remotes::install_github("ropensci/elastic")`, which should also install the dev version of crul with the fix.

The other problem with docs bulk: I think it's down to mappings. If you don't set a mapping for your index, ES tries to guess the field types, and sometimes a later document has a value that conflicts with the initial type that ES set, and then indexing that document fails. I think the fix when this happens is to set the mapping explicitly. I think you only have to do so for the problematic fields, but you could for all of them anyway, e.g.:

library(elastic)
library(ggplot2)  # provides the mpg dataset
zz <- connect(user = "elastic", pwd = "changeme", errors = "complete")
body <- '{
 "mappings": {
   "mpg": {
     "properties": {
       "displ" : {"type" : "float"}
      }
   }
 }
}'
index_create(zz, index='mpg', body=body)
out <- docs_bulk(zz, mpg, index = 'mpg')
out[[1]]$errors
Sys.sleep(1)
elastic::count(zz, "mpg")
index_delete(zz, index = 'mpg', verbose = FALSE)

The above works over and over again, so I think setting the index mapping is the fix for ES versions where you get intermittent failures with docs bulk.
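
A rough sketch of how such a mapping body could be generated follows. It rests on two assumptions. First, that the problematic fields are double columns mixing whole and fractional values (a plausible trigger, not confirmed here). Second, that ES 5.x/6.x expect the properties nested under a type name while ES 7.x by default expects no type level. Both helpers are hypothetical and the toy data only stands in for `mpg`.

```r
# Sketch: find numeric columns whose values mix whole and fractional numbers,
# then build an explicit float mapping for them. Both helpers are hypothetical.
mixed_numeric_cols <- function(df) {
  is_mixed <- vapply(df, function(col) {
    if (!is.double(col)) return(FALSE)
    col <- col[!is.na(col)]
    whole <- col == round(col)
    any(whole) && any(!whole)
  }, logical(1))
  names(df)[is_mixed]
}

float_mapping_body <- function(fields, es_major, type = NULL) {
  props <- paste(sprintf('"%s": {"type": "float"}', fields), collapse = ", ")
  props <- sprintf('{"properties": {%s}}', props)
  if (es_major >= 7) {
    # ES 7.x: no type level by default
    sprintf('{"mappings": %s}', props)
  } else {
    # ES 5.x/6.x: properties are nested under the type name
    stopifnot(!is.null(type))
    sprintf('{"mappings": {"%s": %s}}', type, props)
  }
}

# Toy data standing in for mpg (values are illustrative)
toy <- data.frame(displ = c(1.8, 2, 2.8), cyl = c(4L, 4L, 6L),
                  hwy = c(29, 29, 26), stringsAsFactors = FALSE)
fields <- mixed_numeric_cols(toy)
fields
float_mapping_body(fields, es_major = 6, type = "mpg")
float_mapping_body(fields, es_major = 7)
```

The resulting string could be passed as `body` to `index_create`, as in the example above.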

regisoc commented 4 years ago

Ok, thanks. I will update and try it soon.

sckott commented 4 years ago

assuming this is fixed