ropensci / elastic

R client for the Elasticsearch HTTP API
https://docs.ropensci.org/elastic

Missing data with `docs_bulk*` on new index #271

Closed regisoc closed 4 years ago

regisoc commented 4 years ago

Hi,

I am using ES version 6.8.3, via docker (docker.elastic.co/elasticsearch/elasticsearch:6.8.3)

I tried to push tidyverse datasets (storms here) to test several cases, and this happened:


c <- connect(...)
s <- "storms"
elastic::create(c, s)
# $acknowledged
# [1] TRUE

# $shards_acknowledged
# [1] TRUE

# $index
# [1] "storms"

elastic::docs_bulk_index(c, storms, s)
# lots of things...

r <- elastic::Search(c, s)

r$hits$total == dim(storms)[1]
# FALSE

r$hits$total
# [1] 10006

dim(storms)[1]
# [1] 10010

If the index is deleted and then recreated (with index_delete followed by index_create, or with index_recreate), the number of records registered in ES (r$hits$total) is not stable, and I never get all 10010 records.

But I think the mapping update during docs_bulk is involved somehow (logs below), because when I do not recreate the index, no data is missing.

...
[2020-01-10T17:36:34,371][WARN ][o.e.d.c.m.MetaDataCreateIndexService] [KiPi1Td] the default number of shards will change from [5] to [1] in 7.0.0; if you wish to continue using the default of [5] shards, you must manage this on the create index request or with an index template
[2020-01-10T17:36:34,374][INFO ][o.e.c.m.MetaDataCreateIndexService] [KiPi1Td] [storms] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings []
[2020-01-10T17:36:37,171][INFO ][o.e.c.m.MetaDataMappingService] [KiPi1Td] [storms/kf_TeuAfTzmmLDZ1JCtAHA] create_mapping [storms]
[2020-01-10T17:36:39,015][INFO ][o.e.c.m.MetaDataMappingService] [KiPi1Td] [storms/kf_TeuAfTzmmLDZ1JCtAHA] update_mapping [storms]
...

Can you reproduce?

Session Info

```r
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] jsonlite_1.6 elastic_1.0.0 data.table_1.12.6 forcats_0.4.0 stringr_1.4.0
 [6] dplyr_0.8.3 purrr_0.3.3 readr_1.3.1 tidyr_1.0.0 tibble_2.1.3
[11] ggplot2_3.2.1 tidyverse_1.2.1 R6_2.4.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2 cellranger_1.1.0 pillar_1.4.2 compiler_3.6.1 tools_3.6.1 zeallot_0.1.0
 [7] lubridate_1.7.4 lifecycle_0.1.0 nlme_3.1-140 gtable_0.3.0 lattice_0.20-38 pkgconfig_2.0.3
[13] rlang_0.4.1 cli_1.1.0 rstudioapi_0.10 curl_4.2 crul_0.8.4 haven_2.1.1
[19] withr_2.1.2 xml2_1.2.2 httr_1.4.1 generics_0.0.2 vctrs_0.2.0 hms_0.5.2
[25] triebeard_0.3.0 grid_3.6.1 tidyselect_0.2.5 httpcode_0.2.0 glue_1.3.1 readxl_1.3.1
[31] modelr_0.1.5 magrittr_1.5 urltools_1.7.3 backports_1.1.5 scales_1.0.0 rvest_0.3.4
[37] assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0 broom_0.5.2
[43] crayon_1.3.4
```

sckott commented 4 years ago

Thanks @regisoc for the issue.

create is not an exported or even an internal function; I assume you mean index_create?

Part of the discrepancy may be due to not waiting until the data are "available".

library(elastic)
library(tidyverse)
x <- connect()
s <- "storms"

For example, compare these two:

index_create(x, s)
invisible(elastic::docs_bulk_index(x, storms, s, s))
r <- elastic::Search(x, s)
r$hits$total == dim(storms)[1]

vs.

index_recreate(x, s)
invisible(elastic::docs_bulk_index(x, storms, s, s))
Sys.sleep(2)
rr <- elastic::Search(x, s)
rr$hits$total == dim(storms)[1]

But looking at logs I'm seeing that the indexing is running into errors, e.g., mapper [hu_diameter] cannot be changed from type [long] to [float] - I imagine setting the mapping up front when you create the index will fix that.
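
Rejected documents can also be spotted from R rather than from the server logs. The following is a minimal sketch, assuming the value returned by docs_bulk_index is a list of parsed Bulk API responses, one per chunk, each with an `errors` flag and an `items` list. The helper name `bulk_failures` and the `fake` response are hypothetical and only illustrate the shape.

```r
# Hypothetical helper: collect per-document failures from parsed Bulk API
# responses (assumed here to be a list with one element per chunk).
bulk_failures <- function(responses) {
  out <- list()
  for (resp in responses) {
    # Skip chunks that report no errors at all.
    if (!isTRUE(resp$errors)) next
    for (item in resp$items) {
      # Each item has a single key naming the action ("index", "create", ...).
      action <- item[[1]]
      if (!is.null(action$error)) {
        out[[length(out) + 1]] <- data.frame(
          status = action$status,
          type = action$error$type,
          reason = action$error$reason,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(out) == 0) {
    return(data.frame(status = integer(), type = character(),
                      reason = character(), stringsAsFactors = FALSE))
  }
  do.call(rbind, out)
}

# Made-up responses for illustration only (not real output).
fake <- list(
  list(errors = FALSE, items = list(list(index = list(status = 201L)))),
  list(errors = TRUE, items = list(
    list(index = list(status = 201L)),
    list(index = list(status = 400L, error = list(
      type = "illegal_argument_exception",
      reason = "mapper [hu_diameter] cannot be changed from type [long] to [float]"
    )))
  ))
)
failures <- bulk_failures(fake)
print(failures)

# With a live cluster (not run here; return shape is an assumption):
# res <- docs_bulk_index(x, storms, s, s)
# bulk_failures(res)
```

If the number of rows in `failures` matches the number of missing documents, the gap is caused by rejected documents rather than by timing.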

FWIW, i'm not having this problem in Elasticsearch 7.5.1

regisoc commented 4 years ago

Thanks for your quick comment. Yes, it was index_create, and yes, I am getting the same results in the logs. I was hoping ES would deduce the right mapping on its own, but it has some difficulty doing so. Pushing an explicit mapping resolves this. BTW, when automating things, it seems we effectively need a Sys.sleep() of at least one second by default between each step (create the index, give it a mapping, push the data). Resolved.
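
For anyone landing here later, an explicit mapping can be derived from the data frame's column classes. This is only a sketch under some assumptions: the helper names `es_type` and `build_mapping` are hypothetical, the class-to-type table is one reasonable choice rather than anything the package prescribes, and the mapping is nested under a type name because ES 6.x expects one (7.x drops that level). The toy `df` stands in for storms.

```r
# Hypothetical helper: pick an Elasticsearch field type from an R column.
es_type <- function(col) {
  if (is.integer(col)) {
    "long"
  } else if (is.numeric(col)) {
    "double"   # doubles are declared up front so ES never guesses "long"
  } else if (is.logical(col)) {
    "boolean"
  } else {
    "keyword"  # characters and factors
  }
}

# Hypothetical helper: build an ES 6.x style request body as a nested list.
build_mapping <- function(df, type_name) {
  props <- lapply(df, function(col) list(type = es_type(col)))
  mappings <- list()
  mappings[[type_name]] <- list(properties = props)
  list(mappings = mappings)
}

# Toy data standing in for storms (values are made up).
df <- data.frame(
  name = c("Amy", "Bob"),
  wind = c(25L, 30L),
  hu_diameter = c(0, 34.5),
  stringsAsFactors = FALSE
)
body <- build_mapping(df, "storms")
str(body)

# With a live cluster (not run here; assumes index_create accepts a list body):
# index_create(x, "storms", body = body)
# invisible(docs_bulk_index(x, storms, "storms", "storms"))
```

Declaring hu_diameter as a floating-point type before the first bulk request avoids the case where one chunk contains only whole numbers and the field is first mapped as long.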

sckott commented 4 years ago

It's possible there are some configuration options in your Elasticsearch instance for how soon data becomes available; I don't know.
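
For context, Elasticsearch makes newly indexed documents visible to search only after a refresh, and the index.refresh_interval setting controls how often that happens (1s by default), which fits the one-second sleep mentioned above. Instead of a fixed Sys.sleep(), a script can poll until the expected count shows up. A minimal sketch, where `wait_until` is a hypothetical helper and the counter below merely simulates a cluster that becomes ready on the third check:

```r
# Hypothetical helper: call check() until it returns TRUE or the timeout
# (in seconds) expires. Returns TRUE on success, FALSE on timeout.
wait_until <- function(check, timeout = 10, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(check())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Simulated readiness check: succeeds on the third call.
calls <- 0
ready <- wait_until(function() {
  calls <<- calls + 1
  calls >= 3
}, timeout = 5, interval = 0.01)
print(ready)

# With a live ES 6.x cluster (not run here):
# wait_until(function() Search(x, s, size = 0)$hits$total == nrow(storms))
```

Polling returns as soon as the data is searchable and fails loudly (returns FALSE) when documents were rejected, whereas a fixed sleep hides both cases.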