ropensci / elastic

R client for the Elasticsearch HTTP API
https://docs.ropensci.org/elastic

how to access stored field and scroll_id #299

Open ale-ful opened 1 year ago

ale-ful commented 1 year ago

Hello, I am having difficulty retrieving a stored field and the scroll_id.

Stored field: The field is called "text", and in Kibana I can see that it is present in the index "document-000002". When I pass "text" as the value of the stored_fields parameter, the field is not returned; the resulting list contains only "_index", "_type", "_id", "_score" and "_source" (first two lines of the code below). When I tried the third line, which uses the source parameter, the "_source" element was an empty list.

An exemplary record from ES, accessed via Kibana:

{
  "_index": "document-000002",
  "_type": "_doc",
  "_id": "AS_63689606",
  "_version": 1,
  "_score": 1,
  "_source": {
    "visitid": "65_63209606",
    "processingdate": "2022-08-24 17:24-0400",
    "gender": "male",
    "facility": "40998",
    "user": "JOHNDOE",
    "customer": "656"
  },
  "fields": {
    "processingdate": [
      "2022-08-24T21:24:00.000Z"
    ],
    "servicedate": [
      "2022-08-22T22:05:00.000Z"
    ],
    "text": [
      "an exemplary text I want to pull"
    ]
  }
}

Tried code:

library(elastic)
docs <- Search(c, "document-000002", size = 8, stored_fields = "text")$hits$hits
docs <- Search(c, "document-000002", size = 8, stored_fields = c("text", "servicedate"))$hits$hits
docs <- Search(c, "document-000002", size = 8, source = "text")$hits$hits

scroll_id: I would like to use the scroll parameter to pull more than the default limit of 10K documents from the same index. I believe this should be possible, given the following:

all_docs <- Search(conn = c, index = "document-000002")
all_docs$hits$total$value
all_docs$`_scroll_id`

The total number of hits is more than 8 million. However, the scroll ID is always NULL.

I would appreciate any help.

ES version in use: 7.3.1
Elastic package version in use: 1.2.0

cphaarmeyer commented 1 year ago

Did you try to set the time_scroll parameter of Search()? See also https://docs.ropensci.org/elastic/articles/search.html#scrolling-search---instead-of-paging

ale-ful commented 1 year ago

@cphaarmeyer thank you! Specifying the time_scroll parameter of Search() was sufficient to access _scroll_id.

Therefore, the working code looks like this:

all_docs <- Search(conn = c, index = "document-000002", time_scroll = "1m")
all_docs$`_scroll_id`
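Once _scroll_id is available, the remaining pages can be fetched by passing it to scroll() until an empty page comes back. A minimal sketch of that loop follows; scroll_all is a hypothetical helper name (not part of the elastic package), the page size of 1000 is an arbitrary choice, and it has not been run against this cluster.

```r
# Hypothetical helper: collect every hit of an index by following the scroll.
# Uses Search(), scroll() and scroll_clear() from the elastic package.
scroll_all <- function(conn, index, time_scroll = "1m", size = 1000) {
  # The first request opens the scroll context and returns the first page.
  res <- elastic::Search(conn, index = index, time_scroll = time_scroll,
                         size = size)
  out <- res$hits$hits

  # Keep requesting pages until one arrives without hits.
  while (length(res$hits$hits) > 0) {
    res <- elastic::scroll(conn, res$`_scroll_id`, time_scroll = time_scroll)
    out <- c(out, res$hits$hits)
  }

  # Release the scroll context on the server once we are done.
  elastic::scroll_clear(conn, res$`_scroll_id`)
  out
}

# Usage, with the connection object `c` from the code above:
# all_hits <- scroll_all(c, "document-000002")
# length(all_hits)
```

With more than 8 million hits, holding everything in one list may not fit in memory, so it is probably safer to process or write out each page inside the loop instead of accumulating.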

Do you have any thoughts on how to pull the stored field?

cphaarmeyer commented 1 year ago

Do you mean something like this?

docs <- Search(conn = c, index = "document-000002", size = 6, body = list(`_source` = "text"))
lapply(docs$hits$hits, function(x) x[["_source"]][[1]])

ale-ful commented 1 year ago

Indeed, the code you propose is more or less what the Search() documentation suggests, and I also tried it (although not as part of the body). However, it doesn't work, because "text" is a stored field, not part of "_source" (see the structure of the record I pasted in my question). According to the documentation, it should be returned when the stored_fields parameter is specified, but that is not the case.
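To make the distinction concrete: in the record above, "text" sits under "fields" rather than "_source", so whatever request returns it, the value has to be read from each hit's fields element. The sketch below shows that extraction on mock hits shaped like the pasted record. extract_stored_field is a hypothetical helper, and the commented-out request at the end is an untested assumption: it passes stored_fields through the Elasticsearch request body instead of as a Search() argument, and whether that returns "text" on this cluster is not confirmed.

```r
# Hypothetical helper (not part of the elastic package): read one entry from
# each hit's "fields" element, returning NA when a hit does not carry it.
extract_stored_field <- function(hits, field) {
  vapply(hits, function(hit) {
    value <- hit[["fields"]][[field]]
    if (is.null(value)) NA_character_ else as.character(value[[1]])
  }, character(1))
}

# Mock hits mirroring the structure of the Kibana record; the second one
# has no "fields" element at all.
mock_hits <- list(
  list(
    `_index` = "document-000002",
    `_id` = "AS_63689606",
    `_source` = list(gender = "male"),
    fields = list(text = list("an exemplary text I want to pull"))
  ),
  list(`_index` = "document-000002", `_id` = "no_fields", `_source` = list())
)

extract_stored_field(mock_hits, "text")

# Untested assumption: ask for the stored field in the request body.
# res <- Search(conn = c, index = "document-000002", size = 8,
#               body = '{"stored_fields": ["text"]}')
# extract_stored_field(res$hits$hits, "text")
```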

cphaarmeyer commented 1 year ago

Oh sorry. Then I don't know. I have never seen such a setup.