ropensci / elastic

R client for the Elasticsearch HTTP API
https://docs.ropensci.org/elastic

Extract data from elasticsearch V5 into R with elastic package, load into a data frame, #175

Closed mykesmith closed 7 years ago

mykesmith commented 7 years ago

Using your response on Stack Overflow to "Extract data from elasticsearch into R with elastic package, load into a data frame, error due to hits not expanding to the same length", I ran into problems with ES v5.

Using your Shakespeare example:

Search(index="shakespeare", fields=c('play_name','speaker'), asdf = TRUE)
Error: "fields" parameter is deprecated in ES >= v5. Use "_source" in body
See also "fields" parameter in ?Search

so I switched to:

Search(index="shakespeare", body='{"_source":["play_name","speaker"]}', asdf = TRUE)$hits$hits

however, the output has a lot of extraneous columns:

        _index _type _id _score _source.play_name _source.speaker
1  shakespeare   act   0      1          Henry IV                
2  shakespeare  line  14      1          Henry IV   KING HENRY IV
3  shakespeare  line  19      1          Henry IV   KING HENRY IV
4  shakespeare  line  22      1          Henry IV   KING HENRY IV
5  shakespeare  line  24      1          Henry IV   KING HENRY IV
6  shakespeare  line  25      1          Henry IV   KING HENRY IV
7  shakespeare  line  26      1          Henry IV   KING HENRY IV
8  shakespeare  line  29      1          Henry IV   KING HENRY IV
9  shakespeare  line  40      1          Henry IV    WESTMORELAND
10 shakespeare  line  41      1          Henry IV    WESTMORELAND

What needs to be done/changed to get back to what you got with ES 2?

>    play_name       speaker
> 1   Henry IV
> 2   Henry IV KING HENRY IV
> 3   Henry IV KING HENRY IV
> 4   Henry IV KING HENRY IV
> 5   Henry IV KING HENRY IV
> 6   Henry IV KING HENRY IV
> 7   Henry IV KING HENRY IV
> 8   Henry IV KING HENRY IV
> 9   Henry IV  WESTMORELAND
> 10  Henry IV  WESTMORELAND

sckott commented 7 years ago

thanks for the issue @mykesmith

I haven't updated this client fully to make sure it works with all aspects of ES v5

for this problem, first, what version of elastic are you using?

mykesmith commented 7 years ago

Scott, According to R I am using "0.7.8.9515"

sckott commented 7 years ago

@mykesmith Try this

Search(index="shakespeare", source=c("play_name","speaker"), asdf = TRUE)$hits$hits
        _index _type _id _score _source.play_name _source.speaker
1  shakespeare   act   0      1          Henry IV                
2  shakespeare  line  14      1          Henry IV   KING HENRY IV
3  shakespeare  line  19      1          Henry IV   KING HENRY IV
4  shakespeare  line  22      1          Henry IV   KING HENRY IV
5  shakespeare  line  24      1          Henry IV   KING HENRY IV
6  shakespeare  line  25      1          Henry IV   KING HENRY IV
7  shakespeare  line  26      1          Henry IV   KING HENRY IV
8  shakespeare  line  29      1          Henry IV   KING HENRY IV
9  shakespeare  line  40      1          Henry IV    WESTMORELAND
10 shakespeare  line  41      1          Henry IV    WESTMORELAND

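Those other fields are metadata about each document; I think they are always returned.

You can quickly get just the two fields you want, or any source fields, like this:

library(dplyr)
# x holds the hits data frame returned by the Search() call above
x <- Search(index="shakespeare", source=c("play_name","speaker"), asdf = TRUE)$hits$hits
x %>% select(contains("_source"))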
   _source.play_name _source.speaker
1           Henry IV                
2           Henry IV   KING HENRY IV
3           Henry IV   KING HENRY IV
4           Henry IV   KING HENRY IV
5           Henry IV   KING HENRY IV
6           Henry IV   KING HENRY IV
7           Henry IV   KING HENRY IV
8           Henry IV   KING HENRY IV
9           Henry IV    WESTMORELAND
10          Henry IV    WESTMORELAND

you may also want to rename

x %>% 
  select(contains("_source")) %>% 
  rename(play_name = `_source.play_name`, speaker = `_source.speaker`)
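
If there are many source fields, renaming them one by one gets tedious. A minimal base R sketch of the same idea follows, assuming the hits data frame has the column layout shown above. The `hits` data frame here is a hand-built mock with a few rows copied from this thread (not a live query), and `source_only()` is a hypothetical helper, not part of the elastic package.

```r
# Mock of the flattened hits data frame (assumption: same column names as
# the Search(..., asdf = TRUE)$hits$hits output shown in this thread).
hits <- data.frame(
  `_index` = rep("shakespeare", 3),
  `_type` = c("act", "line", "line"),
  `_id` = c("0", "14", "40"),
  `_score` = c(1, 1, 1),
  `_source.play_name` = rep("Henry IV", 3),
  `_source.speaker` = c("", "KING HENRY IV", "WESTMORELAND"),
  check.names = FALSE,       # keep the leading underscores in column names
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the columns that start with `prefix`
# and strip that prefix from their names.
source_only <- function(df, prefix = "_source.") {
  keep <- startsWith(names(df), prefix)
  out <- df[, keep, drop = FALSE]
  names(out) <- substring(names(out), nchar(prefix) + 1)
  out
}

clean <- source_only(hits)
print(clean)
```

With a real query result, `source_only(x)` should give the same two columns as the `select()` plus `rename()` pipeline, without listing each field by hand.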

mykesmith commented 7 years ago

Scott, Thank you, your suggestions worked; I will use this on my own datasets and see where I get. I was reading over the weekend on how to edit the titles but you beat me to it.

Again thanks and this is an excellent package.

sckott commented 7 years ago

glad it works, thx for kind words @mykesmith