ropensci / elastic

R client for the Elasticsearch HTTP API
https://docs.ropensci.org/elastic

SparkR API to read and write from Elasticsearch via distributed dataframes #213

Closed whs2k closed 3 years ago

whs2k commented 6 years ago

Is there a way to write a SparkR DataFrame or RDD to Elasticsearch with multiple nodes?

The elastic package for R is great for normal interactions with Elasticsearch, but it says nothing about Hadoop, distributed dataframes, or RDDs in SparkR 2.0+. When I try to use it, I get the following error:

install.packages("elastic", repos = "http://cran.us.r-project.org")
library(elastic)
library(SparkR)  # needed for sparkR.session() and read.json()
sparkR.session(enableHiveSupport = TRUE)
df <- read.json("/hadoop/file/location")
connect(es_port = 9200, es_host = "https://hostname.dev.company.com",
        es_user = "username", es_pwd = "password")
docs_bulk(df)

Error: no 'docs_bulk' method for class SparkDataFrame
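
docs_bulk() dispatches on the class of its input and has methods for data.frames, lists, and files, but not for SparkDataFrames, hence the error. If the data is small enough to fit in driver memory, one workaround is to collect() the SparkDataFrame into a plain R data.frame first. A minimal sketch, where the index name is a placeholder and the argument names follow the older connect() API used above (newer elastic versions instead pass the connection object returned by connect() as the first argument to docs_bulk()):

library(elastic)
library(SparkR)

# collect() pulls the distributed SparkDataFrame (df from the code above)
# onto the driver as a base R data.frame -- only feasible when the data
# fits in driver memory
local_df <- collect(df)

connect(es_host = "hostname.dev.company.com", es_port = 9200,
        es_user = "username", es_pwd = "password")

# docs_bulk() has a data.frame method, so this dispatches correctly;
# "myindex" is a placeholder index name
docs_bulk(local_df, index = "myindex")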

If this were PySpark, I would use the rdd.saveAsNewAPIHadoopFile() function as shown here, but I can't find any information about an equivalent in SparkR from googling. Elasticsearch also has good documentation, but only for Scala and Java.
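
For what it's worth, the usual SparkR-side analogue is the elasticsearch-hadoop connector, which exposes an "org.elasticsearch.spark.sql" data source that write.df() can target, so the write stays distributed across the executors. A minimal sketch, assuming the connector artifact matches your Spark and Scala versions, with placeholder host, credentials, and index names:

library(SparkR)

# pull in the ES-Hadoop connector; these Maven coordinates are an
# assumption -- match the artifact to your Spark and Scala versions
sparkR.session(sparkPackages = "org.elasticsearch:elasticsearch-spark-20_2.11:6.8.0")

df <- read.json("/hadoop/file/location")

# each executor writes its partitions directly to the cluster, so nothing
# is collected on the driver; "myindex/mytype" and the host are placeholders
write.df(df,
         path = "myindex/mytype",
         source = "org.elasticsearch.spark.sql",
         mode = "append",
         es.nodes = "hostname.dev.company.com",
         es.port = "9200",
         es.net.http.auth.user = "username",
         es.net.http.auth.pass = "password")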

Note that my Elasticsearch cluster has multiple nodes and that, in Zeppelin, I am using the %spark2.r interpreter. This is a re-post of a Stack Overflow question.

sckott commented 6 years ago

thanks for your question @whs2k

is your main use case inserting data? or inserting and reading?

whs2k commented 6 years ago

The main use case is inserting data into ES; reading would be a nice-to-have but is not the priority right now. FYI, in PySpark, reading RDDs from Elasticsearch is handled by the newAPIHadoopRDD() method.
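
For the read direction, the same elasticsearch-hadoop data source works from read.df(), which is roughly the SparkR analogue of PySpark's newAPIHadoopRDD() route. A sketch with placeholder names, assuming the connector jar is on the classpath as above:

library(SparkR)

# read an index back as a distributed SparkDataFrame;
# "myindex/mytype" and the host are placeholders
df <- read.df(path = "myindex/mytype",
              source = "org.elasticsearch.spark.sql",
              es.nodes = "hostname.dev.company.com",
              es.port = "9200")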

sckott commented 6 years ago

okay, thanks

whs2k commented 4 years ago

Posting the Stack Overflow conversation here for visibility: https://stackoverflow.com/questions/49141042/how-to-read-and-write-to-elasticsearch-with-sparkr/62385203#62385203

sckott commented 4 years ago

@whs2k so does that SO answer solve your problem, or do you still hope for a solution in this package?