ropensci / elastic

R client for the Elasticsearch HTTP API
https://docs.ropensci.org/elastic

SparkR API to read and write from Elasticsearch via distributed dataframes #213

Closed whs2k closed 3 years ago

whs2k commented 6 years ago

Is there a way to write a SparkR DataFrame or RDD to Elasticsearch with multiple nodes?

The elastic package for R is great for normal interactions with Elasticsearch, but it says nothing about Hadoop, distributed dataframes, or RDDs in SparkR 2.0+. When I try to use it, I get the following error:

install.packages("elastic", repos = "http://cran.us.r-project.org")
library(elastic)
library(SparkR)  # needed for sparkR.session() and read.json()
sparkR.session(enableHiveSupport = TRUE)
df <- read.json("/hadoop/file/location")
connect(es_port = 9200, es_host = "https://hostname.dev.company.com",
        es_user = "username", es_pwd = "password")
docs_bulk(df)

Error: no 'docs_bulk' method for class SparkDataFrame
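
docs_bulk() dispatches on the class of its input and has methods for data.frames, lists, and files, but not for SparkDataFrames, hence the error. If the data is small enough to fit in driver memory, one workaround is to collect() the SparkDataFrame into a plain R data.frame first. A minimal sketch, where the index name is a placeholder and the argument names follow the older connect() API used above (newer elastic versions instead pass the connection object returned by connect() as the first argument to docs_bulk()):

library(elastic)
library(SparkR)

# collect() pulls the distributed SparkDataFrame (df from the code above)
# onto the driver as a base R data.frame -- only feasible when the data
# fits in driver memory
local_df <- collect(df)

connect(es_host = "hostname.dev.company.com", es_port = 9200,
        es_user = "username", es_pwd = "password")

# docs_bulk() has a data.frame method, so this dispatches correctly;
# "myindex" is a placeholder index name
docs_bulk(local_df, index = "myindex")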

If this were PySpark, I would use the rdd.saveAsNewAPIHadoopFile() function as shown here, but I can't find any information about an equivalent in SparkR from googling. Elasticsearch also has good documentation, but only for Scala and Java.
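
For what it's worth, the usual SparkR-side analogue is the elasticsearch-hadoop connector, which exposes an "org.elasticsearch.spark.sql" data source that write.df() can target, so the write stays distributed across the executors. A minimal sketch, assuming the connector artifact matches your Spark and Scala versions, with placeholder host, credentials, and index names:

library(SparkR)

# pull in the ES-Hadoop connector; these Maven coordinates are an
# assumption -- match the artifact to your Spark and Scala versions
sparkR.session(sparkPackages = "org.elasticsearch:elasticsearch-spark-20_2.11:6.8.0")

df <- read.json("/hadoop/file/location")

# each executor writes its partitions directly to the cluster, so nothing
# is collected on the driver; "myindex/mytype" and the host are placeholders
write.df(df,
         path = "myindex/mytype",
         source = "org.elasticsearch.spark.sql",
         mode = "append",
         es.nodes = "hostname.dev.company.com",
         es.port = "9200",
         es.net.http.auth.user = "username",
         es.net.http.auth.pass = "password")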

Note that my Elasticsearch cluster has multiple nodes and that, in Zeppelin, I am using the %spark2.r interpreter. This is a re-post of a Stack Overflow question.

sckott commented 6 years ago

thanks for your question @whs2k

is your main use case inserting data? or inserting and reading?

whs2k commented 6 years ago

The main use case is inserting data into ES; reading would be a nice-to-have but is not the priority right now. FYI, in PySpark, reading RDDs from Elasticsearch is handled by the newAPIHadoopRDD() method.
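
For the read direction, the same elasticsearch-hadoop data source works from read.df(), which is roughly the SparkR analogue of PySpark's newAPIHadoopRDD() route. A sketch with placeholder names, assuming the connector jar is on the classpath as above:

library(SparkR)

# read an index back as a distributed SparkDataFrame;
# "myindex/mytype" and the host are placeholders
df <- read.df(path = "myindex/mytype",
              source = "org.elasticsearch.spark.sql",
              es.nodes = "hostname.dev.company.com",
              es.port = "9200")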

sckott commented 6 years ago

okay, thanks

whs2k commented 4 years ago

Posting the Stack Overflow conversation here for visibility: https://stackoverflow.com/questions/49141042/how-to-read-and-write-to-elasticsearch-with-sparkr/62385203#62385203

sckott commented 4 years ago

@whs2k so does that SO answer solve your problem, or do you still hope for a solution in this package?