opensearch-project / opensearch-hadoop

Apache License 2.0

[Feature Request] IAM Role Based Authentication for Spark to Elasticsearch #28

Closed AlJohri closed 1 year ago

AlJohri commented 1 year ago

Is your feature request related to a problem? Please describe.
I would like to use IAM role based authentication for connecting a Spark job to an OpenSearch cluster.

Describe the solution you'd like
I want to have a setting that enables this library to use the AWS Credential Provider Chain to sign all requests going to OpenSearch with AWS SigV4.
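For context, SigV4 signing derives a signing key from the secret key, date, region, and service, then HMACs a canonical "string to sign" with that key. A minimal standard-library sketch of the derivation step (illustrative only; the connector and the AWS SDKs implement the full canonical-request construction on top of this):

```python
import hashlib
import hmac

def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def sigv4_signature(secret_key: str, date_stamp: str, region: str,
                    service: str, string_to_sign: str) -> str:
    # Derive the signing key as an HMAC chain over the date, region,
    # service, and the fixed "aws4_request" terminator, per the SigV4 spec.
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    k_signing = _hmac(k_service, "aws4_request")
    # The final signature is the hex-encoded HMAC of the string-to-sign.
    return hmac.new(k_signing, string_to_sign.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```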

Describe alternatives you've considered
Alternatives are:

Additional context
Internally at Amazon, Shepard is now cutting tickets for "OpenSearch Service Domain Uses IP Filtering", so this would be a very useful feature to have soon.

AlJohri commented 1 year ago

Related Issues/PRs:

subbaray commented 1 year ago

Can we get this feature?

thewaychung commented 1 year ago

+1

parvathi-nair commented 1 year ago

+1

wbeckler commented 1 year ago

This makes total sense. Feel free to take a stab at this.

ennio1991 commented 1 year ago

+1

wbeckler commented 1 year ago

Just an update that the SigV4 work is continuing.

harshavamsi commented 1 year ago

This is now complete and merged into main.

wbeckler commented 1 year ago

Just curious, for those requesting this, for which version of Elasticsearch or OpenSearch did you need this capability?

AlJohri commented 1 year ago

Elasticsearch 7.10

AlJohri commented 1 year ago

hey @harshavamsi, I saw this note:

Initial support for OpenSearch 1.x and ES 7.10 has been merged into the 1.0 branch.

Does that mean SigV4 signing has already been backported to the 1.0 branch or is that still pending?

harshavamsi commented 1 year ago

@AlJohri SigV4 has been backported and should work.

dgoldenberg-ias commented 1 month ago

Hi @harshavamsi, could you please explain how this feature works?

Currently, I'm able to use the Spark connector to read and write data from/to OpenSearch.

For example:

reader = (
  spark.read.format("opensearch")
  .option("opensearch.nodes", f"https://{self.host}")
  .option("opensearch.port", str(self.port))
  .option("opensearch.resource", self.index_name)
  .option("opensearch.nodes.wan.only", "true")
  .option("opensearch.net.ssl", "true")
  .option("opensearch.aws.sigv4.enabled", "true")
  .option("opensearch.aws.sigv4.region", self.aws_region)
  .option("pushdown", "true")
  .option("query", query_str)
)
df = reader.load()

How does the connector authenticate when this code runs in my Databricks notebook?

When authenticating with opensearch-py, our code does the following:

self.client = OpenSearch(
  hosts=[f"{self.host}:{self.port}"],
  http_auth=get_aws_auth(),
  use_ssl=True,
  verify_certs=True,
  connection_class=RequestsHttpConnection
)

where the auth is basically built from credentials obtained as follows:

session = boto3.Session()
session_credentials = session.get_credentials().get_frozen_credentials()

And that works, but that is via opensearch-py. How does it work in opensearch-hadoop?

I see that the opensearch-hadoop connector is able to authenticate and connect to the OpenSearch Domain we set up. Eventually, however, that internal session expires and we get a 403: AuthorizationException: AuthorizationException(403, '').

Are there more parameters we need to pass in? We need to be able to keep the session alive, or re-establish it when it expires, so that we can support long-running bulk operations, especially bulk indexing.
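As a hypothetical sketch (not an existing connector or opensearch-py feature): one way to survive credential expiry in a long-running job is to rebuild the auth object before an assumed TTL elapses. `make_auth` below stands in for a factory such as a function wrapping the `get_aws_auth()` call above; the `RefreshingAuth` name and the default TTL are assumptions:

```python
import time

class RefreshingAuth:
    """Rebuild an auth object before an assumed session TTL elapses."""

    def __init__(self, make_auth, ttl_seconds=45 * 60, clock=time.monotonic):
        self._make_auth = make_auth  # factory that returns a fresh auth object
        self._ttl = ttl_seconds      # assumed lifetime of the signed session
        self._clock = clock          # injectable for testing
        self._auth = None
        self._born = None

    def current(self):
        # Return the cached auth object, rebuilding it once the TTL expires.
        now = self._clock()
        if self._auth is None or now - self._born >= self._ttl:
            self._auth = self._make_auth()
            self._born = now
        return self._auth
```

Each bulk batch would then call `current()` instead of holding one frozen credentials object for the whole job, so a refreshed set of credentials is picked up before the old session 403s.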

Would appreciate an explanation + any pointers. Thanks.