opensearch-project / opensearch-hadoop

Apache License 2.0

Add support for Amazon OpenSearch Serverless #269

Open sameercaresu opened 1 year ago

sameercaresu commented 1 year ago

Is your feature request related to a problem?

I am trying to connect to an OpenSearch Serverless collection from Databricks. I can connect to a managed OpenSearch cluster using this connector. However, when connecting to a Serverless collection, I keep getting this error: OpenSearchHadoopIllegalArgumentException: Cannot detect OpenSearch version - typically this happens if the network/OpenSearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'opensearch.nodes.wan.only' Caused by: OpenSearchHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://xxx.aoss.amazonaws.com:9200

I have tried the following configuration:

"pushdown" -> "true"
"opensearch.nodes" -> "https://xxx.aoss.amazonaws.com"
"opensearch.nodes.wan.only" -> "true"
"opensearch.aws.sigv4.region" -> "us-east-1"
"opensearch.aws.sigv4.service.name" -> "aoss"
"opensearch.aws.sigv4.enabled" -> "true"
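For reference, the settings above can be assembled into a plain configuration map before handing it to the connector. This is only a sketch in Java (the thread's language); the endpoint is a placeholder and the key names are the standard opensearch-hadoop ones:

```java
import java.util.HashMap;
import java.util.Map;

public class ServerlessConfig {
    // Builds the connector settings listed above. The endpoint value is
    // a placeholder; substitute your collection's AOSS endpoint.
    public static Map<String, String> build(String endpoint) {
        Map<String, String> conf = new HashMap<>();
        conf.put("pushdown", "true");
        conf.put("opensearch.nodes", endpoint);
        conf.put("opensearch.nodes.wan.only", "true");
        conf.put("opensearch.aws.sigv4.enabled", "true");
        conf.put("opensearch.aws.sigv4.region", "us-east-1");
        conf.put("opensearch.aws.sigv4.service.name", "aoss");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = build("https://example.aoss.amazonaws.com");
        System.out.println(conf.get("opensearch.aws.sigv4.service.name"));
    }
}
```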

What solution would you like?

Is it already possible to connect to OpenSearch Serverless? If so, could you please point me to the correct set of configuration options? If not, I would like to request this feature.

What alternatives have you considered?

I tried elasticsearch-hadoop, but that does not work with OpenSearch Serverless either.

Do you have any additional context?

No.

harshavamsi commented 1 year ago

Hi @sameercaresu, thanks for bringing this up. This is a known issue with OpenSearch Serverless. The Hadoop client makes a GET / root call to the cluster to fetch cluster info such as the UUID and version. Since Serverless does not expose those attributes, the client errors out. I am working on a fix as we speak.
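To illustrate where this fails, the client parses the version.number field out of the GET / response body. The sketch below is a hypothetical, regex-based stand-in for that check (not the connector's actual parser), showing that a body without a version object yields nothing to detect:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionProbe {
    // Illustrative only: pull the "number" value out of a GET / response.
    // The real connector uses a proper JSON parser, not a regex.
    private static final Pattern VERSION =
            Pattern.compile("\"number\"\\s*:\\s*\"([^\"]+)\"");

    public static Optional<String> detect(String rootResponseBody) {
        Matcher m = VERSION.matcher(rootResponseBody);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Against a managed cluster the response carries a version number; against a Serverless collection the field is absent, detection comes back empty, and the connector raises "Cannot detect OpenSearch version".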

ktech-rob commented 1 year ago

Hi, just ran into the same issue as @sameercaresu. @harshavamsi, I was wondering if there has been any update on this?

wbeckler commented 1 year ago

I haven't heard of any changes to Serverless to address this API gap.

eswar7216 commented 1 year ago

We have a use case that requires connecting to OpenSearch Serverless from Apache Spark, and I am running into a similar issue. Is there a workaround for connecting to OpenSearch Serverless from Apache Spark?

wbeckler commented 1 year ago

There is still no known workaround. If you do figure out a way, please share it here or propose a PR so we can patch the client.

eswar7216 commented 1 year ago

Not sure if everyone is trying to do the same thing I was, but the following works for me. I have data in a database that I wanted to load into OpenSearch using Apache Spark on EMR: Database -> EMR (Spark) -> OpenSearch Serverless.

I am using opensearch-hadoop (Java) to connect to the OpenSearch Serverless collection (deployed in a VPC) through a VPC endpoint, roughly as below, and it works for me:

// requires java.util.HashMap and java.util.Map
Map<String, String> map = new HashMap<>();
map.put("opensearch.nodes", "vpc domain url"); // the collection's VPC endpoint
map.put("opensearch.port", "443");
map.put("opensearch.resource", "index_name");
map.put("opensearch.nodes.wan.only", "true");

// 'data' is the Dataset<Row> read from the database
JavaOpenSearchSparkSQL.saveToOpenSearch(data, map);
Xtansia commented 8 months ago

I've done some quick investigation into this, and it's more extensive than just the GET / info request. The first few missing APIs I hit while attempting a simple write from Spark can be worked around, though not ideally.

The bigger issue came when I then attempted a read.

wbeckler commented 3 months ago

@Xtansia It looks like _search_shards is getting called even when the setting os.nodes.client.only is set to TRUE. In that scenario the _search_shards is useless and shouldn't execute since no shards will map to non-data nodes. That means this should be a noop: https://github.com/opensearch-project/opensearch-hadoop/blob/c9a6a1cb11404d2e738bbe57cc724609f3020e7f/mr/src/main/java/org/opensearch/hadoop/rest/RestRepository.java#L279. Thoughts?

Xtansia commented 2 months ago

> @Xtansia It looks like _search_shards is getting called even when the setting os.nodes.client.only is set to TRUE. In that scenario the _search_shards is useless and shouldn't execute since no shards will map to non-data nodes. That means this should be a noop: https://github.com/opensearch-project/opensearch-hadoop/blob/c9a6a1cb11404d2e738bbe57cc724609f3020e7f/mr/src/main/java/org/opensearch/hadoop/rest/RestRepository.java#L279. Thoughts?

It's not quite as simple as just not calling it: the connector uses the shards to determine how to partition the job within Spark for parallelisation, and Serverless doesn't expose any shard information. It may be possible to work around this by hard-coding one partition (or a configurable number) for Serverless, but I haven't dug in far enough to know whether that's feasible if other parts of the code expect an actual shard ID.

itiyama commented 2 months ago

@Xtansia How does the Hadoop client use the following APIs? I am exploring a solution in which Serverless returns a dummy/empty response for these APIs for backward compatibility, but without understanding how the client uses them, returning a dummy response would be of no use.

  1. GET / - what does the client do with the response? Say the Serverless deployment returns an empty response for all fields, or some dummy value - would that work?
  2. GET /_cluster/health/{index} - we can hard-code this to GREEN; are there any other response params the client relies on?
  3. POST /{index}/_refresh - yes, this can be a NOOP.
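For illustration only, a dummy GET / body that would satisfy a naive version check might look like the following. Every value here is invented; whether the connector actually tolerates placeholder values in each field would need testing against the client:

```json
{
  "cluster_name": "serverless-collection",
  "cluster_uuid": "00000000-0000-0000-0000-000000000000",
  "version": {
    "distribution": "opensearch",
    "number": "2.11.0",
    "build_type": "serverless"
  }
}
```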
Leo-Rola commented 2 months ago

@dblock I have seen you pinned this issue about using Hadoop with OpenSearch Serverless. Can I ask what has been solved? If, for example, I want to use Glue to transfer documents from one OpenSearch Serverless collection to another, could I do that now? Thanks in advance.