opensearch-project / opensearch-py

Python Client for OpenSearch
https://opensearch.org/docs/latest/clients/python/
Apache License 2.0

[BUG] #739

Closed clashofphish closed 5 months ago

clashofphish commented 5 months ago

What is the bug?

When I set up an OpenSearch client that authenticates through a header and then use the bulk helper (opensearchpy.helpers.bulk), the client.bulk() call in _process_bulk_chunk returns a string. This causes the _process_bulk_chunk_success function to raise a TypeError when resp['index'] is evaluated at line 185 in opensearchpy/helpers/actions.py.

I tried this with my header defining "Content-Type" as "application/json" and as "application/json; boundary=NL".

How can one reproduce the bug?

import os
import uuid

from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk

# Note: base64_auth_header is a helper that is not shown in this report.
client = OpenSearch(
    host='apigateway.host.vpc.url',
    url_prefix='search',
    use_ssl=True,
    port=443,
    headers={
        'Authorization': base64_auth_header(os.environ['URL_KEY']),
        'Content-Type': 'application/json; boundary=NL',  # also tried with 'application/json'
    }
)

# Test connection works
responseGet = client.indices.get('test-index')

# Build request objects (nodes is not defined in this snippet)
requests = []
for n in nodes[0:3]:
    text = n.text
    metadata = {
        "noticeId": n.metadata['noticeId'],
        "department": n.metadata['department'],
    }
    request = {
        "_id": str(uuid.uuid4()),
        "_op_type": "index",
        "_index": 'test-index',
        "text": text,
        "metadata": metadata,
    }
    requests.append(request)

bulk(
    client,
    requests,
    max_chunk_bytes=1 * 1024 * 1024,
)

What is the expected behavior?

I would expect that the response is a json object that can be indexed using a string key.
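To illustrate the difference between the expected and the observed response, here is a minimal sketch. The response body and the first_item_status helper are hypothetical and only mimic code that expects a parsed bulk response; they are not taken from opensearch-py.

```python
import json

# Hypothetical bulk response body, as it might arrive over the wire.
raw_body = '{"took": 3, "errors": false, "items": [{"index": {"_id": "1", "status": 201}}]}'


def first_item_status(resp):
    """Mimic a helper that expects a parsed bulk response (a dict)."""
    return resp["items"][0]["index"]["status"]


# Parsed response: string keys work as expected.
parsed = json.loads(raw_body)
status = first_item_status(parsed)

# Unparsed response: indexing a str with a str key raises TypeError.
try:
    first_item_status(raw_body)
    error = None
except TypeError as exc:
    error = exc
```

If error ends up holding a TypeError while status is read without trouble, the only difference between the two calls is whether the body was decoded first.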

What is your host/environment?

Do you have any screenshots?

image

Do you have any additional context?

Oddly enough this behavior does not happen on the OpenSearch domain that I deployed outside of the VPC when I use the http_auth=(username, password) parameter.

saimedhi commented 5 months ago

Hello @clashofphish, I will try to replicate this case in a VPC environment. In the meantime, if you find the cause or bug, please feel free to contribute. Thank you!

dblock commented 5 months ago

@clashofphish Is this in Amazon Managed OpenSearch? Do you have this reproduced with curl or awscurl so we can see if the problem is the client or the server?

clashofphish commented 5 months ago

@dblock This is the Amazon Managed OpenSearch. When I use curl I don't get the same error. Only when I attempt to use the SDK.

Let me know if you need more information.

dblock commented 5 months ago

@clashofphish This is helpful. Will you please post the working curl(s)?

I think the whole VPC business is a red herring. I'd start by removing the content type from your python code because bulk is ld-json, not json. Next I'd dig through the code to see exactly what's being sent up in the python client and received back and compare to the curl i/o.

clashofphish commented 5 months ago

I'll get the curl info for you.

In the meantime, I can tell you that when I turned logging on, the log messages from OpenSearch show that OS is getting the records and writing them correctly. Also, the count of records increases. It's just that the response object is a string rather than a json object.

The log message:

image

Also, I tried the code without the "Content-Type" specified in the header and had the exact same issue.

dblock commented 5 months ago

It's just that the response object is a string rather than a json object.

That is saying that the content type of the result is not evaluated properly, so needs to be debugged.
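One way to picture what "evaluating the content type of the result" could mean is a client that chooses a decoder from the response's Content-Type header. The sketch below is a simplified stand-in under that assumption, not opensearch-py's actual code; decode_body is a hypothetical name.

```python
import json


def decode_body(body: str, content_type: str):
    """Pick a decoder from the Content-Type header (illustrative only)."""
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    mimetype = content_type.split(";", 1)[0].strip().lower()
    if mimetype == "application/json":
        return json.loads(body)
    # Anything else is passed through untouched, i.e. as a plain string.
    return body


body = '{"errors": false, "items": []}'
as_json = decode_body(body, "application/json; charset=UTF-8")
as_text = decode_body(body, "text/plain")
```

Under this model, the same bytes come back as a dict or as a str depending only on the header, so comparing the response headers of the working and failing setups may be a useful first check.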

clashofphish commented 5 months ago

It's just that the response object is a string rather than a json object.

That is saying that the content type of the result is not evaluated properly, so needs to be debugged.

Agreed. The SDK does not evaluate the response from the endpoint correctly when I do my authorization using the header rather than the http_auth parameter. Because it does not evaluate that result correctly, it errors in the _process_bulk_chunk_success function.

Is there another way to tell the SDK how to parse the result object that I'm missing?

Or am I not understanding what you are trying to say correctly?

dblock commented 5 months ago

Or am I not understanding what you are trying to say correctly?

I'm just saying it's not supposed to happen this way. It should "just work" (TM). So there's a bug somewhere :) Since you have a way to reproduce I am hoping you'll narrow it down by walking through the code ;)

Ideally, turn this into a failing unit test? I can try to fix from there.

clashofphish commented 5 months ago

I can't get the bulk curl request to work because it keeps giving me an error about having to end in a newline when I clearly have a newline in my call (I also tried having the data in a json file and using @reqs.json after --data-raw) --

curl -X POST --location 'https://<url>/test-index/_bulk' --header 'Authorization: <base64key>' --header 'Content-Type: application/json' --data-raw '{ "index": { "_index": "test-index", "_id": "1" } }\n{"id": "1", "text": "bob", "metadata": {"noticeId": "c7c", "department": "HOUSING"}}\n{ "index": { "_index": "test-index", "_id": "2" } }\n{"id": "2", "text": "jane", "metadata": {"noticeId": "6e9", "department": "HOUSING"}}\n'
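A possible explanation for the newline error: in POSIX shells, \n inside single quotes is passed through as a literal backslash followed by n, so the request body may contain no real newline character at all. The sketch below builds a newline-delimited bulk body in Python instead. The documents and the build_bulk_body helper are hypothetical and only loosely modelled on the command above.

```python
import json

# Hypothetical documents, loosely modelled on the ones in the curl command.
docs = [
    {"id": "1", "text": "bob", "metadata": {"noticeId": "c7c", "department": "HOUSING"}},
    {"id": "2", "text": "jane", "metadata": {"noticeId": "6e9", "department": "HOUSING"}},
]


def build_bulk_body(index, documents):
    """Return an action line plus a source line per document, newline-terminated."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    # Real newline characters between lines, and one at the very end.
    return "\n".join(lines) + "\n"


body = build_bulk_body("test-index", docs)

# For comparison: what a single-quoted shell string ending in \n would contain.
literal = r'{"index": {}}\n'
```

The resulting body could be written to a file and sent with curl's --data-binary @file option. Per curl's documentation, --data-raw does not give @ its special meaning and --data strips newlines when reading from a file, so neither is a good fit for a bulk body.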

I can tell you that my co-workers have been able to successfully make fetch calls to push documents to the index --

const request = await fetch('https://<url>/_bulk', {
  body: batch.map(JSON.stringify).join('\n') + '\n',
  method: 'POST',
  headers: {
    'Authorization': <token here>,
    'Content-Type': 'application/json; boundary=NL',
  },
})

I can also tell you that I know the request to client.bulk() that the bulk helper performs is working because my documents end up in my index. It's just that the response is a string, so it causes the post-processing of the response to fail. This only happens when I use the header to specify my authentication token for OS behind the VPC. It does not happen when I use http_auth with an OS instance that is not behind a VPC. From what I can see, the calls to OS are the same in both instances.

I don't know what to do from here. I'm happy to provide more, but I need guidance on what you need.

clashofphish commented 5 months ago

This ticket can be closed. I narrowed the error down to the way the API Gateway and VPC were built. Sorry for the mix-up. Thanks for your help regardless.

dblock commented 5 months ago

This ticket can be closed. I narrowed the error down to the way the API Gateway and VPC were built. Sorry for the mix-up. Thanks for your help regardless.

I'm glad you fixed the issue. Could you help us understand what the root cause was here and how you figured it out?

clashofphish commented 4 months ago

The problem was that the API Gateway was configured incorrectly. The lesson is that even when you trust your coworkers, sometimes you still have to double check their work. The Gateway was set up such that it was stringifying the response object inside of a stringified object.

I figured this out by poking at my coworker for more help. It's partially my fault for being ignorant of how API Gateways work and how this one was set up.
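The double stringification described above can be reproduced in a few lines. This is a minimal sketch with a hypothetical payload; it assumes the gateway JSON-encoded a body that was already JSON-encoded, which is one reading of the description and not a confirmed detail of the actual configuration.

```python
import json

# Hypothetical bulk response as the server would produce it.
payload = {"took": 3, "errors": False, "items": []}

# What a correctly configured proxy would forward: JSON encoded once.
encoded_once = json.dumps(payload)

# Assumed misconfiguration: the already-encoded body is encoded again.
encoded_twice = json.dumps(encoded_once)

# A client that decodes once gets a str back, not a dict.
decoded = json.loads(encoded_twice)

# A second decode recovers the original object.
recovered = json.loads(decoded)
```

A client that decodes the body once, as JSON clients normally do, would hand back decoded, which is a string. That matches the symptom reported in this issue.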

dblock commented 4 months ago

The Gateway was set up such that it was stringifying the response object inside of a stringified object.

I mean, how was it set up to enable this behavior?