opensearch-project / opensearch-py

Python Client for OpenSearch
https://opensearch.org/docs/latest/clients/python/
Apache License 2.0
359 stars 178 forks source link

[BUG] LangChain OpenSearchVectorSearch broken with 2.4.1 w/AOSS #600

Closed mirodrr closed 1 year ago

mirodrr commented 1 year ago

What is the bug?

A clear and concise description of the bug. When using LangChain OpenSearchVectorSearch with OpenSearchServerless and doing:

docsearch = OpenSearchVectorSearch(index_name=ELASTICSEARCH_INDEX,
                                       embedding_function=embeddings,
                                       opensearch_url=f'https://{serverless_collection_host}:443',
                                       **opensearch_serverless_args)
docsearch.add_texts(all_text_split, metadatas=metadatas)

OR

docsearch = OpenSearchVectorSearch.from_texts(
                all_text_split,
                embeddings,
                metadatas=metadatas,
                opensearch_url=f'https://{serverless_collection_host}:443',
                bulk_size=12000,
                index_name=ELASTICSEARCH_INDEX,
                **opensearch_serverless_args
            )

I get the error

'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}

This same code works fine on 2.3.2.

How can one reproduce the bug?

Steps to reproduce the behavior. Create a "OpenSearchVectorSearch" in LangChain when using opensearch-py 2.4.0, and try to upload documents to it

What is the expected behavior?

A clear and concise description of what you expected to happen. Documents upload successfully

What is your host/environment?

Operating system, version. Python Lambda image. Specifically public.ecr.aws/lambda/python:3.9

Do you have any screenshots?

If applicable, add screenshots to help explain your problem. No

Do you have any additional context?

Add any other context about the problem. No

dblock commented 1 year ago

I can reproduce this against OpenSearch Serverless w/2.4.1. Code in https://github.com/dblock/opensearch-langchain.

Local OpenSearch

This works with all of OSS OpenSearch, and Amazon Managed OpenSearch and Serverless.

#!/usr/bin/env python3

from os import environ
from typing import List
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.schema.embeddings import Embeddings

fake_texts = ["foo", "bar", "baz"]

class FakeEmbeddings(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(1.0)] * 9 + [float(i)] for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        return [float(1.0)] * 9 + [float(0.0)]

docsearch = OpenSearchVectorSearch.from_texts(
    fake_texts, 
    FakeEmbeddings(), 
    opensearch_url='https://localhost:9200', 
    verify_certs=False, 
    http_auth=("admin", "admin")
)

OpenSearchVectorSearch.add_texts(
    docsearch, fake_texts, vector_field="my_vector", text_field="custom_text"
)

Amazon OpenSearch

This works with OSS OpenSearch, Amazon Managed OpenSearch, but not with Serverless on 2.4.1.

#!/usr/bin/env python3
import logging
from os import environ
from typing import List
from urllib.parse import urlparse
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection, __versionstr__
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.schema.embeddings import Embeddings
from boto3 import Session

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)

opensearch_url = environ['ENDPOINT']
url = urlparse(opensearch_url)
region = environ.get('AWS_REGION', 'us-east-1')
service = environ.get('SERVICE', 'es')
credentials = Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, service)

print(f"Using opensearch-py {__versionstr__}")

fake_texts = ["foo", "bar", "baz"]

class FakeEmbeddings(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(1.0)] * 9 + [float(i)] for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        return [float(1.0)] * 9 + [float(0.0)]

docsearch = OpenSearchVectorSearch.from_texts(
    fake_texts, 
    FakeEmbeddings(), 
    opensearch_url=opensearch_url,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    http_auth=auth
)

OpenSearchVectorSearch.add_texts(
    docsearch, fake_texts, vector_field="my_vector", text_field="custom_text"
)
$ poetry run python hello.py 
INFO:Found credentials in environment variables.
Using opensearch-py 2.3.2
WARNING:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:404 request:0.306s]
INFO:PUT https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:200 request:0.346s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:11.820s]
INFO:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:200 request:0.330s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:0.458s]
$ poetry run python hello.py 
INFO:Found credentials in environment variables.
Using opensearch-py 2.4.1
WARNING:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/976d32bd8a26420c82de3908337e14ce [status:404 request:0.386s]
INFO:PUT https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/976d32bd8a26420c82de3908337e14ce [status:200 request:0.356s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:0.262s]
Traceback (most recent call last):
  File "/Users/dblock/source/langchain/hello/hello.py", line 31, in <module>
    docsearch = OpenSearchVectorSearch.from_texts(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 777, in from_texts
    return cls.from_embeddings(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 901, in from_embeddings
    _bulk_ingest_embeddings(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 138, in _bulk_ingest_embeddings
    bulk(client, requests, max_chunk_bytes=max_chunk_bytes)
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 425, in bulk
    for ok, item in streaming_bulk(client, actions, ignore_status=ignore_status, *args, **kwargs):  # type: ignore
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 338, in streaming_bulk
    for data, (ok, info) in zip(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 273, in _process_bulk_chunk
    for item in gen:
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 202, in _process_bulk_chunk_success
    raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
opensearchpy.helpers.errors.BulkIndexError: ('3 document(s) failed to index.', [{'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '3ac5cf95-7f07-4d95-a44f-0615bb76aad1', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}}}}, {'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '0d234ec6-f620-4013-870e-1a082b813c94', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}}}}, {'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '7c54ac77-3853-4c70-a01b-55dae4de1115', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}}}}])
harshavamsi commented 1 year ago

Uses bulk

navneet1v commented 1 year ago

AOSS doesn't support passing _id in bulk. @dblock did we change anything from opensearch-py client?

@dblock

navneet1v commented 1 year ago

@dblock I remember we added a check in langchain to identity a diff between aoss and AOS Ref: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/vectorstores/opensearch_vector_search.py#L83-L91 . it was done using service attribute of AWSV4SignerAuth.

does that get changed?

dblock commented 1 year ago

I bisected this to https://github.com/opensearch-project/opensearch-py/pull/547.

We are sending different data!

method: POST
url: https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk
data: b'{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0],"text":"foo","metadata":{},"id":"2a31333a-974f-4641-baf7-41072f221329"}\n{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],"text":"bar","metadata":{},"id":"c04199df-3304-416c-a74e-5f6150cbfe22"}\n{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0],"text":"baz","metadata":{},"id":"da0b79e7-8e5c-4945-ab77-d8be9d5c197e"}\n
method: POST
url: https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk
body: b'{"index":{"_id":"db460c2e-8775-4e0d-b040-2acab808817c","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0],"text":"foo","metadata":{}}\n{"index":{"_id":"e2977ca9-3394-4f59-9c8e-f70dd5685c91","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],"text":"bar","metadata":{}}\n{"index":{"_id":"3bf8f6ed-17ed-4f90-a047-5dc0284f1f08","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0],"text":"baz","metadata":{}}\n'
navneet1v commented 1 year ago

@dblock this is interesting. Are we raising a PR to fix this or reverting the change?

dblock commented 1 year ago

The problem is here: https://github.com/langchain-ai/langchain/blob/b4312aac5c0567088353178fb70fdb356b372e12/libs/langchain/langchain/vectorstores/opensearch_vector_search.py#L83

The code relies on signer#service, which is really an implementation detail, to figure out whether we're talking to AOS or AOSS. The data sent into bulk changes in LangChain depending on that.

[{'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}, 'id': 'b46ca273-f691-4b09-9ef8-6e78bd9d0e30'}, {'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}, 'id': 'a981f234-1fb2-4646-9e1c-bd36a3de21ab'}, {'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}, 'id': '5a10a420-7a3a-4902-bcec-b7f18f8dde21'}]
[{'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}, '_id': '7a0c551c-1764-4e93-9288-0185671a1cf1'}, {'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}, '_id': '54f9d0d2-6ba9-4bc6-96ab-468400e35790'}, {'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}, '_id': 'c1a0cc14-80ae-48f7-9b21-475a6fbf5022'}]
dblock commented 1 year ago

Fix in https://github.com/opensearch-project/opensearch-py/pull/603.

navneet1v commented 1 year ago

@dblock its not necessary related to AOSS. Even in AOS/OSS, if user is asking for OpenSearch service to generate the ids for the data, os client should not send _id in the bulk request. We should respect the customer input, and seems like urlib3 is not respecting it

dblock commented 1 year ago

@dblock its not necessary related to AOSS. Even in AOS/OSS, if user is asking for OpenSearch service to generate the ids for the data, os client should not send _id in the bulk request. We should respect the customer input, and seems like urlib3 is not respecting it

I don't believe that's correct, see https://github.com/opensearch-project/opensearch-py/pull/603#issuecomment-1816996140

dblock commented 12 months ago

We've released 2.4.2 with a fix.

JustinYuYou commented 9 months ago

Hi, I am running into the exact same issue in javascript sdk. Is there a way you guys fix it there as well? @dblock

dblock commented 9 months ago

Hi, I am running into the exact same issue in javascript sdk. Is there a way you guys fix it there as well? @dblock

Can you please open an issue in opensearch-js?