Closed mirodrr closed 1 year ago
I can reproduce this against OpenSearch Serverless w/2.4.1. Code in https://github.com/dblock/opensearch-langchain.
This works with all of OSS OpenSearch, and Amazon Managed OpenSearch and Serverless.
#!/usr/bin/env python3
from os import environ
from typing import List
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.schema.embeddings import Embeddings
fake_texts = ["foo", "bar", "baz"]
class FakeEmbeddings(Embeddings):
def embed_documents(self, texts: List[str]) -> List[List[float]]:
return [[float(1.0)] * 9 + [float(i)] for i in range(len(texts))]
def embed_query(self, text: str) -> List[float]:
return [float(1.0)] * 9 + [float(0.0)]
docsearch = OpenSearchVectorSearch.from_texts(
fake_texts,
FakeEmbeddings(),
opensearch_url='https://localhost:9200',
verify_certs=False,
http_auth=("admin", "admin")
)
OpenSearchVectorSearch.add_texts(
docsearch, fake_texts, vector_field="my_vector", text_field="custom_text"
)
This works with OSS OpenSearch, Amazon Managed OpenSearch, but not with Serverless on 2.4.1.
#!/usr/bin/env python3
import logging
from os import environ
from typing import List
from urllib.parse import urlparse
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection, __versionstr__
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.schema.embeddings import Embeddings
from boto3 import Session
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
opensearch_url = environ['ENDPOINT']
url = urlparse(opensearch_url)
region = environ.get('AWS_REGION', 'us-east-1')
service = environ.get('SERVICE', 'es')
credentials = Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, service)
print(f"Using opensearch-py {__versionstr__}")
fake_texts = ["foo", "bar", "baz"]
class FakeEmbeddings(Embeddings):
def embed_documents(self, texts: List[str]) -> List[List[float]]:
return [[float(1.0)] * 9 + [float(i)] for i in range(len(texts))]
def embed_query(self, text: str) -> List[float]:
return [float(1.0)] * 9 + [float(0.0)]
docsearch = OpenSearchVectorSearch.from_texts(
fake_texts,
FakeEmbeddings(),
opensearch_url=opensearch_url,
use_ssl=True,
verify_certs=True,
connection_class=RequestsHttpConnection,
http_auth=auth
)
OpenSearchVectorSearch.add_texts(
docsearch, fake_texts, vector_field="my_vector", text_field="custom_text"
)
$ poetry run python hello.py
INFO:Found credentials in environment variables.
Using opensearch-py 2.3.2
WARNING:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:404 request:0.306s]
INFO:PUT https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:200 request:0.346s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:11.820s]
INFO:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:200 request:0.330s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:0.458s]
$ poetry run python hello.py
INFO:Found credentials in environment variables.
Using opensearch-py 2.4.1
WARNING:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/976d32bd8a26420c82de3908337e14ce [status:404 request:0.386s]
INFO:PUT https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/976d32bd8a26420c82de3908337e14ce [status:200 request:0.356s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:0.262s]
Traceback (most recent call last):
File "/Users/dblock/source/langchain/hello/hello.py", line 31, in <module>
docsearch = OpenSearchVectorSearch.from_texts(
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 777, in from_texts
return cls.from_embeddings(
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 901, in from_embeddings
_bulk_ingest_embeddings(
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 138, in _bulk_ingest_embeddings
bulk(client, requests, max_chunk_bytes=max_chunk_bytes)
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 425, in bulk
for ok, item in streaming_bulk(client, actions, ignore_status=ignore_status, *args, **kwargs): # type: ignore
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 338, in streaming_bulk
for data, (ok, info) in zip(
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 273, in _process_bulk_chunk
for item in gen:
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 202, in _process_bulk_chunk_success
raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
opensearchpy.helpers.errors.BulkIndexError: ('3 document(s) failed to index.', [{'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '3ac5cf95-7f07-4d95-a44f-0615bb76aad1', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}}}}, {'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '0d234ec6-f620-4013-870e-1a082b813c94', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}}}}, {'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '7c54ac77-3853-4c70-a01b-55dae4de1115', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}}}}])
Uses bulk
AOSS doesn't support passing _id in bulk. @dblock did we change anything from opensearch-py client?
@dblock
@dblock I remember we added a check in langchain to identity a diff between aoss and AOS Ref: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/vectorstores/opensearch_vector_search.py#L83-L91 . it was done using service
attribute of AWSV4SignerAuth.
does that get changed?
I bisected this to https://github.com/opensearch-project/opensearch-py/pull/547.
We are sending different data!
method: POST
url: https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk
data: b'{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0],"text":"foo","metadata":{},"id":"2a31333a-974f-4641-baf7-41072f221329"}\n{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],"text":"bar","metadata":{},"id":"c04199df-3304-416c-a74e-5f6150cbfe22"}\n{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0],"text":"baz","metadata":{},"id":"da0b79e7-8e5c-4945-ab77-d8be9d5c197e"}\n
method: POST
url: https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk
body: b'{"index":{"_id":"db460c2e-8775-4e0d-b040-2acab808817c","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0],"text":"foo","metadata":{}}\n{"index":{"_id":"e2977ca9-3394-4f59-9c8e-f70dd5685c91","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],"text":"bar","metadata":{}}\n{"index":{"_id":"3bf8f6ed-17ed-4f90-a047-5dc0284f1f08","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0],"text":"baz","metadata":{}}\n'
@dblock this is interesting. Are we raising a PR to fix this or reverting the change?
The problem is here: https://github.com/langchain-ai/langchain/blob/b4312aac5c0567088353178fb70fdb356b372e12/libs/langchain/langchain/vectorstores/opensearch_vector_search.py#L83
The code relies on signer#service
, which is really an implementation detail, to figure out whether we're talking to AOS or AOSS. The data sent into bulk
changes in LangChain depending on that.
[{'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}, 'id': 'b46ca273-f691-4b09-9ef8-6e78bd9d0e30'}, {'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}, 'id': 'a981f234-1fb2-4646-9e1c-bd36a3de21ab'}, {'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}, 'id': '5a10a420-7a3a-4902-bcec-b7f18f8dde21'}]
[{'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}, '_id': '7a0c551c-1764-4e93-9288-0185671a1cf1'}, {'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}, '_id': '54f9d0d2-6ba9-4bc6-96ab-468400e35790'}, {'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}, '_id': 'c1a0cc14-80ae-48f7-9b21-475a6fbf5022'}]
@dblock its not necessary related to AOSS. Even in AOS/OSS, if user is asking for OpenSearch service to generate the ids for the data, os client should not send _id in the bulk request. We should respect the customer input, and seems like urlib3 is not respecting it
@dblock its not necessary related to AOSS. Even in AOS/OSS, if user is asking for OpenSearch service to generate the ids for the data, os client should not send _id in the bulk request. We should respect the customer input, and seems like urlib3 is not respecting it
I don't believe that's correct, see https://github.com/opensearch-project/opensearch-py/pull/603#issuecomment-1816996140
We've released 2.4.2 with a fix.
Hi, I am running into the exact same issue in javascript sdk. Is there a way you guys fix it there as well? @dblock
Hi, I am running into the exact same issue in javascript sdk. Is there a way you guys fix it there as well? @dblock
Can you please open an issue in opensearch-js?
What is the bug?
A clear and concise description of the bug. When using LangChain OpenSearchVectorSearch with OpenSearchServerless and doing:
OR
I get the error
This same code works fine on 2.3.2.
How can one reproduce the bug?
Steps to reproduce the behavior. Create a "OpenSearchVectorSearch" in LangChain when using opensearch-py 2.4.0, and try to upload documents to it
What is the expected behavior?
A clear and concise description of what you expected to happen. Documents upload successfully
What is your host/environment?
Operating system, version. Python Lambda image. Specifically
public.ecr.aws/lambda/python:3.9
Do you have any screenshots?
If applicable, add screenshots to help explain your problem. No
Do you have any additional context?
Add any other context about the problem. No