Closed mirodrr closed 1 year ago
I can reproduce this against OpenSearch Serverless with opensearch-py 2.4.1. Code in
This works with all of OSS OpenSearch, Amazon Managed OpenSearch, and Serverless.
#!/usr/bin/env python3
from os import environ
from typing import List
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.schema.embeddings import Embeddings
fake_texts = ["foo", "bar", "baz"]
# Deterministic stand-in embeddings: nine 1.0 values followed by the document index.
class FakeEmbeddings(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(1.0)] * 9 + [float(i)] for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        return [float(1.0)] * 9 + [float(0.0)]
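# Assumption: arguments other than http_auth are not shown in this snippet. The texts
# and embeddings come from the names above; the ENDPOINT variable name is a guess.
docsearch = OpenSearchVectorSearch.from_texts(
    fake_texts,
    FakeEmbeddings(),
    opensearch_url=environ["ENDPOINT"],
    http_auth=("admin", "admin"),
)
# Assumption: the trailing arguments belong to an add_texts call on the store.
OpenSearchVectorSearch.add_texts(
    docsearch, fake_texts, vector_field="my_vector", text_field="custom_text"
)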
This works with OSS OpenSearch and Amazon Managed OpenSearch, but not with Serverless on opensearch-py 2.4.1.
#!/usr/bin/env python3
import logging
from os import environ
from typing import List
from urllib.parse import urlparse
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection, __versionstr__
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.schema.embeddings import Embeddings
from boto3 import Session
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)
opensearch_url = environ['ENDPOINT']
url = urlparse(opensearch_url)
region = environ.get('AWS_REGION', 'us-east-1')
service = environ.get('SERVICE', 'es')
credentials = Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, service)
print(f"Using opensearch-py {__versionstr__}")
fake_texts = ["foo", "bar", "baz"]
# Deterministic stand-in embeddings: nine 1.0 values followed by the document index.
class FakeEmbeddings(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(1.0)] * 9 + [float(i)] for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        return [float(1.0)] * 9 + [float(0.0)]
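# Assumption: the from_texts arguments are not shown in this snippet; they are
# filled in from the names defined above (opensearch_url, auth, RequestsHttpConnection).
docsearch = OpenSearchVectorSearch.from_texts(
    fake_texts,
    FakeEmbeddings(),
    opensearch_url=opensearch_url,
    http_auth=auth,
    connection_class=RequestsHttpConnection,
)
# Assumption: the trailing arguments belong to an add_texts call on the store.
OpenSearchVectorSearch.add_texts(
    docsearch, fake_texts, vector_field="my_vector", text_field="custom_text"
)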
$ poetry run python
INFO:Found credentials in environment variables.
Using opensearch-py 2.3.2
WARNING:GET [status:404 request:0.306s]
INFO:PUT [status:200 request:0.346s]
INFO:POST [status:200 request:11.820s]
INFO:GET [status:200 request:0.330s]
INFO:POST [status:200 request:0.458s]
$ poetry run python
INFO:Found credentials in environment variables.
Using opensearch-py 2.4.1
WARNING:GET [status:404 request:0.386s]
INFO:PUT [status:200 request:0.356s]
INFO:POST [status:200 request:0.262s]
Traceback (most recent call last):
File "/Users/dblock/source/langchain/hello/", line 31, in <module>
docsearch = OpenSearchVectorSearch.from_texts(
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/", line 777, in from_texts
return cls.from_embeddings(
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/", line 901, in from_embeddings
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/", line 138, in _bulk_ingest_embeddings
bulk(client, requests, max_chunk_bytes=max_chunk_bytes)
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/", line 425, in bulk
for ok, item in streaming_bulk(client, actions, ignore_status=ignore_status, *args, **kwargs): # type: ignore
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/", line 338, in streaming_bulk
for data, (ok, info) in zip(
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/", line 273, in _process_bulk_chunk
for item in gen:
File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/", line 202, in _process_bulk_chunk_success
raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
opensearchpy.helpers.errors.BulkIndexError: ('3 document(s) failed to index.', [{'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '3ac5cf95-7f07-4d95-a44f-0615bb76aad1', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}}}}, {'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '0d234ec6-f620-4013-870e-1a082b813c94', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}}}}, {'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '7c54ac77-3853-4c70-a01b-55dae4de1115', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}}}}])
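The second element of the BulkIndexError above is a list of per-document results, so the failure reason can be pulled out programmatically instead of read off the traceback. A minimal sketch, assuming the list keeps the shape shown above; summarize_bulk_errors is a hypothetical helper name, not part of opensearch-py:

```python
from collections import Counter
from typing import Any, Dict, List

REASON = "Document ID is not supported in create/index operation request"


def summarize_bulk_errors(errors: List[Dict[str, Any]]) -> Counter:
    """Count (operation, status, reason) triples in a bulk error list."""
    reasons: Counter = Counter()
    for item in errors:
        # Each item is keyed by its operation type, e.g. {"index": {...}}.
        for op_type, result in item.items():
            error = result.get("error")
            # The error is a dict in the traceback above; fall back to the raw value otherwise.
            reason = error.get("reason") if isinstance(error, dict) else error
            reasons[(op_type, result.get("status"), reason)] += 1
    return reasons


# Trimmed copy of the three failures reported in the traceback.
sample_errors = [
    {"index": {"_id": doc_id, "status": 400,
               "error": {"type": "illegal_argument_exception", "reason": REASON}}}
    for doc_id in (
        "3ac5cf95-7f07-4d95-a44f-0615bb76aad1",
        "0d234ec6-f620-4013-870e-1a082b813c94",
        "7c54ac77-3853-4c70-a01b-55dae4de1115",
    )
]
summary = summarize_bulk_errors(sample_errors)
print(summary)
```

Here all three documents fail for the same reason, which points at the request shape rather than at any individual document.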
Uses bulk
AOSS doesn't support passing _id in bulk. @dblock did we change anything in the opensearch-py client?
@dblock I remember we added a check in LangChain to identify the difference between AOSS and AOS. Ref: . It was done using the service attribute of AWSV4SignerAuth. Did that get changed?
I bisected this to
We are sending different data!
method: POST
data: b'{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0],"text":"foo","metadata":{},"id":"2a31333a-974f-4641-baf7-41072f221329"}\n{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],"text":"bar","metadata":{},"id":"c04199df-3304-416c-a74e-5f6150cbfe22"}\n{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0],"text":"baz","metadata":{},"id":"da0b79e7-8e5c-4945-ab77-d8be9d5c197e"}\n
method: POST
body: b'{"index":{"_id":"db460c2e-8775-4e0d-b040-2acab808817c","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0],"text":"foo","metadata":{}}\n{"index":{"_id":"e2977ca9-3394-4f59-9c8e-f70dd5685c91","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],"text":"bar","metadata":{}}\n{"index":{"_id":"3bf8f6ed-17ed-4f90-a047-5dc0284f1f08","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0],"text":"baz","metadata":{}}\n'
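The two payloads differ only in where the ID ends up: in the first it travels inside the document as id, in the second it sits in the action line as _id. A simplified sketch of how a bulk helper might split one action dict into those two lines. This is a stand-in written for illustration, not opensearch-py's actual code; the set of metadata keys is an assumption limited to the ones seen in this thread, and "idx" and "abc" are placeholder values:

```python
import json
from typing import Any, Dict, Tuple

# Assumption: only these keys are treated as action metadata in this sketch.
METADATA_KEYS = ("_index", "_id")


def to_json(obj: Dict[str, Any]) -> str:
    """Serialize compactly, matching the spacing of the payloads above."""
    return json.dumps(obj, separators=(",", ":"))


def split_action(action: Dict[str, Any]) -> Tuple[str, str]:
    """Return the (action line, source line) pair for one bulk action."""
    doc = dict(action)  # copy so the caller's dict is left untouched
    op_type = doc.pop("_op_type", "index")
    # Metadata keys move into the action line; everything else stays in the document.
    meta = {key: doc.pop(key) for key in METADATA_KEYS if key in doc}
    return to_json({op_type: meta}), to_json(doc)


with_plain_id = {"_op_type": "index", "_index": "idx", "text": "foo", "id": "abc"}
with_meta_id = {"_op_type": "index", "_index": "idx", "text": "foo", "_id": "abc"}

print(split_action(with_plain_id))  # id stays in the document body
print(split_action(with_meta_id))   # _id moves into the action line
```

Under that assumption, whether the caller names the key id or _id decides which of the two request bodies goes over the wire.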
@dblock this is interesting. Are we raising a PR to fix this or reverting the change?
The problem is here:
The code relies on signer#service, which is really an implementation detail, to figure out whether we're talking to AOS or AOSS. The data sent into bulk changes in LangChain depending on that.
[{'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}, 'id': 'b46ca273-f691-4b09-9ef8-6e78bd9d0e30'}, {'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}, 'id': 'a981f234-1fb2-4646-9e1c-bd36a3de21ab'}, {'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}, 'id': '5a10a420-7a3a-4902-bcec-b7f18f8dde21'}]
[{'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}, '_id': '7a0c551c-1764-4e93-9288-0185671a1cf1'}, {'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}, '_id': '54f9d0d2-6ba9-4bc6-96ab-468400e35790'}, {'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}, '_id': 'c1a0cc14-80ae-48f7-9b21-475a6fbf5022'}]
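The dependency described above can be sketched as follows. The classes and function names are hypothetical stand-ins, not LangChain's or opensearch-py's real code. The sketch only shows why reading a service attribute off the auth object is fragile: if the auth class stops exposing that attribute, the check quietly falls back to the non-Serverless shape and _id is sent.

```python
from typing import Any, Dict, List


class AuthWithService:
    """Stand-in for an auth object that exposes its service name."""

    def __init__(self, service: str) -> None:
        self.service = service


class AuthWithoutService:
    """Stand-in for an auth object that keeps the service name internal."""

    def __init__(self, service: str) -> None:
        self._service = service


def looks_like_aoss(http_auth: Any) -> bool:
    # Hypothetical check: depends on an attribute the auth class may not expose.
    return getattr(http_auth, "service", None) == "aoss"


def build_actions(
    index: str, texts: List[str], ids: List[str], http_auth: Any
) -> List[Dict[str, Any]]:
    """Build bulk actions, choosing the ID key from the detected service."""
    id_key = "id" if looks_like_aoss(http_auth) else "_id"
    return [
        {"_op_type": "index", "_index": index, "text": text, id_key: doc_id}
        for text, doc_id in zip(texts, ids)
    ]


exposed = build_actions("idx", ["foo"], ["abc"], AuthWithService("aoss"))
hidden = build_actions("idx", ["foo"], ["abc"], AuthWithoutService("aoss"))
print(exposed[0])  # carries "id", like the first list above
print(hidden[0])   # carries "_id", like the second list above
```

Both calls target the same Serverless collection in this sketch, yet only the first produces the shape that Serverless accepts.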
@dblock it's not necessarily related to AOSS. Even in AOS/OSS, if the user asks the OpenSearch service to generate the IDs for the data, the OpenSearch client should not send _id in the bulk request. We should respect the customer's input, and it seems like urllib3 is not respecting it.
I don't believe that's correct, see
We've released 2.4.2 with a fix.
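For anyone pinning versions, the affected range can be checked up front. A minimal sketch based only on what this thread reports (2.3.2 works, 2.4.0 and 2.4.1 fail, 2.4.2 carries the fix); the function names are hypothetical and the parser handles plain dotted release strings only, not pre-release suffixes:

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted release string such as "2.4.1" into integers."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version: str) -> bool:
    # Range taken from this thread: broken from 2.4.0 up to, but not including, 2.4.2.
    return (2, 4, 0) <= parse_version(version) < (2, 4, 2)


for candidate in ("2.3.2", "2.4.0", "2.4.1", "2.4.2"):
    print(candidate, "affected" if is_affected(candidate) else "ok")
```

In a real environment the string would come from opensearchpy's __versionstr__, the same value the reproduction script prints.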
Hi, I am running into the exact same issue in the JavaScript SDK. Is there a way you could fix it there as well? @dblock
Can you please open an issue in opensearch-js?
What is the bug?
When using LangChain OpenSearchVectorSearch with OpenSearch Serverless and doing:
I get the error
This same code works fine on 2.3.2.
How can one reproduce the bug?
Create an "OpenSearchVectorSearch" in LangChain when using opensearch-py 2.4.0, and try to upload documents to it.
What is the expected behavior?
Documents upload successfully.
What is your host/environment?
Python Lambda image. Specifically
Do you have any screenshots?
No
Do you have any additional context?
No