
[Question]: Connecting llamaindex to aoss opensearch on aws #14133

Open mike-wethos opened 3 weeks ago

mike-wethos commented 3 weeks ago

Question

Hello,

I am attempting to store documents in AWS OpenSearch Serverless (AOSS) and keep getting a 403 Forbidden when creating the OpensearchVectorClient, despite passing correct credentials. Is this a supported method?

import boto3
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.readers.s3 import S3Reader
from llama_index.vector_stores.opensearch import OpensearchVectorStore, OpensearchVectorClient
from opensearchpy import AWSV4SignerAuth

credentials = boto3.Session(
    aws_access_key_id=env.aws_access_key_id,
    aws_secret_access_key=env.aws_secret_access_key,
).get_credentials()
region = "us-east-2"
service = "aoss"
auth = AWSV4SignerAuth(credentials, region, service)

endpoint = env.vector_store_host
idx = args.index_type
text_field = "content"
embedding_field = "embedding"
client = OpensearchVectorClient(
    endpoint, idx, 768, embedding_field=embedding_field, text_field=text_field, http_auth=auth
)

Thank you for your time.

dosubot[bot] commented 3 weeks ago

Hey @mike-wethos! 👋 I'm here to help you out with your question. Let's tackle this together!

The OpensearchVectorClient class does support the method you are using to create an instance with AWS credentials. However, the 403 Forbidden error you are encountering could be due to several reasons, such as insufficient permissions for the provided AWS credentials or incorrect configuration of the AWSV4SignerAuth.

Here is a checklist to help you troubleshoot the issue:

  1. Verify AWS Credentials: Ensure that the AWS credentials have the necessary permissions to access the OpenSearch service. The credentials should have policies that allow actions like es:ESHttpGet, es:ESHttpPut, es:ESHttpPost, etc.

  2. Check Region and Service: Make sure that the region and service parameters are correctly set. For a managed OpenSearch domain the service name is es; aoss applies only to OpenSearch Serverless collections.

  3. Endpoint URL: Ensure that the endpoint URL is correct and accessible.

Here is an updated version of your code snippet with the correct service name:

import boto3
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.readers.s3 import S3Reader
from llama_index.vector_stores.opensearch import OpensearchVectorStore, OpensearchVectorClient
from opensearchpy import AWSV4SignerAuth

credentials = boto3.Session(
    aws_access_key_id=env.aws_access_key_id,
    aws_secret_access_key=env.aws_secret_access_key,
).get_credentials()
region = "us-east-2"
service = "es"  # Service name for a managed OpenSearch domain
auth = AWSV4SignerAuth(credentials, region, service)

endpoint = env.vector_store_host
idx = args.index_type
text_field = "content"
embedding_field = "embedding"
client = OpensearchVectorClient(
    endpoint, idx, 768, embedding_field=embedding_field, text_field=text_field, http_auth=auth
)
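
As a quick check for point 1, an STS call confirms which principal the supplied keys actually resolve to; a minimal sketch, assuming the same env object as above:

import boto3

session = boto3.Session(
    aws_access_key_id=env.aws_access_key_id,
    aws_secret_access_key=env.aws_secret_access_key,
)
# Print the ARN of the principal behind these keys, so it can be compared
# against whatever principal is granted access on the OpenSearch side.
print(session.client("sts").get_caller_identity()["Arn"])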

If the issue persists after these checks, you may need to review the IAM policies attached to the AWS credentials and ensure they have the necessary permissions to interact with the OpenSearch service. Additionally, ensure that the OpensearchVectorClient is correctly initialized and that the OpenSearch client is properly configured to handle the AWS authentication [1][2].

To continue talking to Dosu, mention @dosu.

mike-wethos commented 3 weeks ago

@dosu I verified that the AWS credentials are correct; however, I had not added the :443 port to the endpoint. After adding it, I now get a timeout exception: ConnectionTimeout caused by - TimeoutError(). I have verified that I can connect and create an index using the opensearch-py library with the above auth and endpoint parameters.
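
For reference, the direct opensearch-py check looks roughly like this (same auth object and host as in the snippets above):

from opensearchpy import OpenSearch, RequestsHttpConnection

# Plain opensearch-py client against the same collection; if this succeeds
# while OpensearchVectorClient times out, the credentials are not the problem.
raw_client = OpenSearch(
    hosts=[{"host": env.vector_store_host, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)
raw_client.indices.create(index="connectivity-test")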

dosubot[bot] commented 3 weeks ago

The ConnectionTimeout error you're encountering can be due to several factors, including network or configuration issues with your OpenSearch server, the overhead of bulk ingestion operations, server performance and configuration, or the client's timeout settings being too low for the operation.

Here are some steps to address these issues:

  1. Verify Server Endpoint and Accessibility: Ensure the OpenSearch server's endpoint is correct and that the server is accessible. Incorrect endpoint configuration or server unavailability can lead to timeout errors.

  2. Check Network Connectivity: Ensure there's stable network connectivity between your client and the OpenSearch server.

  3. Review Server Performance: The OpenSearch server's performance and configuration should be adequate to handle the workload. If the server is under heavy load, consider optimizing its resources.

  4. Adjust Bulk Operation Settings: If you're ingesting large amounts of data, consider adjusting the max_chunk_bytes parameter to better suit your data size and reduce the load on the server. For example, you can increase the max_chunk_bytes value to something larger, such as 5 * 1024 * 1024 (5 MB) or more, depending on your data size and network conditions (see the sketch after this list).

  5. Modify Client Timeout Settings: Increase the timeout settings of the OpensearchVectorClient to accommodate operations that take longer to complete. You can set the timeout parameter when initializing the client:

    from llama_index.vector_stores.opensearch import OpensearchVectorClient
    
    client = OpensearchVectorClient(
        endpoint, idx, 768, embedding_field=embedding_field, text_field=text_field, timeout=60, http_auth=auth
    )
  6. Configure AWS Service Client with Higher Timeout: Ensure that the AWS service client is configured with a higher timeout value to prevent premature timeouts:

    import boto3
    import botocore.config
    
    def get_aws_service_client(
        service_name, region_name, aws_access_key_id, aws_secret_access_key, timeout=60.0
    ):
        config = botocore.config.Config(
            retries={"max_attempts": 3, "mode": "standard"},
            connect_timeout=timeout,
        )
        session = boto3.Session(
            aws_access_key_id=aws_access_key_id,
            aws_secret_access_key=aws_secret_access_key,
            region_name=region_name,
        )
        return session.client(service_name, config=config)
    
    client = get_aws_service_client("es", "us-east-2", env.aws_access_key_id, env.aws_secret_access_key)
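
For point 4, a hedged sketch of raising max_chunk_bytes on the vector client; this assumes the installed llama-index version exposes the parameter, so check the OpensearchVectorClient signature first:

from llama_index.vector_stores.opensearch import OpensearchVectorClient

# Larger bulk chunks mean fewer round trips during ingestion.
# max_chunk_bytes is assumed to be supported here; verify it against
# your installed version before relying on it.
client = OpensearchVectorClient(
    endpoint, idx, 768,
    embedding_field=embedding_field,
    text_field=text_field,
    http_auth=auth,
    max_chunk_bytes=5 * 1024 * 1024,  # 5 MB per bulk request
)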

By following these steps, you should be able to mitigate the timeout exceptions when storing documents in AWS AOSS OpenSearch [1][2][3].

To continue talking to Dosu, mention @dosu.

mike-wethos commented 3 weeks ago

I am still getting the connection timeout

dosubot[bot] commented 3 weeks ago

To resolve the ConnectionTimeout error when creating the OpensearchVectorClient in AWS AOSS OpenSearch, ensure that the http_auth parameter is correctly set to indicate that the service is aoss. This is necessary for the _is_aoss_enabled method to recognize and handle the AOSS-specific configurations.

Here is an example of how to correctly set the http_auth parameter:

from opensearchpy import AWSV4SignerAuth
import boto3
from llama_index.vector_stores.opensearch import OpensearchVectorClient

# Create the AWS credentials
credentials = boto3.Session().get_credentials()
region = 'your-aws-region'

# Create the AWSV4SignerAuth object with the aoss service name
http_auth = AWSV4SignerAuth(credentials, region, "aoss")

# Initialize the OpensearchVectorClient with the correct http_auth
client = OpensearchVectorClient(
    endpoint="https://your-opensearch-endpoint",
    index="your-index-name",
    dim=1536,
    embedding_field="embedding",
    text_field="content",
    http_auth=http_auth
)

Ensure that you replace "your-aws-region", "https://your-opensearch-endpoint", and "your-index-name" with your actual AWS region, OpenSearch endpoint, and index name respectively. This should help in resolving the ConnectionTimeout error by correctly configuring the client for AWS AOSS [1].

To continue talking to Dosu, mention @dosu.

mike-wethos commented 3 weeks ago

With this setup

# aws setup
credentials = boto3.Session(
    aws_access_key_id=env.aws_access_key_id,
    aws_secret_access_key=env.aws_secret_access_key,
).get_credentials()

region = "us-east-2"
service = "aoss"
auth = AWSV4SignerAuth(credentials, region, service)

endpoint = env.vector_store_host
idx = args.index_type
text_field = "content"
embedding_field = "embedding"
client = OpensearchVectorClient(
    endpoint, idx, 768, embedding_field=embedding_field, text_field=text_field, http_auth=auth
)

I am now getting an 'AWSV4SignerAuth' object has no attribute 'encode' error.

mike-wethos commented 3 weeks ago

@dosu I fixed the 'AWSV4SignerAuth' error by adding connection_class=RequestsHttpConnection to the call, but now the code just hangs with no error when attempting to create the client.

import boto3
from opensearchpy import AWSV4SignerAuth, RequestsHttpConnection
from llama_index.vector_stores.opensearch import OpensearchVectorClient

# aws setup
credentials = boto3.Session(
    aws_access_key_id=env.aws_access_key_id,
    aws_secret_access_key=env.aws_secret_access_key,
).get_credentials()

region = "us-east-2"
service = "aoss"
auth = AWSV4SignerAuth(credentials, region, service)

endpoint = env.vector_store_host
idx = args.index_type
text_field = "content"
embedding_field = "embedding"
client = OpensearchVectorClient(
    endpoint, idx, 768, embedding_field=embedding_field, text_field=text_field, http_auth=auth,
    connection_class=RequestsHttpConnection, use_ssl=True, verify_certs=True
)

It finally failed with

ConnectionError(HTTPSConnectionPool(host='hostHere', port=433): Max retries exceeded with url: /enterprise (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff22f577550>: Failed to establish a new connection: [Errno 101] Network is unreachable'))) caused by: ConnectionError(HTTPSConnectionPool(host='hostHere', port=433): Max retries exceeded with url: /enterprise (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff22f577550>: Failed to establish a new connection: [Errno 101] Network is unreachable')))

I have verified that I can successfully create an index using these credentials and host/port with opensearch-py. The issue seems to stem from the fact that llama_index only supports passing the host/port as a string, rather than the standard hosts dict that opensearch-py uses in its examples:

client = OpenSearch(
    hosts = [{'host': endpoint, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

Georgehe4 commented 2 weeks ago

Are you encoding the port in your host when you use the vector client? It looks like opensearch-py should standardize the input string as long as the port is encoded in the string, e.g. "localhost:443".
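
If that holds, the two constructions below should be equivalent (the collection hostname is a placeholder):

from opensearchpy import OpenSearch

# opensearch-py accepts both a URL string and a host/port dict and
# normalizes them to the same transport configuration internally.
client_from_string = OpenSearch(hosts=["https://my-collection.us-east-2.aoss.amazonaws.com:443"])
client_from_dict = OpenSearch(
    hosts=[{"host": "my-collection.us-east-2.aoss.amazonaws.com", "port": 443}]
)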

dosubot[bot] commented 2 weeks ago

To ensure compatibility with llama_index when using OpensearchVectorClient, you should encode the port in the endpoint string as part of the URL. For example, if your OpenSearch instance is running on port 443, your endpoint string should look like this:

endpoint = "https://your-opensearch-endpoint:443"

Here is the updated code snippet with the port included in the endpoint string:

from opensearchpy import AWSV4SignerAuth, RequestsHttpConnection
from llama_index.vector_stores.opensearch import OpensearchVectorClient
import boto3

# AWS setup
credentials = boto3.Session(
    aws_access_key_id=env.aws_access_key_id,
    aws_secret_access_key=env.aws_secret_access_key,
).get_credentials()

region = "us-east-2"
service = "aoss"
auth = AWSV4SignerAuth(credentials, region, service)

endpoint = "https://your-opensearch-endpoint:443"
idx = args.index_type
text_field = "content"
embedding_field = "embedding"

client = OpensearchVectorClient(
    endpoint, idx, 768, embedding_field=embedding_field, text_field=text_field, http_auth=auth,
    connection_class=RequestsHttpConnection, use_ssl=True, verify_certs=True
)

This should help resolve the connection issues by ensuring the port is correctly included in the endpoint URL, making it compatible with the OpensearchVectorClient [1].

To continue talking to Dosu, mention @dosu.

mike-wethos commented 2 weeks ago

Yes, I have tried every variation of the URL with no luck. I think an ideal solution would be to update OpensearchVectorClient to better match the hosts = [{'host': endpoint, 'port': 443}] convention seen in the opensearch-py docs. The current implementation does not seem to work at all with AWS OpenSearch Serverless (AOSS).
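
One possible interim workaround is to build the low-level client the opensearch-py way and hand it to the vector client directly. Note that the os_client parameter below is an assumption: only some versions of llama-index-vector-stores-opensearch accept a pre-built client, so check the constructor signature first:

from opensearchpy import OpenSearch, RequestsHttpConnection
from llama_index.vector_stores.opensearch import OpensearchVectorClient

# Low-level client built with the hosts dict that is verified to work.
raw_client = OpenSearch(
    hosts=[{"host": endpoint, "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# os_client is an assumption; if the installed version accepts it, this
# bypasses the string-based endpoint parsing entirely.
client = OpensearchVectorClient(
    endpoint, idx, 768,
    embedding_field=embedding_field,
    text_field=text_field,
    os_client=raw_client,
)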