run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

How to reuse embeddings created in Elasticsearch with DenseX. #15589

Open rvssridatta opened 3 months ago

rvssridatta commented 3 months ago

Bug Description

How can I store and reuse the embeddings created by Elasticsearch and DenseX?

Below is one approach where I face issues: I store the nodes in a dictionary and then try to pass it along.

[Attached screenshot showing the error logs/traceback]

Also, please provide additional information on how to get proper page-number references for the generated response.

Version

llama-index==0.10.12

Steps to Reproduce

# Imports assumed for llama-index 0.10.x
import os
import pickle

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import IndexNode
from llama_index.packs.dense_x_retrieval import DenseXRetrievalPack
from llama_index.vector_stores.elasticsearch import ElasticsearchStore


def create_all_nodes_dict():
    # Load documents and split them into base nodes
    documents = SimpleDirectoryReader(input_dir=os.environ.get('DOC_PATH')).load_data()
    demotest.info("Loading Documents")  # demotest is an application logger (defined elsewhere)
    node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=100)
    demotest.info("Splitting the Sentences")
    base_nodes = node_parser.get_nodes_from_documents(documents)
    for idx, node in enumerate(base_nodes):
        node.id_ = f"node-{idx}"
    demotest.info("Creating Embedding")

    sub_chunk_sizes = [128, 256, 512]
    sub_node_parsers = [
        SentenceSplitter(chunk_size=c, chunk_overlap=100) for c in sub_chunk_sizes
    ]

    all_nodes = []
    for base_node in base_nodes:
        for n in sub_node_parsers:
            sub_nodes = n.get_nodes_from_documents([base_node])
            sub_inodes = [
                IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
            ]
            all_nodes.extend(sub_inodes)

        # also add the original node
        original_node = IndexNode.from_text_node(base_node, base_node.node_id)
        all_nodes.append(original_node)

    all_nodes_dict = {n.node_id: n for n in all_nodes}

    file_path = './pkl_folder/all_nodes_dict.pkl'
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    # Open the file in binary write mode ('wb') to save the dictionary
    with open(file_path, 'wb') as f:
        pickle.dump(all_nodes_dict, f)

    return all_nodes_dict

def define_rag(all_nodes):
    llm = llm_query()              # user-defined helper returning the query LLM
    embed_model = embeded_model()  # user-defined helper returning the embedding model

    # update_status_check('0')
    vector_store = ElasticsearchStore(
        index_name=os.environ.get('INDEX_NAME'),
        es_url=os.environ.get('ES_URL'),
        # retrieval_strategy=AsyncDenseVectorStrategy(hybrid=True),
    )

    retriever_chunk = DenseXRetrievalPack(
        documents=all_nodes,
        query_llm=llm,
        vector_store=vector_store,
    )

    query_engine_chunk = retriever_chunk.query_engine
    return query_engine_chunk, retriever_chunk

all_nodes = create_all_nodes_dict()
query_engine_chunk, retriever_chunk = define_rag(all_nodes)

response = query_engine_chunk.query("Current Query: " + query)

Relevant Logs/Tracebacks

Please refer to the screenshot attached above for the relevant logs.
dosubot[bot] commented 3 months ago

To store and reuse embeddings created by Elasticsearch and DenseX, you can use the ElasticsearchEmbedding class provided in the llama_index package. This class generates embeddings using a model deployed in an Elasticsearch cluster, and you can instantiate it using either an existing Elasticsearch connection or Elasticsearch credentials.

Here is an example of how to instantiate the ElasticsearchEmbedding class using an existing Elasticsearch connection:

from elasticsearch import Elasticsearch
from llama_index.legacy.embeddings.elasticsearch import ElasticsearchEmbedding

# Define the model ID and input field name (if different from default)
model_id = "your_model_id"
input_field = "your_input_field"  # Optional, only if different from 'text_field'

# Create Elasticsearch connection
es_connection = Elasticsearch(hosts=["http://localhost:9200"], basic_auth=("user", "password"))

# Instantiate ElasticsearchEmbedding using the existing connection
embeddings = ElasticsearchEmbedding.from_es_connection(
    model_id,
    es_connection,
    input_field=input_field,
)
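
Once the vectors are stored in Elasticsearch, they do not need to be recomputed on every run: you can reconnect to the existing index and build a VectorStoreIndex directly on top of the vector store. The sketch below is a minimal illustration, assuming llama-index 0.10-style packages (llama-index-embeddings-elasticsearch, llama-index-vector-stores-elasticsearch) and the same INDEX_NAME / ES_URL environment variables used in the issue; the model ID and credentials are placeholders.

import os

from elasticsearch import Elasticsearch
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.elasticsearch import ElasticsearchEmbedding
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

# Use the same embedding model that originally produced the stored vectors
es_client = Elasticsearch(hosts=["http://localhost:9200"], basic_auth=("user", "password"))
Settings.embed_model = ElasticsearchEmbedding.from_es_connection("your_model_id", es_client)

# Reconnect to the Elasticsearch index that already holds the embeddings
vector_store = ElasticsearchStore(
    index_name=os.environ.get("INDEX_NAME"),
    es_url=os.environ.get("ES_URL"),
)

# Build an index over the existing store; nothing is re-embedded at this point
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine()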

To resolve the AttributeError: 'str' object has no attribute 'id' in the llama_index package, ensure that you are passing the correct type of object to the method that is raising the error. The error suggests that a string is being used where an object with an id attribute is expected. Double-check the inputs to your methods and ensure they match the expected types.
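
In the code from this issue, a likely cause is that create_all_nodes_dict() returns a dict mapping node IDs (strings) to nodes, and that dict is then passed as documents= to DenseXRetrievalPack; iterating over a dict yields its string keys, which have no id attribute. A rough sketch of the intended call, assuming the pack parameters already shown in the issue (documents, query_llm, vector_store):

import os

from llama_index.core import SimpleDirectoryReader
from llama_index.packs.dense_x_retrieval import DenseXRetrievalPack

# Pass Document objects, not a dict of nodes; the pack builds its own
# propositions and sub-nodes internally before writing to the vector store.
documents = SimpleDirectoryReader(input_dir=os.environ.get("DOC_PATH")).load_data()

retriever_chunk = DenseXRetrievalPack(
    documents=documents,        # list of Document objects
    query_llm=llm,              # the LLM from define_rag() in the issue
    vector_store=vector_store,  # the ElasticsearchStore from define_rag()
)
query_engine_chunk = retriever_chunk.query_engine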

For example, if you encounter an error like AttributeError: 'str' object has no attribute 'upsert', it indicates that a string is being passed where a vector store object is expected. Ensure that you are correctly initializing the vector store and passing it to the relevant methods.

Here is a corrected example based on a common issue:

from pinecone import Pinecone
from llama_index import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.vector_stores import PineconeVectorStore

# Load documents
loader = SimpleDirectoryReader('/path to data', recursive=True)
documents = loader.load_data()

# Initialize the Pinecone client (v3-style) and get a handle to the index
# (PineconeVectorStore expects a Pinecone Index object, not the index name as a string)
pc = Pinecone(api_key="your_api_key")
pinecone_index = pc.Index("your_pinecone_index")
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# Create storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create index
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

If you need more specific guidance on resolving the AttributeError, please provide the relevant code snippet where the error occurs. The ElasticsearchEmbedding class expects an Elasticsearch connection object and a model ID, so make sure these are correctly provided.
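
If you do not already have an Elasticsearch client object, the class also offers a from_credentials constructor. A minimal sketch, assuming the keyword names mirror the connection example above; the model ID, URL, and credentials are placeholders:

from llama_index.legacy.embeddings.elasticsearch import ElasticsearchEmbedding

# Instantiate the embedding model from URL and credentials instead of a client object
embeddings = ElasticsearchEmbedding.from_credentials(
    model_id="your_model_id",
    es_url="http://localhost:9200",
    es_username="user",
    es_password="password",
)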

Additionally, if you are working with composable indices and encounter an error like AttributeError: 'tuple' object has no attribute 'embedding', ensure that you are correctly handling the objects returned by your methods. For example, when creating a ComposableGraph and querying it, make sure that the objects being passed around are of the expected types and structures.

Here is an example of how to create and query a ComposableGraph:

from llama_index import ComposableGraph, ListIndex

# Assuming index1 and index2 are already created and are instances of GPTSimpleVectorIndex
graph = ComposableGraph.from_indices(
    ListIndex,
    [index1, index2],
    index_summaries=[
        "summary 1", 
        "summary 2"
    ],
)

# Query the graph
response_summary = graph.query(graph_query_str, query_configs=query_configs)

Ensure that index1 and index2 are valid index objects and not tuples or other types that do not have the required attributes [1][2][3][4].
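
As a quick sanity check before composing the graph (the variable names here are just illustrative), you can verify that nothing was accidentally left as a tuple or a string:

# Hypothetical check: helpers sometimes return (index, storage_context) tuples,
# so confirm both variables are actual index objects before building the graph.
for name, idx in [("index1", index1), ("index2", index2)]:
    if isinstance(idx, (tuple, str)):
        raise TypeError(f"{name} should be an index object, got {type(idx).__name__}")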

To continue talking to Dosu, mention @dosu.