run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: "AttributeError: 'LangchainEmbedding' object has no attribute '_langchain_embedding'" #10475

Closed austinmw closed 6 months ago

austinmw commented 9 months ago

Bug Description

"AttributeError: 'LangchainEmbedding' object has no attribute '_langchain_embedding'"

Version

0.9.44

Steps to Reproduce

requirements.txt:

boto3
sagemaker
llama-index==0.9.44
llama-hub
langchain
langchain-community
syne-tune
pypdf
unstructured[docx,pdf,pptx]
nltk

entry_point.py:

%%writefile code/entry_point.py
# entry_point.py

import argparse
import asyncio  # needed for asyncio.run() in main()
import os
import boto3
import json
import numpy as np
from pathlib import Path

from syne_tune import Reporter
from syne_tune.constants import (
    ST_CHECKPOINT_DIR,
    ST_INSTANCE_COUNT,
    ST_INSTANCE_TYPE
)

from llama_index import (
    Document,
    VectorStoreIndex,
    load_index_from_storage,
    StorageContext,
    ServiceContext,
)

from llama_index.node_parser import (
    #SimpleNodeParser,
    TokenTextSplitter,
    SentenceSplitter,
    HTMLNodeParser,
    MarkdownNodeParser,
    SentenceWindowNodeParser,
    SemanticSplitterNodeParser,
    HierarchicalNodeParser,
    MarkdownElementNodeParser,
    MetadataAwareTextSplitter,
    LangchainNodeParser, # Use this by default!
    UnstructuredElementNodeParser,
)
from llama_index.ingestion import IngestionPipeline
from llama_index.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index import SimpleDirectoryReader
from llama_index.param_tuner.base import RunResult
from llama_hub.file.pdf.base import PDFReader
from llama_hub.file.unstructured.base import UnstructuredReader
from llama_hub.file.pymu_pdf.base import PyMuPDFReader

from langchain.llms.bedrock import Bedrock
from langchain_community.embeddings import BedrockEmbeddings

from llama_index.llms import LangChainLLM
from llama_index.embeddings import LangchainEmbedding

from llama_index import evaluation
from llama_index.evaluation import (
    eval_utils,
    BatchEvalRunner,
    RetrieverEvaluator,
    AnswerRelevancyEvaluator,
    ContextRelevancyEvaluator,
    SemanticSimilarityEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)

report = Reporter()

def _get_service_context(llm_model_id, embedding_model_id, temperature):

    bedrock_runtime = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1",
    )

    bedrock_embedding = BedrockEmbeddings(
        model_id="amazon.titan-embed-text-v1",
        client=bedrock_runtime,
        region_name="us-east-1",
    )

    bedrock_llm = Bedrock(
        model_id="anthropic.claude-v2:1",
        client=bedrock_runtime, region_name="us-east-1",
        #model_kwargs={"temperature": 0.2},
    )

    embed_model = LangchainEmbedding(bedrock_embedding)

    llm_model = LangChainLLM(bedrock_llm)

    # Setting global service context
    service_context = ServiceContext.from_defaults(
        llm=llm_model,
        embed_model=embed_model,
    )
    #set_global_service_context(service_context)
    return service_context

def _build_index(args, docs, service_context):
    print("Building index...")

    index_out_path = f"./storage_{args.chunk_size}"
    if not os.path.exists(index_out_path):
        Path(index_out_path).mkdir(parents=True, exist_ok=True)

        # Define the splitter class configurations
        splitter_class_configurations = {
            'SentenceSplitter': {
                'class': SentenceSplitter,
                'arg_names': ['chunk_size', 'chunk_overlap'],
            },
            'TokenTextSplitter': {
                'class': TokenTextSplitter,
                'arg_names': ['chunk_size', 'chunk_overlap'],
            },
            'MarkdownNodeParser': {
                'class': MarkdownNodeParser,
                'arg_names': ['include_metadata'],
            },
            # ... (other classes and configurations can be added similarly)
        }

        # Define the parser class configurations
        parser_class_configurations = {
            'SentenceWindowNodeParser': {
                'window_size': 3,
                'window_metadata_key': 'window',
                'original_text_metadata_key': 'original_text',
            },
            # ... (other classes and configurations can be added similarly)
        }

        # Get the configuration for the selected class, default to a certain class if needed
        selected_config = splitter_class_configurations.get(
            args.splitter,
            splitter_class_configurations['SentenceSplitter']
        )

        # Retrieve the necessary arguments from args based on the selected configuration
        init_args = {name: getattr(args, name, None) for name in selected_config['arg_names']}
        print(f"Node parser init args: {init_args}")

        # Instantiate the class with the retrieved arguments
        text_splitter = selected_config['class'].from_defaults(**init_args)
        # # You can also wrap any existing text splitter from langchain with a node parser
        #from langchain.text_splitter import RecursiveCharacterTextSplitter
        #text_splitter = LangchainNodeParser(RecursiveCharacterTextSplitter())

        #title_extractor = TitleExtractor(nodes=5, llm=service_context.llm)
        #qa_extractor = QuestionsAnsweredExtractor(questions=3, llm=service_context.llm)

        service_context.transformations = [
            text_splitter,
            #title_extractor,
            #qa_extractor
        ]

        pipeline = IngestionPipeline.from_service_context(service_context)

        # Parse the documents
        #base_nodes = node_parser.get_nodes_from_documents(docs)
        base_nodes = pipeline.run(
            documents=docs,
            in_place=True,
            show_progress=True,
            num_workers=4,
        )

        # build index
        index = VectorStoreIndex(base_nodes, service_context=service_context)
        # save index to disk
        index.storage_context.persist(index_out_path)
    else:
        # rebuild storage context
        storage_context = StorageContext.from_defaults(
            persist_dir=index_out_path,
        )
        # load index
        index = load_index_from_storage(
            storage_context=storage_context,
            service_context=service_context,
        )

    print("Index built.")
    return index

def _get_eval_batch_runner(service_context):

    # Evaluates if the retrieved context is relevant to the query
    context_query_relevancy = ContextRelevancyEvaluator(
        service_context=service_context,
    )
    # Evaluates whether response is relevant to the retrieved context
    answer_context_relevancy = RelevancyEvaluator(
        service_context=service_context,
    )
    # Evaluates whether response is relevant to the query
    answer_query_relevancy = AnswerRelevancyEvaluator(
        service_context=service_context,
    )
    # Evaluates the embedding similarity between the response and the reference answer
    # Uses the service context's embed_model
    similarity = SemanticSimilarityEvaluator(
        service_context=service_context,
    )
    # Scores the holistic relevance and correctness of the response between 1 and 5
    correctness = CorrectnessEvaluator(
        service_context=service_context,
    )
    # Evaluates if the response is supported by the retrieved context
    faithfulness = FaithfulnessEvaluator(
        service_context=service_context,
    )

    eval_batch_runner = BatchEvalRunner(
        {
            "context_query_relevancy": context_query_relevancy,
            "answer_context_relevancy": answer_context_relevancy,
            "answer_query_relevancy": answer_query_relevancy,
            "semantic_similarity": similarity,
            "correctness": correctness,
            #"faithfulness": faithfulness,
            #"relevancy": relevancy,
        },
        workers=8, show_progress=True
    )

    return eval_batch_runner

async def evaluate_dataset_async(retriever_evaluator, eval_dataset):
    return await retriever_evaluator.aevaluate_dataset(eval_dataset)

def main(args):
    # Load documents
    data_dir = Path(args.data_dir)
    #loader = PDFReader()
    #docs0 = loader.load_data(file=data_dir / "llama2.pdf")

    dir_reader = SimpleDirectoryReader(
        #input_dir='./data',  # TODO: move evaluation dataset to another directory
        input_files=[data_dir / "llama2.pdf"],
        file_extractor={
            ".pdf": UnstructuredReader(),
            ".html": UnstructuredReader(),
        },
    )
    docs0 = dir_reader.load_data()

    doc_text = "\n\n".join([d.get_content() for d in docs0])
    docs = [Document(text=doc_text)]

    # Load evaluation dataset
    eval_dataset = evaluation.QueryResponseDataset.from_json(
        data_dir / "llama2_eval_qr_dataset.json"
    )
    eval_qs = eval_dataset.questions
    ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]

    # service context
    service_context = _get_service_context(
        args.llm_model_id, args.embedding_model_id, args.temperature
    )

    # Build index
    index = _build_index(args, docs, service_context)

    # Retriever
    retriever = index.as_retriever(similarity_top_k=args.top_k)

    # Retrieval metrics
    metrics = ["mrr", "hit_rate"]
    retriever_evaluator = RetrieverEvaluator.from_metric_names(
        metrics, retriever=retriever
    )

    # Get retrieval metrics
    retrieval_eval_results = asyncio.run(evaluate_dataset_async(retriever_evaluator, eval_dataset))

    # Query engine
    query_engine = index.as_query_engine(similarity_top_k=args.top_k)

    # Get predicted responses
    pred_response_objs = eval_utils.get_responses(
        eval_qs, query_engine, show_progress=True
    )

    # Run evaluator
    eval_batch_runner = _get_eval_batch_runner(service_context)
    eval_results = eval_batch_runner.evaluate_responses(
        eval_qs, responses=pred_response_objs, reference=ref_response_strs
    )

    # Get semantic similarity metric
    mean_semantic_similarity = np.array(
        [r.score for r in eval_results["semantic_similarity"]]
    ).mean()

    # Get mean of each metric
    mean_results = {}
    for kev, ev in eval_results.items():
        mean_results[kev] = np.array([r.score for r in ev]).mean()
        print(f"{kev}: {mean_results[kev]}")

    # Save results
    result = RunResult(score=mean_semantic_similarity, params={"chunk_size": args.chunk_size, "top_k": args.top_k})
    with open(os.path.join(args.model_dir, 'result.json'), 'w') as f:
        f.write(result.json())

    #report(mean_semantic_similarity=mean_semantic_similarity)
    report(**mean_results)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # Hyperparameters
    parser.add_argument('--chunk_size', type=int, default=256)
    parser.add_argument('--chunk_overlap', type=int, default=20)
    parser.add_argument('--top_k', type=int, default=1)
    parser.add_argument('--llm_model_id', type=str, default="anthropic.claude-v2:1")
    parser.add_argument('--embedding_model_id', type=str, default="amazon.titan-embed-text-v1")
    parser.add_argument('--temperature', type=float, default=0.0)
    parser.add_argument('--splitter', type=str, default="SentenceSplitter",
                        choices=["SentenceSplitter", "RecursiveCharacterTextSplitter"])

    # SageMaker specific arguments
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--data-dir', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument(f"--{ST_CHECKPOINT_DIR}", type=str)
    parser.add_argument(f"--{ST_INSTANCE_COUNT}", type=str)
    parser.add_argument(f"--{ST_INSTANCE_TYPE}", type=str)

    args = parser.parse_args()

    main(args)

Relevant Logs/Tracebacks

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/opt/conda/lib/python3.10/site-packages/llama_index/ingestion/pipeline.py", line 80, in run_transformations
    nodes = transform(nodes, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/llama_index/core/embeddings/base.py", line 335, in __call__
    embeddings = self.get_text_embedding_batch(
  File "/opt/conda/lib/python3.10/site-packages/llama_index/core/embeddings/base.py", line 256, in get_text_embedding_batch
    embeddings = self._get_text_embeddings(cur_batch)
  File "/opt/conda/lib/python3.10/site-packages/llama_index/embeddings/langchain.py", line 84, in _get_text_embeddings
    return self._langchain_embedding.embed_documents(texts)
AttributeError: 'LangchainEmbedding' object has no attribute '_langchain_embedding'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/opt/ml/code/entry_point.py", line 342, in <module>
    main(args)
  File "/opt/ml/code/entry_point.py", line 272, in main
    index = _build_index(args, docs, service_context)
  File "/opt/ml/code/entry_point.py", line 167, in _build_index
    base_nodes = pipeline.run(
  File "/opt/conda/lib/python3.10/site-packages/llama_index/ingestion/pipeline.py", line 454, in run
    nodes_parallel = p.starmap(
  File "/opt/conda/lib/python3.10/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/opt/conda/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
AttributeError: 'LangchainEmbedding' object has no attribute '_langchain_embedding'

dosubot[bot] commented 9 months ago

🤖

Hey @austinmw, great to see you diving into the depths of LlamaIndex again! How's everything going on your end?

The error message you're encountering, "AttributeError: 'LangchainEmbedding' object has no attribute '_langchain_embedding'", suggests that the '_langchain_embedding' attribute is not found in the 'LangchainEmbedding' object.

In the LlamaIndex codebase, the '_langchain_embedding' attribute is defined in the 'LangchainEmbedding' class and is assigned a value in the '__init__' method. This attribute stores the instance of the LangChain embeddings class, which is used to generate embeddings for queries and documents.

This error could occur if you are trying to access the '_langchain_embedding' attribute before it is initialized. Make sure you are not trying to access this attribute before the 'LangchainEmbedding' object is fully initialized.

If you are still encountering this error, it could be due to a version mismatch. The provided context is for version 0.9.44 of the llama-index library. If you are using a different version of the library, the '_langchain_embedding' attribute may not exist or may have a different name.

Please ensure that you are using the correct version of the library and that the 'LangchainEmbedding' object is fully initialized before trying to access the '_langchain_embedding' attribute.

If you've checked all of these and the issue persists, it would be helpful to see the full code of your script, especially the initialization of the LangchainEmbedding object and the part where you call the _build_index function. This will help to better understand the context in which the error is occurring.

austinmw commented 9 months ago

I've initialized the embedding model on this line: embed_model = LangchainEmbedding(bedrock_embedding)

logan-markewich commented 9 months ago

@austinmw ah, this is related to using multiprocessing :( I had added some (slightly hacky) code to handle the pickling needed for multiprocessing, and it seems like it's destroying the private variable here

will see if I can fix that
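
To illustrate the failure mode described above, here is a minimal sketch in plain Python. It is not the LlamaIndex code: `Wrapper`, `_client`, and `embed` are hypothetical names, and the `__getstate__` hook is a simplified assumption about what a pickling hook that discards private state could look like. It only shows how an object that loses its private attributes during a pickle round trip (which is what happens when it is sent to a worker process) fails with the same kind of AttributeError.

```python
import pickle


class Wrapper:
    """Hypothetical stand-in for a component that keeps a client in a private attribute."""

    def __init__(self, client):
        self.name = "wrapper"
        self._client = client  # private attribute, analogous to _langchain_embedding

    def __getstate__(self):
        # Assumption for illustration: drop every underscore-prefixed
        # attribute before pickling, so private state is discarded.
        return {k: v for k, v in self.__dict__.items() if not k.startswith("_")}

    def __setstate__(self, state):
        # Restore only what survived pickling; __init__ is not called again.
        self.__dict__.update(state)

    def embed(self, texts):
        return [self._client(t) for t in texts]


original = Wrapper(client=len)
# multiprocessing pickles arguments to ship them to workers; emulate that here.
restored = pickle.loads(pickle.dumps(original))

print(original.embed(["abc"]))  # the original object still works

try:
    restored.embed(["abc"])
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)  # the restored copy has lost _client
```

In this sketch the public attribute `name` survives the round trip while `_client` does not, so the error only appears once the copy is actually used inside the worker.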

austinmw commented 9 months ago

@logan-markewich Thanks!

logan-markewich commented 9 months ago

@austinmw I tried playing around with this a bit.

The main logic is in the `__getstate__` and `__setstate__` methods here: https://github.com/run-llama/llama_index/blob/c70abf65102de37d5fd78c2efbf0378de91d3e4e/llama_index/schema.py#L65

I'm not able to get something working without breaking other things. I think the solution here is to try and multiprocess one level higher, or instead of multiprocessing, use async operations (although langchain LLMs do not support async at the moment)

At the end of the day, not everything is pickleable 😢

One other thing to try is also using our bedrock abstractions instead of langchain (ours will support async, and might even work with multiprocessing here)

https://docs.llamaindex.ai/en/stable/examples/llm/bedrock.html#bedrock
https://docs.llamaindex.ai/en/stable/examples/embeddings/bedrock.html#bedrock-embeddings
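
As a rough sketch of the "async instead of multiprocessing" idea mentioned above: the snippet below uses only the standard library, and `FakeAsyncEmbedder`, `aembed`, and `embed_all` are hypothetical names rather than LlamaIndex or LangChain APIs. It assumes the embedding client exposes an awaitable call; under that assumption, requests can run concurrently in a single process, so nothing needs to be pickled.

```python
import asyncio


class FakeAsyncEmbedder:
    """Hypothetical embedder; stands in for any client exposing an async call."""

    async def aembed(self, text: str) -> list[float]:
        await asyncio.sleep(0)  # placeholder for a network round trip
        return [float(len(text))]


async def embed_all(embedder, texts, max_concurrency=4):
    # Bound the number of in-flight requests without spawning processes,
    # so the embedder object never has to cross a process boundary.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def embed_one(text):
        async with semaphore:
            return await embedder.aembed(text)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(embed_one(t) for t in texts))


vectors = asyncio.run(embed_all(FakeAsyncEmbedder(), ["a", "bb", "ccc"]))
print(vectors)
```

The semaphore plays roughly the role that `num_workers` plays in a process pool: it caps concurrency, but all work shares one interpreter and one copy of the client.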

austinmw commented 9 months ago

Hey @logan-markewich, thanks for looking into it. I actually switched to the LangChain wrappers on a recommendation from another issue thread, which said the Bedrock abstractions were not well maintained in LlamaIndex at the time (I can't remember exactly what error I was facing at the moment).

Edit: This was the issue that caused me to switch to LangChain wrappers: https://github.com/run-llama/llama_index/issues/9812

logan-markewich commented 9 months ago

@austinmw I just merged/released some refactors the other day -- supposedly they work (I don't have access to test bedrock)

dosubot[bot] commented 6 months ago

Hi, @austinmw,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. The issue involves a bug causing an AttributeError when running the code and building the index. There have been discussions in the comments about potential solutions, including handling pickling for multiprocessing and considering the use of bedrock abstractions instead of langchain. It was mentioned that some refactors were merged and released, but it's unclear if they address the specific issue.

Could you please let us know if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please comment on the issue to let the LlamaIndex team know. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation. If you have any further questions or updates, feel free to reach out.