Zaaccckkkk commented 4 months ago

Bug Description

I tried to reproduce the example at https://github.com/run-llama/llama_parse/blob/main/examples/demo_advanced.ipynb. It succeeded once, but every run after that failed. The error occurs regardless of which file I use. The error message:

"""
python llamaparse.py
Started parsing the file under job_id cac11eca-1058-4312-a42e-3203694e3962
1it [00:00, 5236.33it/s]
100%	████████████████████████████████████████████████████████████████████████████████████████████████████████	1/1 [00:07<00:00, 7.19s/it]
Disability Category	Participants	Ballots Completed
Blind	5	1	4	34.5%, n=1	1199 sec, n=1
Low Vision	5	2	3	98.3% n=2	1716 sec, n=3
				(97.7%, n=3)	(1934 sec, n=2)
Dexterity	5	4	1	98.3%, n=4	1672.1 sec, n=4
Mobility	3	3	0	95.4%, n=3	1416 sec, n=3

2
Traceback (most recent call last):
  File "/Users/wuzhentian/Desktop/LlamaParse_play/llamaparse.py", line 104, in <module>
    response_1 = raw_query_engine.query(query)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wuzhentian/Desktop/LlamaParse_play/Llama_venv/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py", line 230, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wuzhentian/Desktop/LlamaParse_play/Llama_venv/lib/python3.11/site-packages/llama_index/core/base/base_query_engine.py", line 52, in query
    query_result = self._query(str_or_query_bundle)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wuzhentian/Desktop/LlamaParse_play/Llama_venv/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py", line 230, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wuzhentian/Desktop/LlamaParse_play/Llama_venv/lib/python3.11/site-packages/llama_index/core/query_engine/retriever_query_engine.py", line 189, in _query
    nodes = self.retrieve(query_bundle)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wuzhentian/Desktop/LlamaParse_play/Llama_venv/lib/python3.11/site-packages/llama_index/core/query_engine/retriever_query_engine.py", line 145, in retrieve
    return self._apply_node_postprocessors(nodes, query_bundle=query_bundle)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wuzhentian/Desktop/LlamaParse_play/Llama_venv/lib/python3.11/site-packages/llama_index/core/query_engine/retriever_query_engine.py", line 138, in _apply_node_postprocessors
    nodes = node_postprocessor.postprocess_nodes(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wuzhentian/Desktop/LlamaParse_play/Llama_venv/lib/python3.11/site-packages/llama_index/core/postprocessor/types.py", line 56, in postprocess_nodes
    return self._postprocess_nodes(nodes, query_bundle)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wuzhentian/Desktop/LlamaParse_play/Llama_venv/lib/python3.11/site-packages/llama_index/core/instrumentation/dispatcher.py", line 230, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/wuzhentian/Desktop/LlamaParse_play/Llama_venv/lib/python3.11/site-packages/llama_index/postprocessor/flag_embedding_reranker/base.py", line 84, in _postprocess_nodes
    scores = self._model.compute_score(query_and_nodes)
             ^^^^^^^^^^^
AttributeError: 'FlagEmbeddingReranker' object has no attribute '_model'. Did you mean: 'model'?
"""

Version

0.10.40

Steps to Reproduce

requirements.txt:

llama-index-llms-huggingface
llama-index-embeddings-huggingface
transformers
accelerate
bitsandbytes
llama-index
llama-index-core==0.10.50.post1
llama-index-postprocessor-flag-embedding-reranker
git+https://github.com/FlagOpen/FlagEmbedding.git
llama-parse
python-dotenv
llama-index-embeddings-openai

llamaparse.py:

import os
from dotenv import load_dotenv, find_dotenv
import nest_asyncio
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex, Settings
from llama_parse import LlamaParse
from copy import deepcopy
from llama_index.core.schema import TextNode
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
from llama_index.core import SimpleDirectoryReader
import pickle

# Load environment variables from .env file
load_dotenv(find_dotenv())
secret_value_0 = os.getenv('LLAMACLOUD_API_KEY')
secret_value_1 = os.getenv('OPENAI_API_KEY')

# Apply nest_asyncio for running async code in sync environment
nest_asyncio.apply()

# Initialize embedding and language models
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-3.5-turbo-0125")
Settings.llm = llm
Settings.embed_model = embed_model

# Load documents using LlamaParse
documents = LlamaParse(result_type="markdown", api_key=secret_value_0).load_data("/Users/wuzhentian/Desktop/example_table.pdf")

def get_page_nodes(docs, separator="\n---\n"):
    """Split each document into page nodes by separator."""
    nodes = []
    for doc in docs:
        doc_chunks = doc.text.split(separator)
        for doc_chunk in doc_chunks:
            node = TextNode(
                text=doc_chunk,
                metadata=deepcopy(doc.metadata),
            )
            nodes.append(node)
    return nodes

# Parse documents into nodes
page_nodes = get_page_nodes(documents)
node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

# Print the content of the first node
print(page_nodes[0].get_content())

# Create vector index with nodes
recursive_index = VectorStoreIndex(nodes=base_nodes + objects + page_nodes)

# Cache the reranker model
reranker_filename = 'reranker_model.pkl'

def save_model(model, filename):
    with open(filename, 'wb') as f:
        pickle.dump(model, f)

def load_model(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

# Check if cached reranker model exists
if os.path.exists(reranker_filename):
    reranker = load_model(reranker_filename)
else:
    reranker = FlagEmbeddingReranker(
        top_n=5,
        model="BAAI/bge-reranker-large",
    )
    save_model(reranker, reranker_filename)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker], verbose=True
)

print(len(nodes))

# Setup the raw query engine
file_path = "/Users/wuzhentian/Desktop/example_table.pdf"
if not os.path.exists(file_path):
    raise ValueError(f"File {file_path} does not exist.")

reader = SimpleDirectoryReader(input_files=[file_path])
base_docs = reader.load_data()
raw_index = VectorStoreIndex.from_documents(base_docs)
raw_query_engine = raw_index.as_query_engine(
    similarity_top_k=5, node_postprocessors=[reranker]
)

# Now run your query
query = "How many blind participants?"
response_1 = raw_query_engine.query(query)
print("\nBasic Query Engine")
print(response_1)

response_2 = recursive_query_engine.query(query)
print("\nNew LlamaParse+ Recursive Retriever Query Engine")
print(response_2)

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 4 months ago

Hello, @Zaaccckkkk! I'm here to help you with any bugs or questions you have. If you need assistance, feel free to ask.

To fix the AttributeError: 'FlagEmbeddingReranker' object has no attribute '_model', ensure that the _model attribute is properly initialized in the __init__ method of the FlagEmbeddingReranker class. Here are the steps to resolve the issue:

Ensure Proper Initialization: Make sure the _model attribute is initialized correctly in the FlagEmbeddingReranker class. The initialization should look like this:

def __init__(
    self,
    top_n: int = 2,
    model: str = "BAAI/bge-reranker-large",
    use_fp16: bool = False,
) -> None:
    try:
        from FlagEmbedding import FlagReranker
    except ImportError:
        raise ImportError(
            "Cannot import FlagReranker package, please install it: "
            "pip install git+https://github.com/FlagOpen/FlagEmbedding.git"
        )
    self._model = FlagReranker(
        model,
        use_fp16=use_fp16,
    )
    super().__init__(top_n=top_n, model=model, use_fp16=use_fp16)

Check the Import and Installation: Ensure that the FlagEmbedding package is correctly installed and the FlagReranker class is correctly imported. You can install the package using:
```
pip install git+https://github.com/FlagOpen/FlagEmbedding.git
```
Correct Instantiation: Verify that the FlagEmbeddingReranker object is instantiated correctly in your script. For example:
```
reranker = FlagEmbeddingReranker(top_n=5, model="BAAI/bge-reranker-large")
```

Define _model as a Private Attribute: Ensure that the _model attribute is defined as a PrivateAttr in the FlagEmbeddingReranker class:

from typing import Any

from llama_index.core.bridge.pydantic import Field, PrivateAttr
from llama_index.core.postprocessor.types import BaseNodePostprocessor

class FlagEmbeddingReranker(BaseNodePostprocessor):
    """Flag Embedding Reranker."""

    model: str = Field(description="BAAI Reranker model name.")
    top_n: int = Field(description="Number of nodes to return sorted by score.")
    use_fp16: bool = Field(description="Whether to use fp16 for inference.")
    _model: Any = PrivateAttr()

By following these steps, you should be able to resolve the AttributeError and ensure that the _model attribute is properly initialized and accessible within the FlagEmbeddingReranker class ^[1]^[2].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 4 months ago

@Zaaccckkkk you can't pickle an embedding model 😅
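A minimal sketch of why the cached reranker comes back without `_model`. It assumes, without confirmation from this thread, that the postprocessor's pickling keeps public fields such as `model` and `top_n` but drops private, underscore-prefixed attributes, which would match the "Did you mean: 'model'?" hint in the traceback. `ToyReranker` is a hypothetical stand-in, not the real `FlagEmbeddingReranker`.

```python
import pickle


class ToyReranker:
    """Hypothetical stand-in for a reranker that wraps a loaded model."""

    def __init__(self, model: str, top_n: int) -> None:
        self.model = model      # public config: the model name
        self.top_n = top_n      # public config: how many nodes to keep
        self._model = object()  # stands in for the loaded model weights

    def __getstate__(self) -> dict:
        # Assumption: only public fields are serialized, private ones are dropped.
        return {k: v for k, v in self.__dict__.items() if not k.startswith("_")}


original = ToyReranker(model="BAAI/bge-reranker-large", top_n=5)
restored = pickle.loads(pickle.dumps(original))

print(hasattr(original, "_model"))  # True
print(hasattr(restored, "_model"))  # False
print(restored.model, restored.top_n)
```

Under that assumption, the simplest fix for the script above is to delete the pickle cache and construct `FlagEmbeddingReranker(top_n=5, model="BAAI/bge-reranker-large")` on every run.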

Zaaccckkkk commented 4 months ago

Right, after deleting the pickle code, it worked once. However, it fails on every subsequent run. Error message:
"
Error while parsing the file '/Users/wuzhentian/Desktop/example_table.pdf':
Traceback (most recent call last):
  File "/Users/wuzhentian/Desktop/LlamaParse_play/llamaparse.py", line 56, in <module>
    print("the page node is:", page_nodes[0].get_content())
IndexError: list index out of range
"

Zaaccckkkk commented 4 months ago

It works again after switching to a different Internet connection. Could this error be caused by a poor Internet connection?

logan-markewich commented 4 months ago

I think so, seems like there were zero page nodes
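One way to make this failure easier to spot is to retry the load and fail with a clear message instead of an `IndexError`. This is only a sketch: it assumes the parser returns an empty list when parsing fails (which is what zero page nodes suggests), and `load_with_retry` and `flaky_loader` are hypothetical names used for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def load_with_retry(
    load_fn: Callable[[], List[T]],
    attempts: int = 3,
    delay_seconds: float = 2.0,
) -> List[T]:
    """Call load_fn until it returns a non-empty list or attempts run out."""
    for attempt in range(1, attempts + 1):
        result = load_fn()
        if result:
            return result
        if attempt < attempts:
            time.sleep(delay_seconds)  # wait before trying again
    raise RuntimeError(f"Loader returned no documents after {attempts} attempts")


# Demo with a fake loader that returns nothing twice, then succeeds.
calls = {"count": 0}


def flaky_loader() -> List[str]:
    calls["count"] += 1
    return [] if calls["count"] < 3 else ["page 1 text"]


documents = load_with_retry(flaky_loader, attempts=3, delay_seconds=0.0)
print(documents)  # ['page 1 text']
```

In the script from this issue, `load_fn` could be a lambda wrapping the existing `LlamaParse(result_type="markdown", api_key=secret_value_0).load_data(...)` call, so that `page_nodes[0]` is only reached when at least one document came back.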

run-llama / llama_index

[Bug]: 'FlagEmbeddingReranker' object has no attribute '_model' #14567
