[Question]: MAP consistently evaluates to zero. #16361

aclifton314 commented 1 month ago

llama-index: 0.10.62, Python: 3.11.9

Hi Llama-Index Community!

I think I am messing something up when trying to calculate Mean Average Precision (MAP), but I am not entirely sure and could use the community's help. Here is some sample code:

Imports:

import weaviate
import random
import numpy as np

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
    StorageContext,
)

from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai_like import OpenAILike
from llama_index.core.query_engine import FLAREInstructQueryEngine
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.core.evaluation import RetrieverEvaluator

A class to create a FLAREInstructQueryEngine:

class MyFlareQueryEngine:
    def __init__(self) -> None:
        '''
        Initialize MyFlareQueryEngine by setting an embedding model and llm,
        creating a WeaviateVectorStore, and building a vector store index
        from a directory of pdfs.
        '''

        self.data_dir = "/home/my/data/dir"

        Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

        # Llama-3-Groq-8B-Tool-Use.gguf served through an OpenAI-compatible endpoint;
        # model, api_base, and api_key are placeholders defined elsewhere.
        self.llm = OpenAILike(
            model=model,
            api_base=api_base,
            api_key=api_key,
            max_tokens=1024,
        )
        Settings.llm = self.llm

        self.client = weaviate.connect_to_local(host="localhost", port=8080)

        vector_store = WeaviateVectorStore(
            weaviate_client=self.client,
            index_name="My_index",
        )

        self._load_documents(vector_store)
        query_engine = self.index.as_query_engine(llm=Settings.llm)
        self.flare_query_engine = FLAREInstructQueryEngine(
            query_engine=query_engine, max_iterations=1, verbose=True
        )

    def _load_documents(self, vectorstore: WeaviateVectorStore):
        '''
        Create a VectorStoreIndex from documents produced by reading pdfs
        with the SimpleDirectoryReader.
        '''

        sdr = SimpleDirectoryReader(self.data_dir)
        self.chunks = sdr.load_data()
        storage_context = StorageContext.from_defaults(vector_store=vectorstore)

        self.index = VectorStoreIndex.from_documents(
            self.chunks, storage_context=storage_context, embed_model=Settings.embed_model
        )
        return

    def get_retriever(self, topk=3):
        '''
        From a VectorStoreIndex, get a VectorIndexRetriever
        '''
        return self.index.as_retriever(similarity_top_k=topk)

A class to evaluate the Mean Average Precision:

class MyEvaluator:
    def __init__(self, chunks) -> None:
        '''
        Initialize MyEvaluator by selecting random chunks
        from the total chunks, generating queries about these chunks using the llm, 
        and creating a list of tuples of the form (query, [chunk_ids relevant to query]).
        '''
        self.random_chunks = random.sample(chunks, 50)
        self.queries = self._generate_queries(self.random_chunks)
        self.q_expected_ids = self._make_expected_id_pairs(self.random_chunks)

    def _generate_queries(self, random_chunks):
        '''
        generate queries from chunks.
        '''

        n_questions_per_chunk = 1

        data_generator = RagDatasetGenerator.from_documents(
            random_chunks, 
            num_questions_per_chunk=n_questions_per_chunk,
            show_progress=True,
            llm=Settings.llm,
        )

        eval_questions = data_generator.generate_questions_from_nodes()
        return eval_questions

    def _make_expected_id_pairs(self, random_chunks):
        '''
        format the inputs into a list of tuples 
        of the form (query, [chunk_ids relevant to query]).
        '''
        q_expected_id_list = []

        for q in self.queries.examples:
            chunk_id = self._get_chunk_id(random_chunks, q.reference_contexts[0])
            tmp_chunk_ids = self._get_random_chunk_ids(random_chunks)
            chunk_ids = [chunk_id]
            chunk_ids.extend(tmp_chunk_ids)
            q_expected_id_list.append((q.query, chunk_ids))

        return q_expected_id_list

    def _get_chunk_id(self, chunks, check_str):
        '''
        For a given string from a chunk, get that chunk's id.
        Falls back to a sentinel id if no chunk matches.
        '''
        for chunk in chunks:
            if chunk.get_text() == check_str and chunk.doc_id is not None:
                return chunk.doc_id
        return '999999'

    def _get_random_chunk_ids(self, random_chunks):
        '''
        this is just to make the chunk_ids list longer.
        '''
        randomlist = random.sample(range(0, 50), 30)
        random_chunk_id_list = [random_chunks[ii].doc_id for ii in randomlist]
        return list(set(random_chunk_id_list))

    def evaluate(self, retriever):
        '''
        Calculate the mean average precision (MAP) over all queries.
        '''
        retriever_eval = RetrieverEvaluator.from_metric_names(
            ["ap"], retriever=retriever
        )

        avg_precisions_vals = []
        tmp_avgs = []
        for eval_tup in self.q_expected_ids:
            ap = retriever_eval.evaluate(eval_tup[0], eval_tup[1])
            avg_precisions_vals.append(ap.metric_vals_dict["ap"])
            tmp_avgs.append(ap.metric_vals_dict)

        mapscore = np.mean(avg_precisions_vals)
        return mapscore

Main:

if __name__ == "__main__":
    flare = MyFlareQueryEngine()
    evaluator = MyEvaluator(flare.chunks)

    retriever = flare.get_retriever(topk=len(flare.chunks))
    map_score = evaluator.evaluate(retriever)
    print('map = {}'.format(map_score))

    flare.client.close()
    print("DONE!")

What I end up getting is map = 0. When I investigated further, I noticed that none of the retrieved_ids produced by BaseRetrievalEvaluator.aevaluate() appear in the expected_ids passed to it. I would have expected the retrieved_ids to be pulled from the Weaviate database somehow.

Any help is much appreciated. Thanks in advance.

dosubot[bot] commented 1 month ago

The issue you're encountering, where the Mean Average Precision (MAP) evaluates to zero, is likely due to a mismatch between the retrieved_ids and expected_ids. Here are a few potential reasons and suggestions to address this:

  1. Document ID Consistency: Ensure that the document IDs (doc_id) used in your expected_ids are consistent with those stored in the Weaviate vector store. If there's any discrepancy in how IDs are assigned or retrieved, it could lead to mismatches.

  2. Query and Retrieval Logic: Verify that the queries generated by RagDatasetGenerator are correctly aligned with the documents in your vector store. The queries should be relevant to the content of the documents to ensure that the retriever can find the correct matches.

  3. Retriever Configuration: Check the configuration of your retriever. Ensure that the similarity_top_k parameter is set appropriately to retrieve a sufficient number of relevant documents. If it's too low, relevant documents might not be retrieved.

  4. Chunk ID Retrieval: In your _get_chunk_id method, ensure that the logic for matching text to chunk IDs is correct. The method should accurately find the corresponding doc_id for the given text. The return of '999999' might indicate a fallback or error case, which could lead to incorrect expected_ids.

  5. Debugging: Add logging or print statements to inspect the retrieved_ids and expected_ids during evaluation (a short sketch follows below). This can help identify where the mismatch occurs.

By addressing these areas, you should be able to identify why the retrieved_ids are not matching the expected_ids and adjust your implementation accordingly.
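
For point 5, something like the following can show exactly where the lists diverge. This is a rough sketch, assuming the retriever and the MyEvaluator instance (here called evaluator) from the question; RetrievalEvalResult exposes both retrieved_ids and expected_ids:

from llama_index.core.evaluation import RetrieverEvaluator

retriever_eval = RetrieverEvaluator.from_metric_names(["ap"], retriever=retriever)

for query, expected_ids in evaluator.q_expected_ids:
    result = retriever_eval.evaluate(query, expected_ids)
    overlap = set(result.retrieved_ids) & set(result.expected_ids)
    # If overlap is always empty, the ids in expected_ids are simply not the ids
    # the retriever returns, which points at an id-scheme mismatch rather than
    # poor retrieval quality.
    print("query     :", query)
    print("expected  :", result.expected_ids)
    print("retrieved :", result.retrieved_ids)
    print("overlap   :", overlap or "NONE")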

logan-markewich commented 1 month ago

@aclifton314 the retrieved ids and expected ids are based on the node ids from your index.

When you load documents, they get split into chunks, and .generate_questions_from_nodes generates a question for each chunk, with the assumption that retrieving with that question should return the associated node id in the top k.

However, it seems like flare.chunks is pointing to the input documents rather than the actual nodes.

A setup like this will probably work:

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=128)

...

sdr = SimpleDirectoryReader(self.data_dir)
documents = sdr.load_data()
self.chunks = splitter(documents)
storage_context = StorageContext.from_defaults(vector_store=vectorstore)

self.index = VectorStoreIndex(
    nodes=self.chunks, storage_context=storage_context, embed_model=Settings.embed_model
)
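
If self.chunks becomes the splitter output, the evaluator helpers that currently read chunk.doc_id would presumably need chunk.node_id instead, since SentenceSplitter produces TextNode objects whose id is exposed as node_id (and node_id is what the retriever reports). A rough sketch of the adjusted helper, under that assumption:

    def _get_chunk_id(self, chunks, check_str):
        # chunks are TextNode objects here, so compare against node_id,
        # which is the id the retriever returns for retrieved nodes.
        for chunk in chunks:
            if chunk.get_text() == check_str:
                return chunk.node_id
        return '999999'  # fallback sentinel, as in the original code
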
aclifton314 commented 1 month ago

@logan-markewich I think I see what you are saying. A Document chunk has its own id, while a Node chunk can have a different id. Since .generate_questions_from_nodes expects Node objects to retrieve ids from, and my expected_ids (as eval_tup[1]) contain ids from Document objects, that would account for the difference in the id lists. Have I understood correctly?
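
A quick way to confirm this, as a sketch assuming the SentenceSplitter setup above, is to print the ids side by side; the retriever returns the node ids, not the Document ids:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("/home/my/data/dir").load_data()
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=128)
nodes = splitter(documents)

print(documents[0].doc_id)   # id of the source Document
print(nodes[0].node_id)      # id of the chunk/node that is stored and retrieved
print(nodes[0].ref_doc_id)   # points back to the source Document's id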