stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License
16.73k stars 1.29k forks

[Retrieval] Is the PineconeRM functional? #322

Open pa-t opened 7 months ago

pa-t commented 7 months ago

Overview

I wanted to test out this library for a project, but I have hit so many roadblocks that I am not sure it is functional. Here is the code I have been testing with, built by following the documentation in this repo (I have replaced sensitive information throughout with ...):

import dspy
from dspy.evaluate import Evaluate
from dspy.evaluate.metrics import answer_exact_match
from dspy.retrieve.pinecone_rm import PineconeRM
from dspy.teleprompt import BootstrapFewShot, BootstrapFewShotWithRandomSearch

class RAG(dspy.Module):
    def __init__(self, num_passages: int = 3):
        super().__init__()
        # declare three modules: the retriever, a query generator, and an answer generator
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question: str):
        # generate a search query from the question, and use it to retrieve passages
        search_query = self.generate_query(question=question).search_query
        passages = self.retrieve(query_or_queries=search_query).passages

        # generate an answer from the passages and the question
        return self.generate_answer(context=passages, question=question)

turbo = dspy.OpenAI(
    model="gpt-4-1106-preview",
    api_key="...")

retriever_model = PineconeRM(
    pinecone_index_name="...",
    pinecone_api_key="...",
    pinecone_env="...",
    openai_embed_model="text-embedding-3-small",
    openai_api_key="...",
    k=3)

dspy.settings.configure(
    lm=turbo,
    rm=retriever_model,
)

train = [('...', '...'), ('...', '...'), ('...', '...'), ('...', '...'), ('...', '...'), ('...', '...'), ('...', '...')]
train = [dspy.Example(question=question, answer=answer).with_inputs('question') for question, answer in train]

dev = [('...', '...'), ('...', '...'), ('...', '...'), ('...', '...'), ('...', '...'), ('...', '...'), ('...', '...'), ('...', '...'), ('...', '...'), ('...', '...')]
dev = [dspy.Example(question=question, answer=answer).with_inputs('question') for question, answer in dev]

teleprompter = BootstrapFewShot(metric=answer_exact_match, max_bootstrapped_demos=2)
teleprompter2 = BootstrapFewShotWithRandomSearch(metric=answer_exact_match, max_bootstrapped_demos=2, num_candidate_programs=8, num_threads=32)

rag_compiled = teleprompter.compile(RAG(), trainset=train)

rag_compiled2 = teleprompter2.compile(RAG(), trainset=train, valset=dev)

evaluate_hotpot = Evaluate(devset=dev, metric=answer_exact_match, num_threads=32, display_progress=True, display_table=15)
evaluate_hotpot(rag_compiled)
evaluate_hotpot(rag_compiled2)
rag_compiled("...")
rag_compiled2("...")

Shortlist of errors encountered

Improper check

Filename: dspy/retrieve/pinecone_rm.py Function name: _get_embeddings() Description: The function checks whether torch is installed before checking whether the user has set self.use_local_model. That dependency is not needed when using OpenAI embeddings, but with the current logic DSPy forces users to install it regardless. Reordering the function as below solves the problem:

if not self.use_local_model:
    if OPENAI_LEGACY:
        embedding = openai.Embedding.create(
            input=queries, model=self._openai_embed_model
        )
    else:
        embedding = openai.embeddings.create(
            input=queries, model=self._openai_embed_model
        ).model_dump()
    return [e["embedding"] for e in embedding["data"]]

try:
    import torch
except ImportError as exc:
    raise ModuleNotFoundError(
        "You need to install torch to use a local embedding model with PineconeRM."
    ) from exc

Parameters being passed incorrectly

Filename: dspy/retrieve/pinecone_rm.py Function name: forward() Description: Somewhere up the call chain (I believe in dsp/primitives/search.py -> retrieveEnsemble()), k is being passed to the forward() function of PineconeRM, which does not accept it. I remedied this by adding an unused k to the function definition:

def forward(self, query_or_queries: Union[str, List[str]], k: int) -> dspy.Prediction:

But this was just a short-term fix while I was trying to get the script working for testing.
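A slightly more robust variant of that short-term fix would make k an optional override of the instance default rather than an unused positional parameter. This is only a sketch of the signature pattern, using a hypothetical DummyRM class in place of PineconeRM so it runs standalone:

```python
from typing import List, Optional, Union

class DummyRM:
    """Hypothetical stand-in for PineconeRM, illustrating the signature only."""

    def __init__(self, k: int = 3):
        self.k = k

    def forward(self, query_or_queries: Union[str, List[str]], k: Optional[int] = None) -> dict:
        # Let a caller-supplied k override the instance default instead of being ignored.
        k = k if k is not None else self.k
        # Normalize the input to a list of queries, as PineconeRM.forward does.
        queries = [query_or_queries] if isinstance(query_or_queries, str) else query_or_queries
        # ...a real retriever would fetch the top-k passages per query here...
        return {"queries": queries, "k": k}
```

With this shape, both dspy.Retrieve(k=...) at construction time and a k passed down by retrieveEnsemble() work without a signature mismatch.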

Passages has no value long_text

Filename: dsp/primitives/search.py Function name: retrieve() Description: This line raises an error when accessing the attribute .long_text:

passages = [psg.long_text for psg in passages]

If the passages in this list are dictionaries, shouldn't we be using .get("long_text")? Either way, by this point each passage in the list was already a string, and I am no longer sure whether my debugging is making things better or worse.
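Since retrieval clients apparently disagree on whether a passage is a string, a dict, or an object with a long_text attribute, one defensive option is to normalize at the boundary. This is a hypothetical helper, not part of dspy:

```python
def normalize_passage(psg) -> str:
    """Coerce a passage to its text, whatever shape the retriever returned."""
    if isinstance(psg, str):
        return psg  # already plain text
    if isinstance(psg, dict):
        return psg.get("long_text", "")  # dict-style payload
    # dotdict / object-style payload with a long_text attribute
    return getattr(psg, "long_text", str(psg))
```

Mapping this over the list in retrieve() would tolerate all three shapes instead of assuming attribute access works.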

Looking for some guidance on whether I am wildly off base or whether this PineconeRM is simply not operational. Thanks

Additional Info

My environment uses Python 3.10.13 with dspy-ai==2.1.6 installed.

okhat commented 7 months ago

Thanks for the deep dive! Pinecone is an external contribution, so it might be outdated. It's very possible that you're right about some challenges that require improvements to it.

uahmad235 commented 1 month ago

The issue hasn't been fixed yet. It's better to implement your own retriever: https://dspy-docs.vercel.app/docs/deep-dive/retrieval_models_clients/custom-rm-client#the-dspythonic-way
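Per that doc, a custom RM just needs a forward() that returns a Prediction with a passages field. The sketch below assumes that contract; it uses a hypothetical MyPineconeRM with a toy in-memory corpus and a stand-in Prediction class so it runs without dspy or Pinecone installed (a real implementation would import dspy and query a Pinecone index inside forward()):

```python
from typing import List, Optional, Union

class Prediction:
    """Stand-in for dspy.Prediction so this sketch runs without dspy."""
    def __init__(self, passages: List[str]):
        self.passages = passages

class MyPineconeRM:
    """Hypothetical custom retriever following the 'dspythonic' pattern."""

    def __init__(self, corpus: List[str], k: int = 3):
        self.corpus = corpus
        self.k = k

    def forward(self, query_or_queries: Union[str, List[str]], k: Optional[int] = None) -> Prediction:
        k = k if k is not None else self.k
        queries = [query_or_queries] if isinstance(query_or_queries, str) else query_or_queries
        hits: List[str] = []
        for q in queries:
            # Toy scoring by keyword overlap; a real RM would embed the query
            # and run a Pinecone similarity search here.
            terms = set(q.lower().split())
            scored = sorted(self.corpus, key=lambda p: -len(terms & set(p.lower().split())))
            hits.extend(scored[:k])
        return Prediction(passages=hits)
```

Because forward() owns the return shape end to end, this sidesteps both the k-passing mismatch and the long_text attribute errors described above.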