Standardize Vector Search Retrievers

jhyearsley commented 10 months ago

I'm working on a PR to integrate MongoDB Atlas Vector Search, and it's not clear to me where the responsibility for embedding should lie. I see there are the SentenceVectorizer and OpenAIVectorizer classes; shouldn't we just extend that code and make embedding more of a first-class citizen in DSPy? I also see that some of the other integrations (pinecone and chromadb) import openai directly, which seems strange to me given that DSPy already has code to integrate with OpenAI.

Another point of confusion is that the DSPy documentation does not really mention embeddings at all. We have the concept of a Retriever, and RAG is mentioned several times, but it's not obvious how RAG should work without embeddings. From reading through the code, I see the query getting passed into custom modules for RAG and ChainOfThought, but I don't see any embedding step, which is quite confusing. Is embedding supposed to be the responsibility of the user?

Can someone explain the vision for how embeddings and retrieval should be handled in DSPy? I am happy to make PRs to extend the OpenAIVectorizer class to support Azure OpenAI and to help with documentation to make things clearer, but before doing so I'd like to better understand the longer-term vision for how this library should be used.

okhat commented 10 months ago

Yeah, SentenceVectorizer, OpenAIVectorizer, and a few others are all nice integrations by independent contributors.

I'm honestly open to changing my mind, but here's my perspective: indexing (embedding) documents belongs outside DSPy. There are many tools that do it well (e.g., ColBERT, SentenceTransformers, and LlamaIndex are all great for different things).

What should belong inside DSPy are lightweight clients to issue queries to the pre-indexed search collections.

These lightweight wrappers can take the same form as Pyserini here: https://github.com/stanfordnlp/dspy/blob/a08b4ac0aa714862fa5cbe2c8114b77f4de37ea8/dsp/modules/pyserini.py#L8
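A minimal sketch of what such a lightweight client could look like, assuming the index already exists elsewhere and is reachable through a search function. The names `SearchFn`, `LightweightRetriever`, and `make_toy_backend` are hypothetical and are not part of DSPy or Pyserini; the word-overlap backend only stands in for a real search service.

```python
from typing import Callable, List, Optional

# Hypothetical names: nothing here is part of DSPy or Pyserini.
SearchFn = Callable[[str, int], List[str]]


class LightweightRetriever:
    """Thin client: forwards a query to an index that was built elsewhere."""

    def __init__(self, search_fn: SearchFn, k: int = 3) -> None:
        self.search_fn = search_fn
        self.k = k

    def __call__(self, query: str, k: Optional[int] = None) -> List[str]:
        # No indexing, chunking, or embedding happens here.
        return self.search_fn(query, self.k if k is None else k)


def make_toy_backend(passages: List[str]) -> SearchFn:
    """Stand-in for an external search service (ranks by word overlap)."""

    def search(query: str, k: int) -> List[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & set(p.lower().split())), p) for p in passages]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        # Keep only the top-k passages that share at least one word.
        return [p for score, p in ranked[:k] if score > 0]

    return search
```

The wrapper itself holds no retrieval logic, so swapping the backend means passing a different `search_fn`.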

What do you think? Does that help for MongoDB Atlas?

okhat commented 10 months ago

@jhyearsley Any thoughts on this?

jhyearsley commented 10 months ago

> I'm honestly open to changing my mind, but here's my perspective: indexing (embedding) documents belongs outside DSPy.

I think it's a very reasonable perspective.

> What should belong inside DSPy are lightweight clients to issue queries to the pre-indexed search collections.

I'm not sure I agree with this if we take the first statement seriously (which I think is probably correct). From my perspective, the code to query, embed, and chunk tends to be quite simple with existing libraries (or drivers, in the case of vector DBs), so why include any one part unless we include all the others?

okhat commented 10 months ago

Hmm your last sentence can be understood in two opposite ways: for/against.

Basically I’m suggesting that any non-trivial retrieval code doesn’t belong inside dspy, neither for indexing nor for search.

But lightweight clients that wrap the search process can/should be in dspy, like pinecone, etc., because they need to be DSPy modules.

jhyearsley commented 10 months ago

I thought pinecone only stored vectors. Why would you return vectors to DSPy? Am I missing a bigger-picture idea?

okhat commented 10 months ago

Basically whatever service or external index is out there, dspy’s role is to enable in the most lightweight way possible a module that takes queries and returns top passages.

Vectors aren’t exposed to the dspy user in general, and if they can be avoided internally (e.g., as we do for colbert), we can/should avoid them.

Some modules need to encode queries though, so that much is fine.
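To make the "queries in, passages out" contract concrete, here is a rough sketch in which the query is encoded internally and vectors never reach the caller. `DenseRetriever`, `Embedder`, and `cosine` are hypothetical names, and the in-memory list of pre-computed vectors merely stands in for an external index built with some other tool.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical type alias: any function mapping text to a vector.
Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class DenseRetriever:
    """Takes a query string, returns passages; vectors stay internal."""

    def __init__(
        self,
        index: List[Tuple[Sequence[float], str]],  # (vector, passage), built elsewhere
        embed_query: Embedder,
    ) -> None:
        self._index = index
        self._embed_query = embed_query

    def __call__(self, query: str, k: int = 3) -> List[str]:
        # Only the query is encoded here; documents were embedded outside.
        q = self._embed_query(query)
        ranked = sorted(self._index, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```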

jhyearsley commented 10 months ago

> Basically whatever service or external index is out there, dspy’s role is to enable in the most lightweight way possible a module that takes queries and returns top passages.

This is a good summary and should be easy to comply with, appreciate you engaging.

So in my case MongoDB Atlas will take a query and return top passages (which will require an embedder). This means I need a dspy.Retrieve module. All good up to this point. Where I get confused is why this custom module should import more dependencies into dspy if the dependencies already exist in the project (e.g. openai). And what if I decide to switch embedding providers: should I be adding more custom logic to allow the provider to be passed to the constructor? I think your suggestion of taking inspiration from Pyserini comes into play here.
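One possible answer to the provider question is to pass the embedder to the constructor as a plain callable, so the retriever imports no embedding library itself. The sketch below is an assumption-laden illustration, not the module from the PR mentioned in this thread: `AtlasVectorRetriever`, the default index name, and the field names are all hypothetical. It relies only on a collection object with an `aggregate` method (as PyMongo collections have) and on the Atlas `$vectorSearch` aggregation stage.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

# Hypothetical alias: OpenAI, Azure OpenAI, or a local model could all fit.
Embedder = Callable[[str], Sequence[float]]


class AtlasVectorRetriever:
    """Hypothetical sketch; index and field names are assumptions."""

    def __init__(
        self,
        collection: Any,  # anything with .aggregate(pipeline)
        embed: Embedder,
        index_name: str = "vector_index",
        vector_field: str = "embedding",
        text_field: str = "text",
        k: int = 3,
    ) -> None:
        self.collection = collection
        self.embed = embed
        self.index_name = index_name
        self.vector_field = vector_field
        self.text_field = text_field
        self.k = k

    def build_pipeline(self, query: str, k: int) -> List[Dict[str, Any]]:
        return [
            {
                "$vectorSearch": {
                    "index": self.index_name,
                    "path": self.vector_field,
                    "queryVector": list(self.embed(query)),
                    "numCandidates": k * 10,  # arbitrary oversampling factor
                    "limit": k,
                }
            },
            # Return only the passage text, never the stored vectors.
            {"$project": {"_id": 0, self.text_field: 1}},
        ]

    def __call__(self, query: str, k: Optional[int] = None) -> List[str]:
        k = self.k if k is None else k
        docs = self.collection.aggregate(self.build_pipeline(query, k))
        return [doc[self.text_field] for doc in docs]
```

Switching providers then means passing a different `embed` function; the retriever code stays unchanged.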

But probably my confusion is too abstract at this point to be productive. How about I raise the PR I previously mentioned to make things more concrete and we continue the discussion there 😃

Feel free to close the issue unless you want to add anything else

fabiannagel commented 7 months ago

I'm new to dspy, but here are some thoughts:

Custom indexing logic determines a certain database schema, and if this logic does not belong in DSPy (I agree), neither does the retrieval part (IMHO), especially when retrievers are opinionated about the database schema they expect. I don't see why one would want to introduce all kinds of client dependencies just to wrap their calls, potentially introducing inflexibility for the sake of convenience in a specific use case.

For example, PineconeRM expects full text inside the metadata, and the name of the property is not configurable. More importantly, the entire class is hard-coded towards OpenAI or local models (and even those with some very specific assumptions). I see the same thing in other retrieval classes.

My own use case (using Pinecone) differs from the above along many dimensions, and I will just end up writing my own retriever. What I would love to see as a developer is a precise interface of some sort, which defines the query: str -> dspy.Prediction relation that dspy expects. I don't see the need for dspy to support every kind of vector storage out there.
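As a rough illustration of the kind of interface described above, the sketch below expresses the `query: str -> Prediction` relation as a `typing.Protocol` and makes the metadata property name configurable. Everything here is hypothetical: `Prediction` is a stand-in dataclass so the example runs without DSPy installed, and `Retriever`, `MetadataRetriever`, and the match layout (`{"metadata": {...}}`) are assumptions, not the actual DSPy or Pinecone APIs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Protocol, runtime_checkable


@dataclass
class Prediction:
    """Stand-in for dspy.Prediction so the sketch runs without DSPy."""
    passages: List[str]


@runtime_checkable
class Retriever(Protocol):
    """Hypothetical contract: a query string in, a Prediction out."""

    def forward(self, query: str, k: int = 3) -> Prediction: ...


Match = Dict[str, Any]  # assumed shape: {"id": ..., "metadata": {...}}


class MetadataRetriever:
    """Reads passage text from a configurable metadata property."""

    def __init__(
        self,
        search: Callable[[str, int], List[Match]],
        text_field: str = "text",
    ) -> None:
        self.search = search
        self.text_field = text_field

    def forward(self, query: str, k: int = 3) -> Prediction:
        matches = self.search(query, k)
        passages = [m["metadata"][self.text_field] for m in matches]
        return Prediction(passages=passages)
```

Any class with a matching `forward` method satisfies the protocol, so a custom retriever needs no vendor-specific base class.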

edit: Alright, I see the interface is there in the form of dspy.Retrieve, although it's not 100% clear to me what the default forward method implements here, or that this is the one to override. Some custom retrievers do it anyway, but ones like ColBERTv2 don't even inherit from dspy.Retrieve at all...?

lautaro450 commented 4 months ago

Hi @jhyearsley @okhat, I'm new to dspy and I was wondering whether MongoDB Atlas Vector Search is available as a retriever. I was unable to find any documentation on this topic besides this post. Is it already available, or not yet?

stanfordnlp / dspy

Standardize Vector Search Retrievers #250