stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy.ai
MIT License

Can you clarify what role the ColBERT server plays? #129

Closed: jhyearsley closed this issue 10 months ago

jhyearsley commented 1 year ago

It's not obvious to me from the documentation -- is a ColBERT server required to use DSPy? I think the answer is yes, but I'm looking for a definitive answer. Either way, I think the documentation could make it clearer how ColBERT fits into the bigger-picture system design.

More generally, I think the documentation could describe the bigger-picture concepts more clearly. I can definitely buy the major premises of the library, but I'm having trouble understanding the lifecycle of a query and an index and how that depends on an LLM, a vector database, and ColBERT. I would like to see a simpler description of what is happening behind the scenes; e.g., a made-up explanation would be something like:

During training, question-answer pairs are sampled from the dataset (how?) and prompts are updated automatically (how?) to maximize some cost function (how?). During inference, a question is sent to the LLM for embedding and then compared against a vector DB (which vector DB? is ColBERT involved?) and a ranking algorithm is applied to do x, y, z...

I'd recommend creating a higher level sequence (system) diagram. Happy to help make this happen, I just need to figure out the answers myself first :)

okhat commented 1 year ago

Hey @jhyearsley !

Great question. This does seem confusing right now.

ColBERT is just a retriever. (It happens to be one that we built and one that we like, but that's all.)

This server hosts a search index over Wikipedia. If you need to search over Wikipedia, that's pretty handy. If you don't need to search over Wikipedia, you don't need it at all.
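
To make that concrete, pointing DSPy at this hosted index is one line of configuration. A minimal sketch (the URL is the public demo endpoint used in the intro notebook; the LM is whatever you already use):

```python
import dspy

# The hosted ColBERTv2 server is just one possible retriever (rm).
# If you don't need Wikipedia search, configure a different rm, or none at all.
colbertv2_wiki = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
turbo = dspy.OpenAI(model='gpt-3.5-turbo')

dspy.settings.configure(lm=turbo, rm=colbertv2_wiki)
```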

okhat commented 1 year ago

The simplest explanation I wrote of DSPy was on HackerNews.

It said this:

You give DSPy (1) your free-form code with declarative calls to LMs, (2) a few inputs [labels optional], and (3) some validation metric [e.g., sanity checks].

It simulates your code on the inputs. When there's an LM call, it will make one or more simple zero-shot calls that respect your declarative signature. Think of this like a more general form of "function calling" if you will. It's just trying out things to see what passes your validation logic, but it's a highly-constrained search process.

The constraints enforced by the signature (per LM call) and the validation metric allow the compiler [with some metaprogramming tricks] to gather "good" and "bad" examples of execution for every step in which your code calls an LM. Even if you have no labels for it, because you're just exploring different pipelines. (Who has time to label each step?)

For now, we throw away the bad examples. The good examples become potential demonstrations. The compiler can now do an optimization process to find the best combination of these automatically bootstrapped demonstrations in the prompts. Maybe the best on average, maybe (in principle) the predicted best for a specific input. There's no magic here, it's just optimizing your metric.

The same bootstrapping logic lends itself (with more internal metaprogramming tricks, which you don't need to worry about) to finetuning models for your LM calls, instead of prompting.

In practice, this works really well because even tiny LMs can do powerful things when they see a few well-selected examples.
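
To make (1), (2), and (3) concrete, here is a rough sketch in code. The program, examples, and metric below are made up for illustration, not taken from the thread:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# (1) Free-form code with a declarative LM call. The signature
# "question -> answer" declares the inputs and outputs; DSPy builds the prompt.
qa = dspy.ChainOfThought("question -> answer")

# (2) A few inputs (labels optional). These examples are invented.
trainset = [
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
]

# (3) A validation metric, e.g. a simple sanity check.
def validate_answer(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# The compiler simulates the program on the inputs, keeps traces that pass
# the metric as demonstrations, and optimizes which ones go into the prompt.
teleprompter = BootstrapFewShot(metric=validate_answer)
compiled_qa = teleprompter.compile(qa, trainset=trainset)
```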


Does this sound useful?

jhyearsley commented 1 year ago

@okhat yes, this high-level description is super helpful! Thanks for the quick response, much appreciated. The part that is still a bit hazy to me is how DSPy interacts with the LLM and the vector DB during the bootstrapping phase. I understand my chunked data would be stored in a vector DB, and that an LLM can be used to augment retrieval from that vector DB, but I'm having a hard time visualizing how and when DSPy integrates with the LLM and vector DB (e.g., during bootstrapping, is DSPy reading/writing data to the vector DB?).

Another point of clarification on ColBERT as a "retriever": my impression from skimming the ColBERTv2 paper is that there is some additional secret sauce re: single-vector vs. multi-vector representations. Does that mean that if I want the sauce I need to either use ColBERT or reimplement the methods with my own vector DB? Or is the sauce built into DSPy no matter what?

I saw another issue that mentioned adding Pinecone, and I'd like to add MongoDB Atlas Vector Search as an integration to the library (since that's the vector DB I will be using while evaluating the tool).

karrtikiyer commented 1 year ago

@jhyearsley: you can look at this example to see how to make this work without ColBERT and with your own data: https://github.com/stanfordnlp/dspy/blob/main/pyserini.ipynb

karrtikiyer commented 1 year ago

"I saw another issue that mentioned adding Pinecone, and I'd like to add MongoDB Atlas Vector Search as an integration to the library"

I think many DSP retriever integrations are getting added; I saw a pull request for Pinecone as well.

okhat commented 1 year ago

Yes, we have support for several different retrievers. ColBERT is not essential to DSPy in any way. We just like it!
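
For a retriever DSPy doesn't ship with (e.g. the MongoDB Atlas Vector Search integration mentioned above), the existing RM integrations follow a simple pattern: a small wrapper that takes a query and returns the top-k passages. A hypothetical sketch, modeled on the bundled RM clients (the class name, the `vector_search` call, and the return shape are assumptions, not an official API):

```python
import dspy

class MongoDBAtlasRM(dspy.Retrieve):  # hypothetical wrapper, not part of DSPy
    def __init__(self, collection, k=3):
        super().__init__(k=k)
        self.collection = collection  # your own Atlas collection / search client

    def forward(self, query, k=None):
        k = k if k is not None else self.k
        # Run your own vector search and return the top-k passage texts.
        hits = self.collection.vector_search(query, limit=k)  # hypothetical client call
        return dspy.Prediction(passages=[hit["text"] for hit in hits])

# Plugged in the same way as ColBERTv2:
# dspy.settings.configure(lm=turbo, rm=MongoDBAtlasRM(my_collection))
```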

okhat commented 1 year ago

"during bootstrapping is DSPy reading / writing data to the Vector DB?"

We never write to a vector DB internally. You can do that to add documents to search, if you want.

We will only search things in the vector DB / retrieval index. But a lot of reads and writes happen to the cache on disk and to some internal representations in the compiler (in memory).
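
In other words, retrieval inside a DSPy program is a read-only call against whatever rm is configured. A typical RAG module (along the lines of the intro notebook example) only ever asks the index for the top-k passages, during both compilation and inference:

```python
import dspy

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)  # read-only search
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Bootstrapping/compilation just replays this forward pass, so the
        # vector DB only ever sees search queries, never writes.
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)
```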