stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy.ai
MIT License

[Proposal] Interface Metadata Filters in dspy.Retrieve classes #313

Open CShorten opened 9 months ago

CShorten commented 9 months ago

What?

Interface metadata filters in dspy.Retrieve classes such as the Weaviate / Mongo / Pinecone / Qdrant / Chroma RMs.

Why?

Symbolic filters can be used to improve vector search results.

For example, we may only want podcast clips about "DSPy" where the "speaker" is "Omar Khattab".

How?

  1. Add **kwargs to the overloaded forward call.
  2. Check whether filters is passed in **kwargs.
  3. If so, interface the filter with the respective Python client. For example, it would look like this in Weaviate:
# Note: this example uses the Weaviate Python v3 client.
# Parsing `filter` and the operator / value type from the call to `forward` is omitted here;
# that would be abstracted inside the WeaviateRM class.
import json

response = (
    client.query
    .get(self.collection_name, ["content"])
    .with_where({
        "path": [filter],            # property to filter on, e.g. "speaker"
        "operator": parsedOperator,  # e.g. "Equal"
        filterType: filterValue      # e.g. "valueText": "Omar Khattab"
    })
    .with_limit(3)
    .do()
)
print(json.dumps(response, indent=2))
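The parsing step omitted above could be sketched as a small helper that turns a `filters` dict into a Weaviate v3 style `where` clause. The helper names `build_where_filter` and `_value_key` are hypothetical, and the sketch assumes equality-only filters; it is not part of the existing WeaviateRM class.

```python
from typing import Any, Dict


def _value_key(value: Any) -> str:
    """Pick the `value<Type>` key for a Python value (hypothetical helper)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "valueBoolean"
    if isinstance(value, int):
        return "valueInt"
    if isinstance(value, float):
        return "valueNumber"
    # Assumption: text properties use "valueText"; older schemas may expect "valueString".
    return "valueText"


def build_where_filter(filters: Dict[str, Any]) -> Dict[str, Any]:
    """Translate {"property": value} pairs into an equality `where` filter."""
    if not filters:
        raise ValueError("filters must contain at least one property")
    operands = [
        {"path": [prop], "operator": "Equal", _value_key(value): value}
        for prop, value in filters.items()
    ]
    # A single condition is used directly; several are combined with And.
    if len(operands) == 1:
        return operands[0]
    return {"operator": "And", "operands": operands}
```

The returned dict could then be handed to `.with_where(...)` in the query above.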

So in the forward pass of DSPy Modules, we would see something like:

contexts = self.weaviate_retriever(query, filters={"speaker": "Omar Khattab"})

Additional Comments

In the future we may also want to interface a filter only without a search query. For example, if we want to see the most recent 2 podcasts without any sorting based on relevance to a query.
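A filter-only lookup of that kind might look roughly like the sketch below. It works on an in-memory list of dicts as a stand-in for a collection; the function name `filter_only` and the `published` field are assumptions for illustration, and a real RM would push the filter and sort down to the database.

```python
from typing import Any, Dict, List


def filter_only(
    records: List[Dict[str, Any]],
    filters: Dict[str, Any],
    sort_by: str,
    limit: int,
) -> List[Dict[str, Any]]:
    """Return the `limit` newest records matching every filter, with no query involved."""
    matches = [
        record
        for record in records
        if all(record.get(prop) == value for prop, value in filters.items())
    ]
    # Order by the chosen field (e.g. a publish date), newest first, not by relevance.
    matches.sort(key=lambda record: record[sort_by], reverse=True)
    return matches[:limit]
```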

okhat commented 9 months ago

100%. Would love to support this. Does it need to be a general interface? I guess so because dspy.Retrieve is currently provider-agnostic.

The easy way to do this right away is to use the Weaviate class inside the module directly. That way it’s used as a tool, not as a retriever (not plug n play with other retrievers and will not receive any automatic optimization for retrieval if we ever add any). But currently all optimizers focus on the LM not the retriever itself, so there’s no real harm in this except interoperability.

What do you think Connor? Do we just pass **kwargs to the underlying provider anyway? And only some providers will implement all features like filters?

CShorten commented 9 months ago

[Tool vs. Retrieve]

Ah, I think a very interesting distinction could be made between dspy.Retrieve and dspy.Tool.

Can you please catch me up on how the Python Interpreter is interfaced? Or maybe a simple calculator is a better example.

Maybe Retrieve inherits Tool?

[kwargs]

I think **kwargs gives us the most flexibility to, exactly as you mention, offer features supported in some retrievers and not others while keeping a fairly standard interface. I think if we did it this way it would have the lowest risk of breaking changes with whatever we want to do next.
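As a rough sketch of what I mean, with plain Python classes standing in for the RMs: the names `BaseRetriever`, `FilteringRetriever`, `PlainRetriever` and `supported_options` are all hypothetical, and a keyword match stands in for vector search. Each provider declares which extra options it understands, and anything else fails loudly rather than being silently dropped.

```python
from typing import Any, Dict, FrozenSet, List, Optional


class BaseRetriever:
    """Hypothetical base class: extra options arrive through **kwargs."""

    supported_options: FrozenSet[str] = frozenset()

    def __call__(self, query: str, k: int = 3, **kwargs: Any) -> List[str]:
        unsupported = set(kwargs) - self.supported_options
        if unsupported:
            raise NotImplementedError(
                f"{type(self).__name__} does not support: {sorted(unsupported)}"
            )
        return self.forward(query, k=k, **kwargs)

    def forward(self, query: str, k: int = 3, **kwargs: Any) -> List[str]:
        raise NotImplementedError


class FilteringRetriever(BaseRetriever):
    """A provider that understands `filters`."""

    supported_options = frozenset({"filters"})

    def __init__(self, docs: List[Dict[str, Any]]):
        self.docs = docs

    def forward(
        self, query: str, k: int = 3, filters: Optional[Dict[str, Any]] = None
    ) -> List[str]:
        filters = filters or {}
        hits = [
            doc["content"]
            for doc in self.docs
            if query.lower() in doc["content"].lower()
            and all(doc.get(prop) == value for prop, value in filters.items())
        ]
        return hits[:k]


class PlainRetriever(BaseRetriever):
    """A provider with no filter support; it only takes the query and k."""

    def __init__(self, docs: List[Dict[str, Any]]):
        self.docs = docs

    def forward(self, query: str, k: int = 3) -> List[str]:
        hits = [d["content"] for d in self.docs if query.lower() in d["content"].lower()]
        return hits[:k]
```

Whether an unsupported option should raise or just be ignored is a design choice; raising keeps a module from quietly running unfiltered after a retriever swap.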

[Interface with DSPy Compiler]

I am imagining you could optimize a metadata filter with something like this:

class QueryToFilter(dspy.Module):
  def __init__(self):
    super().__init__()
    # Probably better to use a Signature for this one that describes the cardinality of the filter
    self.query_to_filter = dspy.Predict("query -> metadata_filter_speaker")
    self.retrieve = dspy.Retrieve(k=3)
    # ...

  def forward(self, question):
    # Predict is called with keyword arguments that match the signature's input field
    filter = self.query_to_filter(query=question).metadata_filter_speaker
    # Interface filter cardinality with DSPy Assertions
    dspy.Assert(filter, "Filter must be one of ['Omar Khattab', 'Bob van Luijt', 'Etienne Dilocker', ...]")
    contexts = self.retrieve(query=question, filter=filter)
    # ...
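For the cardinality check itself, a minimal sketch of the membership test is below. The helper name `match_allowed_value` is hypothetical, the three speakers are just the examples from the message above, and the sketch assumes `dspy.Assert` takes the boolean check as its first argument.

```python
from typing import Iterable, Optional


def match_allowed_value(predicted: str, allowed: Iterable[str]) -> Optional[str]:
    """Return the canonical allowed value matching `predicted`, or None if there is none."""
    # Normalize whitespace and case, since LM output is rarely byte-for-byte exact.
    normalized = predicted.strip().casefold()
    for candidate in allowed:
        if candidate.casefold() == normalized:
            return candidate
    return None


# Example values only; the real list would come from the collection's metadata.
EXAMPLE_SPEAKERS = ["Omar Khattab", "Bob van Luijt", "Etienne Dilocker"]
speaker = match_allowed_value("  omar khattab ", EXAMPLE_SPEAKERS)
is_valid = speaker is not None  # this boolean is what the assertion would check
```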