stanfordnlp / dspy

DSPy: The framework for programming—not prompting—language models
https://dspy.ai
MIT License
19.19k stars, 1.46k forks

Use of a Retriever with the attribute with_metadata leads to an unexpected behaviour. #1373

Open Gwenn-LR opened 3 months ago

Gwenn-LR commented 3 months ago

Description

While following the tutorial [02] Multi-Hop Question Answering, adapted to my setup (an LM served by a local Ollama server and an RM hosted locally via Chroma), I could not get interpretable scores because my database structure differs from yours. Since the tutorial gives no indication, I stored the title of each wiki page as metadata on each corresponding chunk, whereas (I think) you included it as part of the context. As a result, the metric that compares the gold_titles of the examples with the normalized_text from the context could not reach a satisfactory score as it did in your case. That's why I tried to add metadata to my Prediction, and that is where the issues appeared.

Package version

python: 3.10.12
dspy-ai: 2.4.13

Issue

First, when I call the Retriever with the attribute with_metadata set to True, it calls dsp.primitives.search.retrieveEnsemblewithMetadata, which in turn calls dsp.primitives.search.retrieveRerankEnsemblewithMetadata when dsp.settings has no reranker attribute. That function then raises AssertionError: Both RM and Reranker are needed to retrieve & re-rank., since there is no reranker, as was tested just before.
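The dispatch described above can be sketched as follows. This is a minimal illustration with hypothetical stand-ins (`settings`, `retrieve_ensemble_with_metadata`, `retrieve_rerank_ensemble_with_metadata`) that mirror the behaviour reported here; it is not the actual dsp code, and only the error message is taken from the traceback.

```python
from types import SimpleNamespace

# Hypothetical stand-in for dsp.settings: an RM is configured, no reranker.
settings = SimpleNamespace(rm=object(), reranker=None)


def retrieve_rerank_ensemble_with_metadata(queries, k):
    # Stand-in for retrieveRerankEnsemblewithMetadata: needs both components.
    if not (settings.rm and settings.reranker):
        raise AssertionError("Both RM and Reranker are needed to retrieve & re-rank.")
    return []


def retrieve_ensemble_with_metadata(queries, k):
    # Stand-in for retrieveEnsemblewithMetadata as described in this issue:
    # the rerank variant is chosen when NO reranker is set, so that call
    # can never succeed. The condition is presumably inverted.
    if not settings.reranker:
        return retrieve_rerank_ensemble_with_metadata(queries, k)
    return []


try:
    retrieve_ensemble_with_metadata(["some query"], k=3)
    error_message = None
except AssertionError as exc:
    error_message = str(exc)

print(error_message)
```

With only an RM configured, the sketch always ends in the assertion, which matches the behaviour I observe.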

Once this issue is solved, dsp.primitives.search.retrieveEnsemblewithMetadata calls dsp.primitives.search.retrieve when there is only one query (which is my case), and that function does not extract metadata at all. I don't think any metadata is extracted by any of the methods in dsp.primitives.search.

Finally, I've tried to fix at least the method for my case and defined my passages variable as a dictionary, as indicated in your code:

https://github.com/stanfordnlp/dspy/blob/af5186cf07ab0b95d5a12690d5f7f90f202bc86e/dspy/retrieve/retrieve.py#L93C1-L94C63

However, dspy.retrieve.retrieve.single_query_passage seems to be written for multiple passages, unlike what its name suggests. In my case it generates a Prediction whose passages attribute is a list within a list, which leads to an error when I try to clean my context + passages with dsp.utils.deduplicate (since a list can't be hashed).
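A small sketch of why the nested list breaks de-duplication. The `deduplicate` function below is a stand-in that I assume behaves like dsp.utils.deduplicate (order-preserving and set-based); the passage strings are made up for illustration.

```python
def deduplicate(seq):
    # Order-preserving, set-based de-duplication (stand-in, assumed to
    # behave like dsp.utils.deduplicate): every item must be hashable.
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]


context = ["passage A"]
# A list inside a list, as in the passages attribute of the Prediction.
nested_passages = [["passage A", "passage B"]]

try:
    deduplicate(context + nested_passages)
    error = None
except TypeError as exc:
    error = str(exc)

# Flattening one level first makes every item a hashable string again.
flat_passages = [p for group in nested_passages for p in group]
merged = deduplicate(context + flat_passages)
```

Flattening is only a workaround on the caller's side; the nesting itself would still need to be fixed where the Prediction is built.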

Possible solution

I'll open a PR to solve those issues. I think the first one is just a typo; the second should respect your syntax, but it might be tightly linked to the next issues I've faced, so I would like to know if you could help me solve those. Thank you for your attention to this matter.

TobiasGoerke commented 2 months ago

+1, thanks! Really surprising to see you can't return any kind of metadata out of your retriever to e.g. show sources to the user.

Gwenn-LR commented 1 week ago

3 months without any answer from the team. I've just received a notification that my PR has been closed without any further explanation, and I've checked: the issue does not seem to have been patched in the main branch.

jiange91 commented 19 hours ago

Facing the same issue here. I just want to know the probabilities and scores of the retrieved docs. The RM code is still not fixed in the latest release.