stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License
18.23k stars, 1.39k forks

ColBERTv2 retriever not fetching appropriate passage for an obvious query #15

Closed by abhinavkulkarni 1 year ago

abhinavkulkarni commented 1 year ago

Hi,

Thanks for this great project!

I was playing around with different prompts of my own within the DSP framework, and I am having trouble getting a correct answer to the following simple question:

Which team does the player named 2015 Diamond Head Classic’s MVP play for?

There is a Wikipedia page about the 2015 Diamond Head Classic (link). The phrase "2015 Diamond Head Classic" appears in the title as well as the abstract. The abstract also mentions "Buddy Hield" was named MVP.

However, the ColBERTv2 retriever does not return that Wikipedia page among its top 5 results. I checked the page's history: it was created in 2015, so it should have been present in the 2019 Wikipedia dump.

1st Hop

Write a search query that will help answer a complex question; if unsure, say "Not Found".
---
Follow the following format.

Question: «question to be answered»
Rationale: Let's think step by step. To answer this question, we first need to find out «the missing information»
Search Query: «a simple question for seeking the missing information»
---
Question: Which team does the player named 2015 Diamond Head Classic’s MVP play for?
Rationale: Let's think step by step. To answer this question, we first need to find out the player's name.
Search Query: "2015 Diamond Head Classic's MVP"

2nd Hop

Write a search query that will help answer a complex question; if unsure, say "Not Found".
---
Follow the following format:

Context: «sources that may contain relevant content»
Question: «question to be answered»
Rationale: Let's think step by step. Based on the context, we have learned the following. «information from context that provides useful clues»
Search Query: «a simple question for seeking remaining missing information»
---
Context:
«2015–16 Big Ten Conference men's basketball season | rankings Throughout the conference regular season, the Big Ten offices named one or two players of the week and one or two freshmen of the week each Monday. On November 17 in the Champions Classic, Denzel Valentine led Michigan State over Kansas by posting the first triple-double of the 2015–16 NCAA Division I men's basketball season with 29 points, 12 rebounds and 12 assists. On January 5, Diamond Stone was named national freshman of the week by the United States Basketball Writers Association. This table summarizes the head-to-head results between teams in conference play. Each team played 18 conference games,»
«Nelson Figueroa | Diamondbacks on December 21, 2012. He was released on April 26, 2013. Figueroa again signed with Taiwan's Uni-President 7-Eleven Lions in mid-2013. Figueroa has a brief but successful stint with the Lions in 2007, during which he was voted the MVP of Taiwan Series that year. On February 16, 2015, SNY announced that Figueroa would replace Bob Ojeda as the pre/post-game analyst for their Mets broadcasts. Figueroa played as a pitcher for the Puerto Rican national team in the 2013 World Baseball Classic where he won a silver medal. Following the conclusion of the tournament, which was won by Dominican»
«Lucas Dias | Lucas Dias Lucas Dias Silva (born July 6, 1995) is a Brazilian professional basketball player who currently plays for Franca of the Novo Basquete Brasil (NBB). Dias was named Jordan Brand Classic International MVP in 2012. On April 21, 2015, it was announced that he would enter the 2015 NBA draft. However, he withdrew from the draft before the draft withdrawal deadline. Dias began his pro career with the Brazilian NBB League club E.C. Pinheiros. He was named the Brazilian League Revelation Player of the 2015–16 season. In 2016, he moved to the Brazilian club Paulistano. Dias represented Brazil at»
«2015 McDonald's All-American Boys Game | Kelley of the Bullis School in Potomac, Maryland coached the East team, while Robert Smith of Chicago's Simeon Career Academy coached the West team. The East defeated the West by a 111–91 score. Cheick Diallo earned MVP of the game after posting 18 points and 10 rebounds, for the East team. Five East team players (Diallo, Antonio Blakeney, Diamond Stone, Dwayne Bacon, and Isaiah Briscoe) and four West team players (Allonzo Trier, Brandon Ingram, P. J. Dozier, and Ivan Rabb) reached double figures in scoring. 2015 McDonald's All-American Boys Game The 2015 McDonald's All-American Boys Game is an All-star basketball»
«2015 MVP Cup | 2015 MVP Cup The 2015 Manny V. Pangilinan Cup, also known as the Master Game Face MVP Cup 2015 due to sponsorship reasons, was an invitational basketball tournament which was participated by four teams from September 11–13, 2015 at the Smart Araneta Coliseum. While a similarly named tournament was held in 2010, the 2010 MVP Invitational Champions' Cup, the 2015 MVP Cup is considered the inaugural edition of the MVP Cup and is planned to be held annually. The tournament was a single-round robin format and the champions were awarded $25,000. China, South Korea and Senegal were invited to join»
Question: Which team does the player named 2015 Diamond Head Classic’s MVP play for?
Rationale: Let's think step by step. Based on the context, we have learned the following. We need to find a player named 2015 Diamond Head Classic's MVP and which team he plays for.
Search Query: "2015 Diamond Head Classic's MVP" team

The subsequent hops cannot find the answer as the appropriate passage is not retrieved in the 2nd hop.
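A quick way to confirm this for any hop is to compare the titles of the retrieved passages against the page you expect. The sketch below assumes passages arrive as plain strings in the `Title | text` form shown in the context above; `split_passage` and `title_retrieved` are hypothetical helper names, not part of DSP, and the passage texts are abbreviated.

```python
def split_passage(passage: str) -> tuple[str, str]:
    """Split a 'Title | text' passage string into (title, text)."""
    title, _, text = passage.partition(" | ")
    return title.strip(), text.strip()


def title_retrieved(passages: list[str], expected_title: str) -> bool:
    """Return True if any retrieved passage comes from the expected page."""
    wanted = expected_title.casefold()
    return any(split_passage(p)[0].casefold() == wanted for p in passages)


# Passages abbreviated from the second-hop context above.
top_k = [
    "2015–16 Big Ten Conference men's basketball season | rankings Throughout",
    "Nelson Figueroa | Diamondbacks on December 21, 2012.",
    "Lucas Dias | Lucas Dias Lucas Dias Silva (born July 6, 1995)",
    "2015 McDonald's All-American Boys Game | Kelley of the Bullis School",
    "2015 MVP Cup | 2015 MVP Cup The 2015 Manny V. Pangilinan Cup",
]

# None of the five titles is the page we are looking for.
print(title_retrieved(top_k, "2015 Diamond Head Classic"))  # False
```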

Thanks!

CC @okhat

okhat commented 1 year ago

Thanks for the report! I've looked into this. Basically, the page on "2015 Diamond Head Classic" (and also 2016, 2017, 2018) isn't in the downloaded corpus, possibly because the crawler/parser decided it's too short and removed it. It's the DPR wiki-100 corpus in case you'd like to directly use it.

In my experience, whenever such a direct query fails to find the document, 90% of the time it's just not in the index (or, a bit less likely, the passage splitting is unfavorable).
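If you have the corpus file locally, you can check for a missing page directly by scanning the titles. This is a minimal sketch that assumes a tab-separated file with a header row and columns named `id`, `text` and `title`; that layout and the `find_titles` helper are assumptions for illustration, so adjust them to whatever your copy of the corpus actually looks like.

```python
import csv
import io
from typing import Iterable, Iterator


def find_titles(lines: Iterable[str], needle: str) -> Iterator[tuple[str, str]]:
    """Yield (passage_id, title) for rows whose title contains `needle`.

    Assumes tab-separated input with a header row naming the columns
    'id', 'text' and 'title' (an assumption about the corpus layout).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    wanted = needle.casefold()
    for row in reader:
        if wanted in row["title"].casefold():
            yield row["id"], row["title"]


# Tiny in-memory stand-in for the real corpus file.
sample = io.StringIO(
    "id\ttext\ttitle\n"
    "1\tThe 2015 MVP Cup was an invitational tournament.\t2015 MVP Cup\n"
    "2\tLucas Dias is a Brazilian basketball player.\tLucas Dias\n"
)

# An empty list means no passage in the corpus carries that title.
print(list(find_titles(sample, "Diamond Head Classic")))  # []
```

For a real file, pass an open file handle instead of the `StringIO` object, for example `open(path, newline="", encoding="utf-8")`.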

okhat commented 1 year ago

Closing. But feel free to reopen if needed.

We're considering whether to host a 2023 Wikipedia index instead and to fix up some of the issues in the DPR corpus in it. Will this be helpful to you?

abhinavkulkarni commented 1 year ago

Hey @okhat,

A newer version of Wikipedia would undoubtedly help, but I am currently only trying out a few ideas; I could work with the 2019 corpus.

I couldn't find how to set up a remote ColBERTv2 server like the one used in the DSP notebook. All I could find in the ColBERTv2 README was the Python API. Could you please elaborate on that?

I am trying to set up a small ColBERTv2 server on a remote GPU-enabled machine and would like to query it from my laptop for experimentation.
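To make the request concrete, here is roughly the shape I have in mind, as a sketch using only the Python standard library. Everything here is an assumption for illustration: the `/search` path, the `query` and `k` parameters, the `topk` JSON field, and `fake_search`, which is a placeholder where a call to a real ColBERTv2 searcher would go. It is not the server used by the DSP notebook.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import urlopen


def fake_search(query: str, k: int) -> list[dict]:
    """Placeholder; swap in a call to a real ColBERTv2 searcher here."""
    return [{"rank": i + 1, "text": f"passage {i + 1} for {query!r}"} for i in range(k)]


def make_handler(search_fn):
    """Build a request handler that answers GET /search with JSON results."""

    class SearchHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            url = urlparse(self.path)
            params = parse_qs(url.query)
            if url.path != "/search" or "query" not in params:
                self.send_error(400, "expected /search?query=...&k=...")
                return
            k = int(params.get("k", ["5"])[0])
            body = json.dumps({"topk": search_fn(params["query"][0], k)}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):  # keep the console quiet
            pass

    return SearchHandler


def serve(search_fn, host="127.0.0.1", port=0):
    """Start the server in a background thread; port=0 picks a free port."""
    server = ThreadingHTTPServer((host, port), make_handler(search_fn))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_server(base_url: str, query: str, k: int = 5) -> list[dict]:
    """Client side: what the laptop would run."""
    with urlopen(f"{base_url}/search?{urlencode({'query': query, 'k': k})}") as resp:
        return json.load(resp)["topk"]


if __name__ == "__main__":
    server = serve(fake_search)
    url = f"http://127.0.0.1:{server.server_address[1]}"
    print(query_server(url, "2015 Diamond Head Classic's MVP", k=2))
    server.shutdown()
```

On the GPU machine the server would bind to a reachable host and a fixed port, and the laptop would call `query_server` with that address.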

Thanks!

okhat commented 1 year ago

This is actually a common request! A member of the team will merge a version soon. Could you paste the same request in a new issue? I'll forward it to him.

abhinavkulkarni commented 1 year ago

Thanks @okhat, I have added the issue here: https://github.com/stanford-futuredata/ColBERT/issues/173