stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Inaccurate retrieval when dspy.ColBERTv2RetrieverLocal is used with keyword argument load_only set to True. #1452

Open adishinde2110 opened 2 weeks ago

adishinde2110 commented 2 weeks ago

Case 1: I build an index with dspy.ColBERTv2RetrieverLocal and set it as my retrieval model. This works correctly. Code:

import dspy
from colbert.infra import ColBERTConfig

colbert_config = ColBERTConfig(
    checkpoint='colbert-ir/colbertv2.0',
    index_name='dude_batch_1',
    experiment='my_db',
    nranks=1
)
colbertv2 = dspy.ColBERTv2RetrieverLocal(passages=context_list, colbert_config=colbert_config)  # load_only defaults to False
turbo = dspy.OpenAI(model='gpt-4o-mini')
dspy.settings.configure(lm=turbo, rm=colbertv2)

Case 2: For later experiments with the same colbert_config and the already indexed data, I pass the load_only=True keyword argument and set the result as my retrieval model. Retrieval is inaccurate. Code:

colbert_config = ColBERTConfig(
    checkpoint='colbert-ir/colbertv2.0',
    index_name='dude_batch_1',
    experiment='my_db',
    nranks=1
)
colbertv2 = dspy.ColBERTv2RetrieverLocal(passages=context_list, colbert_config=colbert_config, load_only=True)  # only change from Case 1
turbo = dspy.OpenAI(model='gpt-4o-mini')
dspy.settings.configure(lm=turbo, rm=colbertv2)

But this gives me inaccurate retrieval on the same data.
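One thing that can be ruled out cheaply is whether context_list is really identical (same passages, same order) in both runs. The sketch below is only a sanity check under that assumption; `corpus_fingerprint` is a hypothetical helper, not part of dspy or ColBERT, and nothing in this thread confirms that passage order is the cause.

```python
import hashlib

def corpus_fingerprint(passages):
    """Order-sensitive SHA-256 fingerprint of a list of passage strings.

    Hypothetical helper for debugging; not part of dspy or ColBERT.
    """
    digest = hashlib.sha256()
    for passage in passages:
        encoded = passage.encode("utf-8")
        # Prefix each passage with its byte length so that passage
        # boundaries cannot be confused with passage content.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()

# Placeholder passages standing in for context_list.
indexed_run = ["passage A", "passage B", "passage C"]
reloaded_run = ["passage B", "passage A", "passage C"]  # same set, different order

same_corpus = corpus_fingerprint(indexed_run) == corpus_fingerprint(reloaded_run)
print("Same passages in the same order:", same_corpus)
```

If the fingerprints recorded in the indexing run and in the load_only=True run match, a difference in the input list can be excluded as an explanation.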

arnavsinghvi11 commented 8 hours ago

Hi @adishinde2110, can you give some more details on what "inaccurate retrieval" means here? Does load_only=True fail to identify the correct index?

adishinde2110 commented 6 hours ago

The passages (chunks) retrieved in the two cases are different, and they seem inaccurate in Case 2, where I use load_only=True.

Below is the code where I set and use the retrieval model:

turbo = dspy.OpenAI(model='gpt-4o-mini')
dspy.settings.configure(lm=turbo, rm=colbertv2_dude)
# Define the Retrieve module.
retriever = dspy.Retrieve(k=3)
query='Who is the mother of the director of film Polish-Russian War (Film)?'
# Call the retriever on a particular query.
topK_passages = retriever(query).passages
print(f"Top {retriever.k} passages for question: {query} \n", '-' * 30, '\n')
for idx, passage in enumerate(topK_passages):
    print(f'{idx+1}]', passage, '\n')

Outputs for both cases:

Case 1: the index is built with dspy.ColBERTv2RetrieverLocal and set as my retrieval model (working correctly).

 Top 3 passages for question: Who is the mother of the director of film Polish-Russian War (Film)? 
 ------------------------------ 
1] Polish-Russian War (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska. 
2] Eldon Howard was a British screenwriter. She was the mother- in- law of Edward J. Danziger and wrote a number of the screenplays for films by his company Danziger Productions.
3] Xawery Żuławski (born 22 December 1971 in Warsaw) is a Polish film director. In 1995 he graduated National Film School in Łódź. He is the son of actress Małgorzata Braunek and director Andrzej Żuławski. His second feature "Wojna polsko-ruska" (2009), adapted from the controversial best-selling novel by Dorota Masłowska, won First Prize in the New Polish Films competition at the 9th Era New Horizons Film Festival in Wrocław. In 2013, he stated he intends to direct a Polish novel "Zły" by Leopold Tyrmand. Żuławski and his wife Maria Strzelecka had 2 children together: son Kaj Żuławski (born 2002) and daughter Jagna Żuławska (born 2009).

Case 2: the same colbert_config and indexed data, loaded with load_only=True and set as my retrieval model (inaccurate retrieval).

Top 3 passages for question: Who is the mother of the director of film Polish-Russian War (Film)? 
 ------------------------------ 
1] Beulah Anne Georges( May 10, 1923 – January 4, 2005) was a member of three women ’s professional baseball teams in the 1940s. 
2] Carl August Hugo Froelich ( 5 September 1875 – 12 February 1953) was a German film pioneer and film director. He was born and died in Berlin. 
3] Princess Irene of Greece and Denmark( 13 February 1904 – 15 April 1974) was the fifth child and second daughter of Constantine I of Greece and his wife, the former Princess Sophie of Prussia. She was a member of the royal families of Greece and Italy. From 1941 to 1943 she was also officially Queen Consort of Croatia. 

The passages retrieved in Case 2 differ from those in Case 1 and seem irrelevant to the input query.
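To put a number on "different", the two top-k lists can be compared directly. This is a minimal sketch, assuming the passages from each case have been collected into plain Python lists of strings; `compare_rankings` is a hypothetical helper and the passages are shortened placeholders, not the full outputs above.

```python
def compare_rankings(built, loaded):
    """Summarise how two ranked passage lists differ.

    Hypothetical helper for debugging; not part of dspy.
    `built`  : passages returned when the index was built in this run.
    `loaded` : passages returned when the index was loaded with load_only=True.
    """
    built_set, loaded_set = set(built), set(loaded)
    shared = built_set & loaded_set
    union = built_set | loaded_set
    return {
        "shared": sorted(shared),
        "only_built": [p for p in built if p not in loaded_set],
        "only_loaded": [p for p in loaded if p not in built_set],
        # Jaccard overlap: 1.0 means identical sets, 0.0 means disjoint.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

# Shortened placeholders for the top-3 passages of each case.
case_1 = ["Polish-Russian War ...", "Eldon Howard ...", "Xawery Żuławski ..."]
case_2 = ["Beulah Anne Georges ...", "Carl August Hugo Froelich ...", "Princess Irene ..."]

report = compare_rankings(case_1, case_2)
print("Jaccard overlap:", report["jaccard"])
print("Only in Case 1:", report["only_built"])
print("Only in Case 2:", report["only_loaded"])
```

For the outputs posted above, the two lists share no passages, so the overlap is zero. Repeating the comparison over several queries would show whether the mismatch is systematic.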