terrierteam / pyterrier_colbert


different scorer performs differently #61

[Open] Xiao0728 opened this issue 1 year ago

Xiao0728 commented 1 year ago

The different scorers (factory.scorer(), factory.text_scorer() and factory.index_scorer()) generate different embeddings in the re-ranking scenario, causing performance differences in terms of nDCG@10 and MAP@1k. Pipelines tested:

import pyterrier as pt
from pyterrier.measures import *

# pipe1: encode the query, retrieve with BM25, fetch passage text,
# encode the text, then score with the generic scorer
pipe1 = (factory.query_encoder()
         >> bm25_terrier_stemmed_text
         >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
         >> factory.text_encoder()
         >> factory.scorer())

# pipe2: retrieve with BM25, fetch passage text, score directly from text
pipe2 = (bm25_terrier_stemmed_text
         >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
         >> factory.text_scorer())

# pipe3: retrieve with BM25, score using the embeddings stored in the index
pipe3 = (bm25_terrier_stemmed_text >> factory.index_scorer())

df = pt.Experiment(
    [pipe1, pipe2, pipe3],
    topics2019,
    qrels2019,
    batch_size=10,
    verbose=True,
    filter_by_qrels=True,
    eval_metrics=[nDCG@10, RR(rel=2)@10, AP(rel=2)@1000, R(rel=2)@1000],
    names=["pipe1", "pipe2", "pipe3"]
)
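For reference, a minimal sketch of the setup the snippet above assumes; the checkpoint path, index paths and dataset identifiers are illustrative, not the ones actually used here:

# Assumed setup (illustrative paths and dataset variants)
import pyterrier as pt
from pyterrier_colbert.ranking import ColBERTFactory

factory = ColBERTFactory("/path/to/colbert.dnn", "/path/to/index_root", "index_name")

# BM25 over an MSMARCO passage index
bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset(
    "msmarco_passage", "terrier_stemmed_text", wmodel="BM25")

# TREC DL 2019 topics and qrels
dataset = pt.get_dataset("irds:msmarco-passage/trec-dl-2019/judged")
topics2019 = dataset.get_topics()
qrels2019 = dataset.get_qrels()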
[image: pt.Experiment results table comparing pipe1, pipe2 and pipe3]
seanmacavaney commented 1 year ago

To some extent, these types of differences are commonplace with anything neural -- there's often some floating point error introduced on GPUs due to the order in which operations are executed.
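A quick toy sketch (unrelated to ColBERT itself) that shows the order-of-operations effect:

import numpy as np

# float32 addition is not associative: summing the same values in a
# different order can change the result in the last few bits.
x = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)
print(x.sum())           # one accumulation order
print(x[::-1].sum())     # reversed order; may differ slightly
print(np.sort(x).sum())  # sorted order; may differ again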

The embeddings in the index may be slightly different from those computed on the fly. For instance, there could be a different amount of padding in each case, causing operations to execute in a different order. Or the embeddings could be in a different position within the batch.

You could check the document embeddings in each case to see if they match.

(Since queries in ColBERT are always padded to the same length, I suspect it's due to differences in the query embeddings.)
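One way to do that comparison, as a sketch: obtain a document's (or query's) embeddings from the index and recompute them from the raw text, then diff the two tensors. The helper below assumes you already have both as (num_tokens, dim) float tensors; emb_fly and emb_idx are hypothetical names, not something pyterrier_colbert exposes directly:

import torch

def compare_embeddings(emb_fly: torch.Tensor, emb_idx: torch.Tensor):
    # emb_fly: embeddings computed on the fly (hypothetically via the text encoder)
    # emb_idx: the corresponding embeddings stored in the index
    # Both assumed to be (num_tokens, dim) tensors for the same document/query.
    diff = (emb_fly.float() - emb_idx.float()).abs()
    print("max abs diff: ", diff.max().item())
    print("mean abs diff:", diff.mean().item())
    # Per-token cosine similarity; values very close to 1.0 mean the
    # embeddings match up to floating point noise.
    cos = torch.nn.functional.cosine_similarity(
        emb_fly.float(), emb_idx.float(), dim=1)
    print("min cosine:   ", cos.min().item())

Large per-token differences would point at a real discrepancy between the scorers rather than mere floating point noise.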

cmacdonald commented 1 year ago

Padding is one possibility we were looking at. We're simply recording it here so we can check back on it.