terrierteam / pyterrier_colbert

Can I include metadata in rerank method? #55

Open Zhou-Xun opened 1 year ago

Zhou-Xun commented 1 year ago

Hello,

I'm trying to use colbert.text_scorer() to rerank in a pipeline, but there seems to be no option to include metadata, and the output of colbert.text_scorer() contains only the docno.

Therefore, even though my pipeline below already retrieves the text during the BM25 stage, I still need to go back to my data to look up the text for each docno.

import pyterrier as pt
import pyterrier_colbert.ranking

colbert = pyterrier_colbert.ranking.ColBERTModelOnlyFactory(checkpoint)
pipe = (pt.BatchRetrieve(index, wmodel="BM25", metadata=["docno", "text"])
        >> pt.text.sliding(text_attr='text', length=128, stride=64, prepend_attr=None)
        >> colbert.text_scorer()
        >> pt.text.max_passage())

cmacdonald commented 1 year ago

Thanks Xun for the report.

A workaround would be

def score_and_save_text(df):
    # Score the passages, then merge the text column back in on qid and docno
    inner = colbert.text_scorer().transform(df)
    return df[['qid', 'docno', 'text']].merge(inner, on=['qid', 'docno'])

Then replace colbert.text_scorer() in the pipeline with pt.apply.generic(score_and_save_text).
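
To see what the merge step does without loading ColBERT, here is a minimal pandas-only sketch. The fake_scorer function is a hypothetical stand-in, not part of pyterrier_colbert: it assumes the scorer returns only qid, docno and score (as described in the report) and uses text length as a dummy score. The passage docnos and texts are made up for illustration.

```python
import pandas as pd

def fake_scorer(df):
    # Hypothetical stand-in for the ColBERT scorer: keeps only qid and docno,
    # and adds a dummy score (the character length of the text).
    out = df[["qid", "docno"]].copy()
    out["score"] = df["text"].str.len().astype(float)
    return out

def score_and_keep_text(df):
    # Score the passages, then merge the text back in on (qid, docno).
    inner = fake_scorer(df)
    return df[["qid", "docno", "text"]].merge(inner, on=["qid", "docno"])

# Made-up passages, shaped like the output of a sliding-window step.
passages = pd.DataFrame({
    "qid": ["q1", "q1", "q2"],
    "docno": ["d1%p0", "d1%p1", "d2%p0"],
    "query": ["first query", "first query", "second query"],
    "text": ["short", "a longer passage", "mid size"],
})

result = score_and_keep_text(passages)
print(result)
```

The merge relies on (qid, docno) being unique per row, which should hold for passages as long as each passage has its own docno. Any other columns needed downstream would have to be added to the selected column list in the same way as text.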

I'll have a think about how to change the implementation to fix the underlying issue.