terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

Metrics are all zero with webis-touche2020 dataset #409

Closed Jia-py closed 11 months ago

Jia-py commented 11 months ago

Describe the bug

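import pyterrier as pt
pt.init(version=5.7, helper_version="0.0.7")
dataset = pt.get_dataset('irds:beir/webis-touche2020/v2')
# `index` is a Terrier index built beforehand from this dataset (indexing code not shown in the report)
retriever = pt.BatchRetrieve(index, wmodel='BM25', metadata=['docno','text'])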

All metrics are zero when using BM25 on the webis-touche2020 dataset. The same code works well for other datasets, such as beir/TREC-COVID.

To Reproduce

Steps to reproduce the behavior:

  1. Which index - irds:beir/webis-touche2020/v2
  2. Which retrieval - bm25 with default parameters
  3. What pipeline - just the BM25 retriever
  4. What was the dataframe output - it contains the retrieved results, but all metrics are zero.

Jia-py commented 11 months ago

[image]

seanmacavaney commented 11 months ago

Hi @Jia-py -- thanks for reporting.

Terrier indexes have a maximum length for the fields that they store, which includes the docno. The default of 20 is often enough, but some datasets (such as touche) have longer docnos.

To change the maximum length, you'll need to set meta={"docno": 39} when indexing, as follows (the maximum docno is 39 characters in the dataset):

indexer = pt.IterDictIndexer('./indices/beir_webis-touche2020_v2', meta={"docno": 39})
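
As a rough illustration of why the stored length matters, here is a minimal pure-Python sketch. It assumes that a docno longer than the stored length is cut to that length, so it no longer matches the identifier in the qrels. The helpers `max_docno_length` and `matching_docnos` and the toy documents are hypothetical and not part of PyTerrier.

```python
from typing import Dict, Iterable, List, Set


def max_docno_length(docs: Iterable[Dict[str, str]]) -> int:
    """Return the length of the longest docno in a collection of documents."""
    return max(len(doc["docno"]) for doc in docs)


def matching_docnos(docs: Iterable[Dict[str, str]],
                    qrel_docnos: Set[str],
                    stored_len: int) -> Set[str]:
    """Docnos that still match the qrels after being cut to stored_len characters.

    Assumption: over-long docnos are truncated to the stored length.
    """
    return {doc["docno"][:stored_len] for doc in docs} & qrel_docnos


# Toy data (hypothetical): one short docno and one longer than 20 characters.
toy_docs: List[Dict[str, str]] = [
    {"docno": "short-id-1", "text": "first document"},
    {"docno": "a-very-long-document-identifier-0001", "text": "second document"},
]
toy_qrels = {doc["docno"] for doc in toy_docs}

needed = max_docno_length(toy_docs)
print("longest docno:", needed)
print("matches with length 20:", matching_docnos(toy_docs, toy_qrels, 20))
print("matches with full length:", matching_docnos(toy_docs, toy_qrels, needed))
```

With PyTerrier, the same helper could presumably be applied to `dataset.get_corpus_iter()` and the result passed as `meta={"docno": ...}`, at the cost of one extra pass over the corpus.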

I hope this helps!

Jia-py commented 11 months ago

Hi @seanmacavaney, thanks for your time and help! It works well now.

seanmacavaney commented 11 months ago

No problem, happy to help :)