terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/
Mozilla Public License 2.0

Tuning BM25F parameters #397

Closed Watheq9 closed 8 months ago

Watheq9 commented 10 months ago

Hi @cmacdonald,

I was trying to tune the BM25F parameters. Per the documentation, BM25F is implemented as described by [Zaragoza TREC-2004]. In Zaragoza's paper, there are 'b' and 'w' parameters per field and one global 'k' parameter. My questions are as follows:

  1. I figured out that the 'b' parameter is actually named 'c' in Terrier, and that 'w' corresponds to 'w.i', where i is the field number (starting from 0). Is this mapping correct?

     b = 1
     bm25f = pt.BatchRetrieve(index, wmodel='BM25F',
                              controls={'w.0': 1.0, 'w.1': 0.5,
                                        'c.0': b, 'c.1': b},
                              verbose=True)

  2. For the 'k1' parameter, I could not find the corresponding name. Could you please let me know what it is?

Just copying my supervisor Dr. @JMMackenzie

cmacdonald commented 10 months ago

(1) Yes, this looks right. (2) I don't think we have ever tuned k1 in BM25F. 6 parameters was always enough!
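
A minimal sketch of the mapping confirmed above, assuming the lists are given in the same order as the fields in the index. The helper name `bm25f_controls` is hypothetical and not part of PyTerrier; it only builds the `controls` dict in plain Python.

```python
def bm25f_controls(weights, norms):
    """Build a controls dict for BM25F.

    'w.i' is the weight of field i and 'c.i' is its length
    normalisation (the 'b' of Zaragoza's paper), with i starting at 0.
    """
    if len(weights) != len(norms):
        raise ValueError("need one weight and one normalisation value per field")
    controls = {}
    for i, (w, c) in enumerate(zip(weights, norms)):
        controls[f"w.{i}"] = w
        controls[f"c.{i}"] = c
    return controls


# Same setting as in the question: two fields, b = 1 for both.
controls = bm25f_controls(weights=[1.0, 0.5], norms=[1.0, 1.0])

# Usage with PyTerrier (needs an index built with fields):
# bm25f = pt.BatchRetrieve(index, wmodel='BM25F', controls=controls)
```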

Watheq9 commented 10 months ago

Thanks @cmacdonald, for your reply! What are the 6 parameters?

cmacdonald commented 10 months ago

> What are the 6 parameters?

The normalisation value, i.e. b (called c in Terrier), and the weight for each field.
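
As a rough sketch of what tuning these parameters could look like, the snippet below enumerates every combination of per-field weight and normalisation values. Three fields is an assumption read off the "6 parameters" remark (one weight and one normalisation value per field). The function name `bm25f_grid` and the candidate values are hypothetical.

```python
from itertools import product


def bm25f_grid(n_fields, weight_values, norm_values):
    """Yield one controls dict per combination of per-field settings."""
    names = [f"w.{i}" for i in range(n_fields)] + [f"c.{i}" for i in range(n_fields)]
    value_lists = [weight_values] * n_fields + [norm_values] * n_fields
    for combo in product(*value_lists):
        yield dict(zip(names, combo))


# Assumed setup: 3 fields, 2 candidate values per parameter.
grid = list(bm25f_grid(3, weight_values=[0.5, 1.0], norm_values=[0.25, 0.75]))

# Each entry can be passed as `controls` to a BM25F retriever and evaluated
# on held-out topics; the grid grows exponentially with the number of fields.
```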

JMMackenzie commented 10 months ago

Hey Craig, thanks for the help!

Just double checking - does this mean your (Terrier) BM25F doesn't include k? Or it's just not exposed?

cmacdonald commented 10 months ago

not exposed in BM25F, while it is in BM25.

See https://github.com/terrier-org/terrier-core/blob/5.x/modules/core/src/main/java/org/terrier/matching/models/BM25.java#L45 vs https://github.com/terrier-org/terrier-core/blob/5.x/modules/core/src/main/java/org/terrier/matching/models/basicmodel/BM.java#L48
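
To make the role of k concrete, here is a sketch of the per-term BM25F score as formulated in Zaragoza's paper: term frequencies are normalised per field, combined with the field weights, and only then saturated with a single global k1. This follows the paper's formulation, not Terrier's source code, and the function and argument names are hypothetical.

```python
def bm25f_term_score(tfs, lengths, avg_lengths, weights, norms, k1, idf):
    """Score one query term for one document across all fields.

    tfs[i], lengths[i]: term frequency and length of field i in the document
    avg_lengths[i]:     average length of field i in the collection
    weights[i]:         field weight ('w.i' in Terrier)
    norms[i]:           field normalisation b ('c.i' in Terrier)
    k1:                 global saturation parameter (not exposed in Terrier's BM25F)
    """
    pseudo_tf = 0.0
    for tf, length, avg_len, w, b in zip(tfs, lengths, avg_lengths, weights, norms):
        # Per-field length normalisation, then weighted sum over fields.
        pseudo_tf += w * tf / (1.0 + b * (length / avg_len - 1.0))
    # A single saturation step shared by all fields.
    return idf * pseudo_tf / (k1 + pseudo_tf)


# Two fields whose lengths equal the collection averages, so the
# normalisation term is 1 and pseudo_tf = 1.0 * 2 + 0.5 * 4 = 4.0.
score = bm25f_term_score(tfs=[2, 4], lengths=[10, 100], avg_lengths=[10, 100],
                         weights=[1.0, 0.5], norms=[1.0, 1.0], k1=1.0, idf=1.0)
```

Because k1 is applied after the fields are combined, it is one parameter for the whole model rather than one per field.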

Watheq9 commented 10 months ago

Pardon @cmacdonald, but what are the normalization parameters which are exposed in pyterrier, other than 'c'? I tried to set 'b' and 'b.0' to multiple values, but none of them changed anything in the performance. If I am not mistaken, the exposed parameters are just 'c' and the weight for each field. Please correct me if I am wrong.
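
One way to catch names such as 'b' or 'b.0' before running an experiment is to check the keys of the `controls` dict. The sketch below assumes, based only on this thread, that 'w.i' and 'c.i' are the names BM25F reads; the helper `unexpected_controls` is hypothetical, and a real pipeline may legitimately use other controls.

```python
import re

# Assumption from this thread: BM25F reads only 'w.<field>' and 'c.<field>'.
_BM25F_CONTROL = re.compile(r"^(w|c)\.\d+$")


def unexpected_controls(controls):
    """Return the control names that do not look like 'w.i' or 'c.i'."""
    return sorted(name for name in controls if not _BM25F_CONTROL.match(name))


suspicious = unexpected_controls({'w.0': 1.0, 'c.0': 0.5, 'b': 0.75, 'b.0': 0.75})
if suspicious:
    print("Controls that BM25F may not use:", suspicious)
```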

cmacdonald commented 9 months ago

> If I am not mistaken, the exposed parameters are just 'c' and the weight for each field. Please correct me if I am wrong.

I'm not sure I follow the question. For BM25F, this is correct, right...?

bm25f = pt.BatchRetrieve(index, wmodel='BM25F',
                         controls={'w.0': 1.0, 'w.1': 0.5,
                                   'c.0': b, 'c.1': b},
                         verbose=True)

cmacdonald commented 9 months ago

Any update guys, or can I close the issue?

JMMackenzie commented 8 months ago

I think we've got it figured out now, thanks for the help! We'll get back to you if we need to re-open.