How to use Dirichlet smoothing parameter?

terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/

https://pyterrier.readthedocs.io/

Mozilla Public License 2.0


How to use Dirichlet smoothing parameter? #443

Open eyasu11321238a opened 2 months ago

eyasu11321238a commented 2 months ago

I am conducting an IR experiment using the Dirichlet model, and I need help improving the result using a smoothing parameter.

Dirichlet = pt.BatchRetrieve(index_path, wmodel="DirichletLM", c=2000) #controls={"mu": 2000}

I set the smoothing parameter as shown above, but the results are still the same.

seanmacavaney commented 2 months ago

This should do the trick!

pt.BatchRetrieve(index_path, wmodel="DirichletLM", controls={'dirichletlm.mu': 2000})

We know the control names are not particularly well-documented, but it's something we have an open issue for :-) https://github.com/terrier-org/terrier-core/issues/197
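For readers wondering why the results should change at all once the control is picked up, here is a minimal sketch of the textbook Dirichlet-smoothed term probability, (tf + mu * p(t|C)) / (|d| + mu). It is not Terrier's actual DirichletLM code, and the function name and all statistics below are hypothetical values chosen for illustration.

```python
import math


def dirichlet_log_prob(tf, doc_len, coll_tf, coll_len, mu):
    """Log of the Dirichlet-smoothed probability of one term in one document."""
    p_coll = coll_tf / coll_len  # background (collection) probability of the term
    return math.log((tf + mu * p_coll) / (doc_len + mu))


# Hypothetical statistics: the term occurs 1,000 times in a 1,000,000-token collection.
COLL_TF, COLL_LEN = 1_000, 1_000_000
SHORT_DOC = {"tf": 2, "doc_len": 50}   # short document, few occurrences
LONG_DOC = {"tf": 5, "doc_len": 400}   # long document, more occurrences


def compare(mu):
    """Return (short_doc_score, long_doc_score) for a given mu."""
    short = dirichlet_log_prob(coll_tf=COLL_TF, coll_len=COLL_LEN, mu=mu, **SHORT_DOC)
    long_ = dirichlet_log_prob(coll_tf=COLL_TF, coll_len=COLL_LEN, mu=mu, **LONG_DOC)
    return short, long_


for mu in (10, 2000):
    short, long_ = compare(mu)
    winner = "short doc" if short > long_ else "long doc"
    print(f"mu={mu}: short={short:.3f} long={long_:.3f} -> {winner} ranks first")
```

With these made-up numbers, a small mu favours the short document and a large mu favours the long one. So if changing mu leaves a run completely unchanged, that is a hint the parameter is not reaching the weighting model.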

hscells commented 2 months ago

Hey Sean, is it also possible to change the query term probability function in the DirichletLM weighting model? We'd like to study different smoothing strategies.

cmacdonald commented 2 months ago

There is a very easy way of writing your own weighting model: https://pyterrier.readthedocs.io/en/latest/terrier-retrieval.html#custom-weighting-models, where you pass a lambda function to the BatchRetrieve constructor.

However, it's very slow (it has to cross the JNI boundary for every posting scored).

Craig
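As an illustration of the kind of function meant here, below is a sketch of a Jelinek-Mercer smoothed score in its rank-equivalent form, written as a plain Python function. It assumes the four-argument signature (keyFreq, posting, entryStats, collStats) described on the linked documentation page and the usual Terrier getters on those objects. The name jm_score, the LAMBDA value, and the stand-in objects are hypothetical and exist only so the snippet runs without an index.

```python
import math
from types import SimpleNamespace

LAMBDA = 0.2  # hypothetical weight given to the collection model


def jm_score(keyFreq, posting, entryStats, collStats):
    """Rank-equivalent Jelinek-Mercer score for one (term, document) posting."""
    p_doc = posting.getFrequency() / posting.getDocumentLength()
    p_coll = entryStats.getFrequency() / collStats.getNumberOfTokens()
    # log(1 + ((1 - lambda) * p_doc) / (lambda * p_coll)) is zero when tf is zero,
    # so only matching postings contribute to the document score.
    return keyFreq * math.log(1 + ((1 - LAMBDA) * p_doc) / (LAMBDA * p_coll))


def fake_posting(tf, doc_len):
    """Stand-in for a Terrier posting, for demonstration only."""
    return SimpleNamespace(getFrequency=lambda: tf, getDocumentLength=lambda: doc_len)


# Stand-ins for the term's lexicon entry and the collection statistics.
entry = SimpleNamespace(getFrequency=lambda: 1_000)
coll = SimpleNamespace(getNumberOfTokens=lambda: 1_000_000)

print(jm_score(1.0, fake_posting(3, 100), entry, coll))
print(jm_score(1.0, fake_posting(1, 100), entry, coll))
```

With a real index, the function itself would be passed as the wmodel argument, e.g. pt.BatchRetrieve(index, wmodel=jm_score), and the JNI overhead mentioned above applies to every call.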

hscells commented 2 months ago

That looks great, thank you!

cmacdonald commented 2 months ago

Otherwise, if you can compile your own weighting model in Java, you can add it to the classpath.

For instance, there is a BM25_log10_nonum weighting model in https://github.com/terrierteam/terrier-ciff. It can be used directly like this:

import pyterrier as pt

# load the terrier-ciff package from JitPack when starting Terrier
pt.init(packages=["com.github.terrierteam:terrier-ciff:-SNAPSHOT"])
br = pt.BatchRetrieve(index, wmodel="BM25_log10_nonum")
# or, if the fully qualified name was different
# br = pt.BatchRetrieve(index, wmodel="org.terrier.matching.models.BM25_log10_nonum")

(where com.github is an automatic GitHub-to-Maven gateway provided by JitPack).

Alternatively, if you mvn install the package locally, it would be available with pt.init(packages=['org.terrier:terrier-ciff:0.2']). The JitPack integration is just handy for importing something from GitHub without needing to formally release it to Maven.

Craig