hltcoe / patapsco

Cross language information retrieval pipeline

Logging RM3 expansion terms in Patapsco #50

Open cramraj8 opened 1 year ago

cramraj8 commented 1 year ago

@dlawrie @cash, I am trying to find a way to inspect RM3's expansion terms for each topic. When I set rm3_logging: true, the command line shows the original query and the new query, but they are not logged to any file on disk. How can I save them for post-processing of the expansion terms?

I am also looking at how to inject my own expansion terms into the RM3 framework, bypassing its default term selection. It looks like Patapsco uses Pyserini's RM3, so I would have to make changes to Pyserini. Can you confirm whether that is the case, or does Patapsco already have a configuration option for that?

cash commented 1 year ago

@cramraj8 It looks like it is only logging those to stdout right now. With rm3 on and rm3_logging on, you should see something like this:

2022-09-13 14:47:08,299 INFO  [main] lib.Rm3Reranker (Rm3Reranker.java:100) - Original Query: (dissid)^1.0 (polit)^1.0 (prison)^1.0
2022-09-13 14:47:08,300 INFO  [main] lib.Rm3Reranker (Rm3Reranker.java:101) - Running new query: (kabila)^0.044330347 (dissid)^0.16666667 (convict)^0.044330347 (drc)^0.044330347 (peac)^0.039642878 (polit)^0.16666667 (amnesti)^0.07388391 (ceasefi)

Looking through my comments, I think I intended to log it to a file, but maybe didn't figure out how to get the expanded queries back from Pyserini. I think the RM3 implementation might be in Java (Anserini, which is wrapped by Pyserini). I did play around with making a custom Anserini that was called by Patapsco, but didn't pursue it. You could implement your own RM3 in Python. I'm assuming you want to weight the expanded terms based on the initial docs in the result set rather than pulling expanded terms from some other source. (I used the latter approach for another experiment, but did it as a preprocessing step and updated the queries using the Lucene syntax with weights.) Sorry, I'm rambling now. Let me know any specific questions you have on this.
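
Until the expanded queries are written to a file, one possible workaround is to redirect Patapsco's stdout to a file and parse the Rm3Reranker lines afterwards. The following is a minimal sketch that assumes the log format shown above; `parse_rm3_log` and `parse_terms` are hypothetical helper names, not part of Patapsco or Pyserini. The log lines carry no topic id, so results are returned in the order the queries were run.

```python
import re

# Matches one "(term)^weight" pair as printed in the Rm3Reranker log lines.
# Terms without a weight (e.g. a truncated final term) are skipped.
TERM_RE = re.compile(r"\(([^()]+)\)\^([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")
ORIG_MARK = "Original Query:"
NEW_MARK = "Running new query:"


def parse_terms(text):
    """Return {term: weight} for every "(term)^weight" pair in text."""
    return {term: float(weight) for term, weight in TERM_RE.findall(text)}


def parse_rm3_log(lines):
    """Pair each 'Original Query' line with the next 'Running new query' line."""
    results = []
    original = None
    for line in lines:
        if ORIG_MARK in line:
            # Only parse what follows the marker, not the logger prefix.
            original = parse_terms(line.split(ORIG_MARK, 1)[1])
        elif NEW_MARK in line and original is not None:
            expanded = parse_terms(line.split(NEW_MARK, 1)[1])
            results.append({"original": original, "expanded": expanded})
            original = None
    return results
```

With the output saved to a file (the file name here is an assumption), `parse_rm3_log(open("patapsco_stdout.log"))` returns one dictionary per query; terms that appear in `expanded` but not in `original` are the expansion terms.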

cramraj8 commented 1 year ago

Got it, thanks for the information. @cash "pulling expanded terms from some other source" is exactly what I want to achieve. Since you have already done that, can you share some guidance, references, or code snippets?

cash commented 1 year ago

@cramraj8 Take a look at this config: samples/configs/eng_lucene_boolean_queries.yml. It loads queries from this file: https://github.com/hltcoe/patapsco/blob/master/samples/data/eng_mini_lucene_queries.jsonl

We built this so that someone could externally preprocess the queries in arbitrary ways and include specific weights on terms using the default Lucene syntax. I don't think it supports PSQ. Were you expecting that?
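
As a rough illustration of that preprocessing step, the sketch below builds a weighted query string in Lucene's `term^boost` syntax from the original query terms plus externally sourced expansion terms. The RM3-style interpolation (half the weight on the original terms, half on the normalized expansion terms) is an assumption chosen to resemble the log output earlier in this thread, and `to_lucene_query` and `escape_term` are hypothetical helpers. The resulting string would go into the query field of the JSONL file; check the linked sample file for the exact record layout, which is not reproduced here.

```python
import re

# Characters (and the && / || operators) that have special meaning in
# Lucene's classic query parser and need a backslash escape inside a term.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_term(term):
    """Backslash-escape Lucene query syntax characters in a single term."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)


def to_lucene_query(original_terms, expansion_weights, original_weight=0.5):
    """Interpolate original and expansion terms into one boosted query string.

    original_terms: list of (already analyzed) query terms.
    expansion_weights: {term: raw weight} from any external source.
    original_weight: share of total weight given to the original terms.
    """
    combined = {}
    if original_terms:
        share = original_weight / len(original_terms)
        for term in original_terms:
            combined[term] = combined.get(term, 0.0) + share
    total = sum(expansion_weights.values())
    if total > 0:
        for term, weight in expansion_weights.items():
            boost = (1.0 - original_weight) * weight / total
            # A term present in both lists gets the sum of both shares.
            combined[term] = combined.get(term, 0.0) + boost
    return " ".join(f"{escape_term(t)}^{w:.4f}" for t, w in combined.items())
```

For example, `to_lucene_query(["dissid", "polit", "prison"], {"amnesti": 3.0, "kabila": 1.0})` yields `dissid^0.1667 polit^0.1667 prison^0.1667 amnesti^0.3750 kabila^0.1250`. The terms should already match the index's analysis (stemming, lowercasing), since the boosted terms are matched against the indexed tokens.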