joaopalotti / trectools

A simple toolkit to process TREC files in Python.
https://pypi.python.org/pypi/trectools
BSD 3-Clause "New" or "Revised" License

Fusion using combos #35

Closed Fatima-Haouari closed 2 years ago

Fatima-Haouari commented 2 years ago

Hi, I was trying to get fused runs. It works perfectly fine with reciprocal_rank_fusion using the example you showed, but with the combos function the first issue I noticed is that it does not return a TrecRun object as reciprocal_rank_fusion does, so I had to convert the result to a TrecRun object myself with fused_run=TrecRun(fused_run). However, I am getting the error below:

   f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Kindly advise.

joaopalotti commented 2 years ago

Hi @Fatima-Haouari, thanks for your message. Good to see people from Qatar around. I am currently in your neighborhood, at QCRI. Could you please share more details about your problem? Ideally, a short piece of code that reproduces the error. Thanks

Fatima-Haouari commented 2 years ago

Thanks for your response. Nice to hear you are our neighbor. Please find below the code I am using

from trectools import TrecRun,TrecQrel, TrecEval, fusion

r1 = TrecRun("my_run1")
r2 = TrecRun("my_run2")
qrels= TrecQrel("my_qrels")
fused_run = fusion.combos([r1,r2],strategy="mnz")
fused_run=TrecRun(fused_run)
r1_p10 = TrecEval(r1, qrels).get_precision(depth=10)          
r2_p10 = TrecEval(r2, qrels).get_precision(depth=10)          
fused_run_p10 = TrecEval(fused_run, qrels).get_precision(depth=10)   
print("P@10-- Run 1: %.3f, Run 2: %.3f, Fusion Run: %.3f" % (r1_p10, r2_p10, fused_run_p10))
fused_run.print_subset("my_fused_run.txt", topics=fused_run.topics())

Please find below the error I am getting

   fused_run=TrecRun(fused_run)
  File "/ds/usr/fatima/.conda/envs/myenv/lib/python3.6/site-packages/trectools/trec_run.py", line 21, in __init__
    if filename:
  File "/ds/usr/fatima/.conda/envs/myenv/lib/python3.6/site-packages/pandas/core/generic.py", line 1330, in __nonzero__
    f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
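
The traceback above ends in the `if filename:` check, where a pandas DataFrame is asked for a single truth value. A minimal sketch of that behavior in isolation, using a toy DataFrame and a hypothetical helper `looks_like_path` (neither is part of trectools, and this is not necessarily how the library handles it):

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by the fusion call
# (column names are illustrative).
fused = pd.DataFrame({"query": ["q1"], "docid": ["d1"], "score": [1.0]})


def looks_like_path(obj):
    """Hypothetical helper: only non-empty strings count as filenames."""
    return isinstance(obj, str) and len(obj) > 0


try:
    if fused:  # same kind of check as `if filename:` in the traceback
        pass
    raised = False
    message = ""
except ValueError as err:
    raised = True
    message = str(err)

# Explicit checks avoid asking pandas for a single truth value.
is_path = looks_like_path(fused)
is_empty = fused.empty
```

Here `raised` ends up `True`, while the explicit `isinstance` and `.empty` checks run without error.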

joaopalotti commented 2 years ago

Hi Fatima, thanks for reporting this issue. I have just modified the interface of this module to follow the same pattern. You can use it with the following code:

from trectools import TrecRun,TrecQrel, TrecEval, fusion

r1 = TrecRun("my_run1")
r2 = TrecRun("my_run2")
qrels= TrecQrel("my_qrels")

fused_run = fusion.combos([r1, r2], strategy="mnz")
# or fused_run = fusion.rank_biased_precision_fusion([r1, r2])
# or fused_run = fusion.vector_space_fusion([r1, r2])
# or fused_run = fusion.reciprocal_rank_fusion([r1, r2])

r1_p10 = TrecEval(r1, qrels).get_precision(depth=10)
r2_p10 = TrecEval(r2, qrels).get_precision(depth=10)

fused_run_p10 = TrecEval(fused_run, qrels).get_precision(depth=10)
print("P@10-- Run 1: %.3f, Run 2: %.3f, Fusion Run: %.3f" % (r1_p10, r2_p10, fused_run_p10))

fused_run.print_subset("my_fused_run.txt", topics=fused_run.topics())

Please don't forget to update trectools first. I also added a few TODOs with ideas on how to clean up the code and make it more pandas-like. If you are up for it, please feel free to contribute.

Best,

Joao

Fatima-Haouari commented 2 years ago

Thanks a lot for your help. I managed to get the fused runs now. However, I have an issue with the saved runs. It seems the print_subset function has a problem with the document ID format when the document ID is a long sequence of digits. Please see the example below. I think the IDs need to be saved as strings, not floats.

938526354907201539 Q0 2447462545.0 1 68.17239761352539 comb_mnz
938526354907201539 Q0 2573395934.0 2 59.64200019836426 comb_mnz
938526354907201539 Q0 8.419854421992653e+17 3 47.85719871520996 comb_mnz
938526354907201539 Q0 1.2399882500827095e+18 4 47.61159896850586 comb_mnz
938526354907201539 Q0 66183082.0 5 46.91860008239746 comb_mnz
938526354907201539 Q0 1.1214384475945533e+18 6 45.5049991607666 comb_mnz
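
To illustrate why long numeric IDs are safer as strings, here is a minimal plain-Python sketch. It reuses the topic ID from the output above as an example of a long identifier; the parser `parse_run_line` and the sample line are hypothetical and not part of trectools.

```python
# Topic ID from the output above, used only as an example of a long numeric ID.
doc_id_text = "938526354907201539"

as_float = float(doc_id_text)  # what happens if the column is parsed as float
round_trip = int(as_float)     # nearest representable double, converted back

# float64 has a 53-bit significand, so an 18-digit odd integer cannot survive.
print(round_trip == int(doc_id_text))  # False


def parse_run_line(line):
    """Hypothetical parser for one 6-column TREC run line; IDs stay as text."""
    query, q0, docid, rank, score, tag = line.split()
    return {
        "query": query,
        "q0": q0,
        "docid": docid,
        "rank": int(rank),
        "score": float(score),
        "tag": tag,
    }


record = parse_run_line(
    "938526354907201539 Q0 2447462545 1 68.17239761352539 comb_mnz"
)
```

Keeping `query` and `docid` as text means they are written back exactly as they were read, with no `.0` suffix or exponent notation.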

joaopalotti commented 2 years ago

Interesting, Fatima. I wrote a patch for this issue. Could you please check if this is fixed with the latest code? Thanks for reporting it!

Fatima-Haouari commented 2 years ago

Thanks for your quick response. Unfortunately I am getting the below error now.

    r1_p10 = TrecEval(r1, qrels).get_precision(depth=10)
  File "/ds/usr/fatima/.conda/envs/myenv/lib/python3.6/site-packages/trectools/trec_eval.py", line 670, in get_precision
    merged = pd.merge(run[["query", "docid", "score"]], qrels[["query","docid","rel"]], how="left")
  File "/ds/usr/fatima/.conda/envs/myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 87, in merge
    validate=validate,
  File "/ds/usr/fatima/.conda/envs/myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 656, in __init__
    self._maybe_coerce_merge_keys()
  File "/ds/usr/fatima/.conda/envs/myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1165, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

joaopalotti commented 2 years ago

Thanks again, Fatima. Forcing the docid to be a string has a few consequences, such as the one you described. I have just pushed another patch. If you run into another error, could you please send me the files by email so I can test them quickly here? Thanks!
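
For readers hitting the same merge error, the dtype mismatch can be reproduced in isolation. The sketch below uses toy DataFrames with assumed values and shows one possible workaround (casting both key columns to the same dtype). It is not necessarily what the patch does.

```python
import pandas as pd

# Toy data: the run keeps docid as text, while the qrels were parsed
# with integer docids.
run = pd.DataFrame(
    {"query": [1, 1], "docid": ["2447462545", "2573395934"], "score": [68.2, 59.6]}
)
qrels = pd.DataFrame(
    {"query": [1, 1], "docid": [2447462545, 66183082], "rel": [1, 0]}
)

try:
    pd.merge(run, qrels, how="left")
    merge_failed = False
except ValueError:
    merge_failed = True  # text vs. int64 key columns cannot be merged

# Casting both sides to the same dtype makes the keys comparable again.
qrels_text = qrels.astype({"docid": str})
merged = pd.merge(run, qrels_text, how="left")
```

After the cast, the left merge keeps both run rows. The document without a matching judgment gets a missing value in `rel`.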

Fatima-Haouari commented 2 years ago

Thanks a lot, I really appreciate it. It works perfectly fine now.

joaopalotti commented 2 years ago

Great, glad to hear that! Thanks for using TrecTools and feel free to contribute if you would like to!

Fatima-Haouari commented 2 years ago

Thanks for your efforts. TrecTools is really useful for my research, and I would be happy to contribute to this great work.