The judgment of tie is not handled appropriately

stjohn2007 commented 10 months ago

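Thank you for developing this repository!

After trying it out, I have a question about how ties are handled. For a tie, the judge model is instructed to output [[C]], but it looks like that result is discarded in the subsequent processing.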

https://github.com/ku-nlp/ja-vicuna-qa-benchmark/blob/main/llm_judge/common.py#L177
https://github.com/ku-nlp/ja-vicuna-qa-benchmark/blob/main/llm_judge/common.py#L184

Could you tell me whether this behavior is intended? Thank you in advance.
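
For readers following along, here is a minimal sketch of verdict parsing that keeps the tie result instead of dropping it. It is not the code in `llm_judge/common.py`: the function name `parse_verdict`, the winner labels, and the `[[A]]`/`[[B]]` tags are assumptions for illustration; only the `[[C]]` tie tag is taken from this thread.

```python
import re

# Hypothetical mapping from verdict tag to winner label.
# Only the [[C]] tie tag is confirmed by this thread; the rest is assumed.
VERDICT_TO_WINNER = {"A": "model_1", "B": "model_2", "C": "tie"}


def parse_verdict(judgment: str) -> str:
    """Return the winner for a judge output, or "error" if no verdict tag is found."""
    match = re.search(r"\[\[([ABC])\]\]", judgment)
    if match is None:
        return "error"
    # A [[C]] verdict maps to "tie" here rather than being treated as an error.
    return VERDICT_TO_WINNER[match.group(1)]


print(parse_verdict("Both answers are comparable. [[C]]"))  # tie
```

The point of the sketch is that a tie is a valid outcome with its own label, so it can be counted separately from unparseable outputs.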

hkiyomaru commented 10 months ago

Thank you for the report. This is not the intended behavior. I will fix it as soon as possible.

hkiyomaru commented 10 months ago

I confirmed that ties were handled correctly as of v1.0.0. The bug affects v2.0.0 and later.

hkiyomaru commented 10 months ago

I fixed the bug and released the fix as v2.0.3.

Here are the scores before and after the fix for the models included in the repository.

Before the fix:

                                                    model_1                   model_2  win_rate  lose_rate  adjusted_win_rate
3                    tokyotech-llm--Swallow-70b-instruct-hf  openai--text-davinci-003      46.2       42.5               51.9
5  llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0  openai--text-davinci-003      28.7       62.5               33.1
0             rinna--japanese-gpt-neox-3.6b-instruction-ppo  openai--text-davinci-003      13.8       76.2               18.8
1          rinna--japanese-gpt-neox-3.6b-instruction-sft-v2  openai--text-davinci-003       8.8       82.5               13.1
4                                 cyberagent--calm2-7b-chat  openai--text-davinci-003       6.2       81.2               12.5
2  llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0  openai--text-davinci-003      10.0       87.5               11.2

After the fix:

                                                    model_1                   model_2  win_rate  lose_rate  adjusted_win_rate
3                    tokyotech-llm--Swallow-70b-instruct-hf  openai--text-davinci-003      46.2       42.5               51.9
5  llm-jp--llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0  openai--text-davinci-003      27.5       60.0               33.8
0             rinna--japanese-gpt-neox-3.6b-instruction-ppo  openai--text-davinci-003      13.8       75.0               19.4
1          rinna--japanese-gpt-neox-3.6b-instruction-sft-v2  openai--text-davinci-003       8.8       81.2               13.8
2  llm-jp--llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0  openai--text-davinci-003      10.0       82.5               13.8
4                                 cyberagent--calm2-7b-chat  openai--text-davinci-003       6.2       78.8               13.8
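
As a reading aid for the tables above: the `adjusted_win_rate` column is numerically consistent with counting each tie as half a win, i.e. `win_rate + tie_rate / 2`, where the tie rate is whatever remains after wins and losses. The sketch below is inferred from the table values, not taken from the repository, and it assumes every judgment is a win, a loss, or a tie (no parse errors). The function name is hypothetical.

```python
def adjusted_win_rate(win_rate: float, lose_rate: float) -> float:
    """Count each tie as half a win. Rates are percentages in [0, 100]."""
    # Assumption: everything that is neither a win nor a loss is a tie.
    tie_rate = 100.0 - win_rate - lose_rate
    return win_rate + tie_rate / 2.0


# llm-jp-13b-instruct-full row after the fix: win 10.0, lose 82.5
print(adjusted_win_rate(10.0, 82.5))  # 13.75, shown as 13.8 in the table
```

Under this reading, ties that were previously discarded now show up as a lower `lose_rate` and a higher `adjusted_win_rate`, which matches the direction of the changes between the two tables.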

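Along with the fix, I added a script that reads the evaluation result files, updates the winner information, and overwrites the files. Please use it if you have results produced with v2.0.0 or later.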

python llm_judge/reparse_pairwise_judgement.py

stjohn2007 commented 10 months ago

Thank you for taking care of this!

ku-nlp / ja-vicuna-qa-benchmark

The judgment of tie is not handled appropriately #47