rladmstn1714 / CLIcK

CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean

question about reproducing results #6

Closed taeminlee closed 1 day ago

taeminlee commented 3 months ago

Hello, I am attempting to reproduce the results from the paper using the lm-evaluation-harness framework. I used the prompts from Chapter 4 of the paper and selected the answer option whose letter (A, B, C, or D) has the highest log-likelihood.

detailed implementation method here.
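The selection rule described above can be sketched roughly as follows. This is only an illustration, not the linked implementation: the prompt layout is copied from the request in the execution log below, while the function names and the log-likelihood values are hypothetical stand-ins for what the harness would return.

```python
from typing import List, Mapping, Sequence

LABELS = ("A", "B", "C", "D")

# Instruction text as it appears in the logged request.
INSTRUCTION = (
    " 주어진 질문을 천천히 읽고, 질문에 대한 적절한 정답을 "
    "A, B, C, D중에 골라 알파벳 하나로 답하시오.\n"
    " (Read the given Question, and choose the correct answer from "
    "options A, B, C, or D. Respond with a single alphabet.)\n"
)


def build_prompt(question: str, choices: Sequence[str]) -> str:
    """Assemble the context string in the same layout as the logged request."""
    options = " ".join(f"{label}: {choice}" for label, choice in zip(LABELS, choices))
    return (
        f"{INSTRUCTION} 질문(Question): {question}\n"
        f" 보기(Options): {options} \n 정답(Answer): "
    )


def letter_continuations() -> List[str]:
    """Candidate continuations: one letter each, with a leading space (e.g. ' D')."""
    return [f" {label}" for label in LABELS]


def pick_answer(loglikelihoods: Mapping[str, float]) -> str:
    """Return the label whose continuation received the highest log-likelihood."""
    return max(loglikelihoods, key=loglikelihoods.get)


# Hypothetical scores; real values would come from the model.
scores = {"A": -2.3, "B": -1.1, "C": -3.0, "D": -1.7}
print(pick_answer(scores))
```

Because every candidate here is a single letter, all continuations have the same length, so any length-based normalization would not change which one wins.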

Using Polyglot 1.3B for the measurement, I obtained the following results:

These results are somewhat similar to the ones you shared, but there are slight differences.

I included both Korean and English in the prompt section, and I am wondering if this caused the issue.

I have attached the execution logs for your review.

Thank you.


2024-05-22:18:11:36,275 INFO [evaluator_utils.py:142] Request: Instance(request_type='loglikelihood', doc={'question': '다음은 한국의 대중문화에 대한 문제이다.\n세계 영화제에서 수상한 영화와 그 수상내용을 잘못 연결한 것은?', 'choices': ['임권택의 <서편제> - 베를린영화제 감독상', '김기덕의 <빈집> - 베니스영화제 감독상', '이창동의 <밀양> - 칸영화제 여우주연상', '박찬욱의 <올드보이> - 칸영화제 심사위원대상'], 'answer': '임권택의 <서편제> - 베를린영화제 감독상', 'id': 'Kedu_popular_1', 'paragraph': '', 'labels': ['A', 'B', 'C', 'D'], 'label': 'A'}, arguments=(' 주어진 질문을 천천히 읽고, 질문에 대한 적절한 정답을 A, B, C, D중에 골라 알파벳 하나로 답하시오.\n (Read the given Question, and choose the correct answer from options A, B, C, or D. Respond with a single alphabet.)\n 질문(Question): 다음은 한국의 대중문화에 대한 문제이다.\n세계 영화제에서 수상한 영화와 그 수상내용을 잘못 연결한 것은?\n 보기(Options): A: 임권택의 <서편제> - 베를린영화제 감독상 B: 김기덕의 <빈집> - 베니스영화제 감독상 C: 이창동의 <밀양> - 칸영화제 여우주연상 D: 박찬욱의 <올드보이> - 칸영화제 심사위원대상 \n 정답(Answer): ', ' D'), idx=3, metadata=('click_culture_popular', 0, 1), resps=[], filtered_resps={}, task_name='click_culture_popular', doc_id=0, repeats=1)
2024-05-22:18:11:36,275 INFO [evaluator.py:395] Running loglikelihood requests
Running loglikelihood requests: 100% ██████████████████████████████████████████████████ 8236/8236 [00:32<00:00, 256.78it/s]
2024-05-22:18:12:22,720 INFO [evaluation_tracker.py:132] Saving results aggregated
hf (pretrained=EleutherAI/polyglot-ko-1.3b), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 16

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| click_culture | N/A | none | 0 | acc_norm | 0.3480 | ± 0.0128 |
| | | none | 0 | acc | 0.3480 | ± 0.0128 |
| - click_culture_economy | 1 | none | 0 | acc | 0.4407 | ± 0.0652 |
| | | none | 0 | acc_norm | 0.4407 | ± 0.0652 |
| - click_culture_geography | 1 | none | 0 | acc | 0.3282 | ± 0.0412 |
| | | none | 0 | acc_norm | 0.3282 | ± 0.0412 |
| - click_culture_history | 1 | none | 0 | acc | 0.2214 | ± 0.0249 |
| | | none | 0 | acc_norm | 0.2214 | ± 0.0249 |
| - click_culture_law | 1 | none | 0 | acc | 0.3151 | ± 0.0315 |
| | | none | 0 | acc_norm | 0.3151 | ± 0.0315 |
| - click_culture_politics | 1 | none | 0 | acc | 0.4048 | ± 0.0539 |
| | | none | 0 | acc_norm | 0.4048 | ± 0.0539 |
| - click_culture_popular | 1 | none | 0 | acc | 0.4390 | ± 0.0785 |
| | | none | 0 | acc_norm | 0.4390 | ± 0.0785 |
| - click_culture_society | 1 | none | 0 | acc | 0.4466 | ± 0.0283 |
| | | none | 0 | acc_norm | 0.4466 | ± 0.0283 |
| - click_culture_tradition | 1 | none | 0 | acc | 0.3514 | ± 0.0321 |
| | | none | 0 | acc_norm | 0.3514 | ± 0.0321 |
| click_language | N/A | none | 0 | acc_norm | 0.2000 | ± 0.0157 |
| | | none | 0 | acc | 0.2000 | ± 0.0157 |
| - click_language_functional | 1 | none | 0 | acc | 0.1429 | ± 0.0305 |
| | | none | 0 | acc_norm | 0.1429 | ± 0.0305 |
| - click_language_grammar | 1 | none | 0 | acc | 0.2241 | ± 0.0274 |
| | | none | 0 | acc_norm | 0.2241 | ± 0.0274 |
| - click_language_textual | 1 | none | 0 | acc | 0.2070 | ± 0.0240 |
| | | none | 0 | acc_norm | 0.2070 | ± 0.0240 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| click_culture | N/A | none | 0 | acc_norm | 0.348 | ± 0.0128 |
| | | none | 0 | acc | 0.348 | ± 0.0128 |
| click_language | N/A | none | 0 | acc_norm | 0.200 | ± 0.0157 |
| | | none | 0 | acc | 0.200 | ± 0.0157 |

scottsuk0306 commented 3 months ago

Hi @taeminlee, thank you for your further work based on CLIcK! We actually scored not just the letter but the concatenation of the letter and the option string ("{alphabet}. {option_string}"), and selected the candidate with the highest log-likelihood.

For Polyglot 1.3B, I think the model is prone to positional bias because it is so small. This may explain the discrepancy between scoring the log-likelihood of the letter alone and scoring alphabet + option_string.
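As a rough sketch of the "{alphabet}. {option_string}" scoring described above (not the exact evaluation code): the helper names and all scores below are hypothetical. The `length_normalize` flag is an assumption added purely for illustration, since full option strings differ in length and a longer continuation generally receives a lower summed log-likelihood; nothing in this thread states that scores were normalized.

```python
from typing import List, Optional, Sequence

LABELS = ("A", "B", "C", "D")


def option_continuations(choices: Sequence[str]) -> List[str]:
    """Build one '{alphabet}. {option_string}' continuation per choice."""
    return [f"{label}. {choice}" for label, choice in zip(LABELS, choices)]


def pick_label(
    scores: Sequence[float],
    continuations: Optional[Sequence[str]] = None,
    length_normalize: bool = False,
) -> str:
    """Return the label with the highest (optionally per-character) log-likelihood."""
    if length_normalize:
        if continuations is None:
            raise ValueError("continuations are required for length normalization")
        scores = [s / len(c) for s, c in zip(scores, continuations)]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best]


choices = ["cat", "a much longer option", "dog", "bird"]
continuations = option_continuations(choices)

# Hypothetical summed log-likelihoods, one per continuation.
scores = [-6.0, -11.5, -6.6, -9.1]

print(pick_label(scores))                       # raw sum
print(pick_label(scores, continuations, True))  # per-character average
```

With these made-up numbers the two variants disagree, which shows how the choice of continuation and normalization can shift accuracy even when the prompt is unchanged.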

taeminlee commented 3 months ago

Thank you for your response. I will change the format of the choices and run it again. One more question: since the prompt in the paper asks for a single-letter answer, it seems the prompt wording would also need to change. Below is the prompt I created with reference to the paper; how should it be modified?

주어진 질문을 천천히 읽고, 질문에 대한 적절한 정답을 A, B, C, D중에 골라 알파벳 하나로 답하시오.\n (Read the given Question, and choose the correct answer from options A, B, C, or D. Respond with a single alphabet.)\n 질문(Question):

rladmstn1714 commented 3 months ago

Although we measured the highest log-likelihood of "{alphabet}. {option_string}", we used the same prompt. If you want to revise the original evaluation procedure, two possible approaches are:

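  1. Revise the prompt from " A, B, C, D중에 골라 알파벳 하나로 답하시오" ("choose from A, B, C, D and answer with a single letter") to " A, B, C, D중에 골라 답하시오" ("choose from A, B, C, D and answer").
  2. Compute the log-likelihood only for "{alphabet}" and pick the highest.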
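One way to keep the instruction and the scored continuation consistent under either option is sketched below. The mode names, the `EvalSetup` class, and `make_setup` are hypothetical and only illustrate the pairing; the two instruction strings are the ones quoted in the list above.

```python
from dataclasses import dataclass
from typing import List, Sequence

LABELS = ("A", "B", "C", "D")

# Instruction endings quoted in the two options above.
SINGLE_LETTER_INSTRUCTION = " A, B, C, D중에 골라 알파벳 하나로 답하시오"
OPEN_INSTRUCTION = " A, B, C, D중에 골라 답하시오"


@dataclass(frozen=True)
class EvalSetup:
    instruction: str          # wording used at the end of the instruction
    continuations: List[str]  # strings whose log-likelihood is compared


def make_setup(mode: str, choices: Sequence[str]) -> EvalSetup:
    """Pair each instruction wording with the continuation format it asks for."""
    if mode == "letter_and_option":
        # Option 1: drop the single-letter wording, score '{alphabet}. {option_string}'.
        conts = [f"{label}. {choice}" for label, choice in zip(LABELS, choices)]
        return EvalSetup(OPEN_INSTRUCTION, conts)
    if mode == "letter_only":
        # Option 2: keep the single-letter wording, score only '{alphabet}'.
        return EvalSetup(SINGLE_LETTER_INSTRUCTION, list(LABELS))
    raise ValueError(f"unknown mode: {mode!r}")


setup = make_setup("letter_and_option", ["w", "x", "y", "z"])
print(setup.continuations)
```

Note that neither pairing matches the original procedure exactly, which kept the single-letter wording while scoring the letter plus the option string.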