question about reproducing results

taeminlee commented 3 months ago

Hello, I am attempting to reproduce the results from the paper using the lm-evaluation-harness framework. I used the prompts from Chapter 4 of the paper and selected the answer option whose letter (A, B, C, or D) has the highest log-likelihood.

detailed implementation method here.

Using Polyglot 1.3B for the measurement, I obtained the following results:

Culture: 0.348
Language: 0.200

These results are somewhat similar to the ones you shared, but there are slight differences.

I included both Korean and English in the prompt section, and I am wondering if this caused the issue.

I have attached the execution logs for your review.

Thank you.

2024-05-22:18:11:36,275 INFO [evaluator_utils.py:142] Request: Instance(request_type='loglikelihood', doc={'question': '다음은 한국의 대중문화에 대한 문제이다.\n세계 영화제에서 수상한 영화와 그 수상내용을 잘못 연결한 것은?', 'choices': ['임권택의 <서편제> - 베를린영화제 감독상', '김기덕의 <빈집> - 베니스영화제 감독상', '이창동의 <밀양> - 칸영화제 여우주연상', '박찬욱의 <올드보이> - 칸영화제 심사위원대상'], 'answer': '임권택의 <서편제> - 베를린영화제 감독상', 'id': 'Kedu_popular_1', 'paragraph': '', 'labels': ['A', 'B', 'C', 'D'], 'label': 'A'}, arguments=(' 주어진 질문을 천천히 읽고, 질문에 대한 적절한 정답을 A, B, C, D중에 골라 알파벳 하나로 답하시오.\n (Read the given Question, and choose the correct answer from options A, B, C, or D. Respond with a single alphabet.)\n 질문(Question): 다음은 한국의 대중문화에 대한 문제이다.\n세계 영화제에서 수상한 영화와 그 수상내용을 잘못 연결한 것은?\n 보기(Options): A: 임권택의 <서편제> - 베를린영화제 감독상 B: 김기덕의 <빈집> - 베니스영화제 감독상 C: 이창동의 <밀양> - 칸영화제 여우주연상 D: 박찬욱의 <올드보이> - 칸영화제 심사위원대상 \n 정답(Answer): ', ' D'), idx=3, metadata=('click_culture_popular', 0, 1), resps=[], filtered_resps={}, task_name='click_culture_popular', doc_id=0, repeats=1)
2024-05-22:18:11:36,275 INFO [evaluator.py:395] Running loglikelihood requests
Running loglikelihood requests: 100%	██████████████████████████████████████████████████	8236/8236 [00:32<00:00, 256.78it/s]
2024-05-22:18:12:22,720 INFO [evaluation_tracker.py:132] Saving results aggregated
hf (pretrained=EleutherAI/polyglot-ko-1.3b), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 16

Tasks	Version	Filter	n-shot	Metric	Value		Stderr
click_culture	N/A	none	acc_norm	0.3480	±	0.0128
		none	acc	0.3480	±	0.0128
- click_culture_economy	1	none	acc	0.4407	±	0.0652
		none	acc_norm	0.4407	±	0.0652
- click_culture_geography	1	none	acc	0.3282	±	0.0412
		none	acc_norm	0.3282	±	0.0412
- click_culture_history	1	none	acc	0.2214	±	0.0249
		none	acc_norm	0.2214	±	0.0249
- click_culture_law	1	none	acc	0.3151	±	0.0315
		none	acc_norm	0.3151	±	0.0315
- click_culture_politics	1	none	acc	0.4048	±	0.0539
		none	acc_norm	0.4048	±	0.0539
- click_culture_popular	1	none	acc	0.4390	±	0.0785
		none	acc_norm	0.4390	±	0.0785
- click_culture_society	1	none	acc	0.4466	±	0.0283
		none	acc_norm	0.4466	±	0.0283
- click_culture_tradition	1	none	acc	0.3514	±	0.0321
		none	acc_norm	0.3514	±	0.0321
click_language	N/A	none	acc_norm	0.2000	±	0.0157
		none	acc	0.2000	±	0.0157
- click_language_functional	1	none	acc	0.1429	±	0.0305
		none	acc_norm	0.1429	±	0.0305
- click_language_grammar	1	none	acc	0.2241	±	0.0274
		none	acc_norm	0.2241	±	0.0274
- click_language_textual	1	none	acc	0.2070	±	0.0240
		none	acc_norm	0.2070	±	0.0240

Groups	Version	Filter	Metric	Value		Stderr
click_culture	N/A	none	acc_norm	0.348	±	0.0128
		none	acc	0.348	±	0.0128
click_language	N/A	none	acc_norm	0.200	±	0.0157
		none	acc	0.200	±	0.0157

scottsuk0306 commented 3 months ago

Hi @taeminlee, thank you for your further work based on CLIcK! We actually measured the log-likelihood not of the letter alone but of the letter concatenated with the option string ("{alphabet}. {option_string}"), and took the highest.

For Polyglot 1.3B, I think the model is prone to positional bias because it is so small. This may explain the discrepancy between scoring the log-likelihood of the letter only and scoring letter + option_string.
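
To make the difference between the two scoring modes concrete, here is a minimal Python sketch. It is not the CLIcK authors' evaluation code: `build_continuations`, `pick_answer`, and the `loglikelihood` callable are hypothetical names, and the leading space before each continuation is an assumption based on the `' D'` continuation visible in the log above.

```python
from typing import Callable, List, Sequence

LETTERS = ["A", "B", "C", "D"]


def build_continuations(choices: Sequence[str], with_option_text: bool) -> List[str]:
    """Return one candidate continuation per answer option.

    with_option_text=False -> " A", " B", ...   (letter only)
    with_option_text=True  -> " A. <option>", ... ("{alphabet}. {option_string}")
    The leading space is an assumption, not confirmed by the maintainers.
    """
    continuations = []
    for letter, choice in zip(LETTERS, choices):
        if with_option_text:
            continuations.append(f" {letter}. {choice}")
        else:
            continuations.append(f" {letter}")
    return continuations


def pick_answer(
    context: str,
    choices: Sequence[str],
    loglikelihood: Callable[[str, str], float],
    with_option_text: bool,
) -> str:
    """Score every continuation and return the letter with the highest score.

    `loglikelihood(context, continuation)` is a stand-in for whatever the
    evaluation framework or model wrapper provides.
    """
    continuations = build_continuations(choices, with_option_text)
    scores = [loglikelihood(context, cont) for cont in continuations]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return LETTERS[best_index]
```

With the option text included, the score is a sum over more tokens, so longer options tend to get lower raw log-likelihoods. That is one reason the two modes can rank options differently, in addition to the positional bias mentioned above.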

taeminlee commented 3 months ago

Thank you for your response. I will change the format of the choices and run it again. One more question: since the prompt in the paper asks for a single-letter answer, I assume the prompt wording also needs to change. Below is the prompt I created with reference to the paper. How should it be modified?

주어진 질문을 천천히 읽고, 질문에 대한 적절한 정답을 A, B, C, D중에 골라 알파벳 하나로 답하시오.\n (Read the given Question, and choose the correct answer from options A, B, C, or D. Respond with a single alphabet.)\n 질문(Question):

rladmstn1714 commented 3 months ago

Although we scored "{alphabet}. {option_string}" by log-likelihood, we used the same prompt. If you want to revise the original evaluation procedure, possible options are:

- Revise the prompt from " A, B, C, D중에 골라 알파벳 하나로 답하시오" to " A, B, C, D중에 골라 답하시오", which drops the instruction to answer with a single letter.
- Compute the log-likelihood only for "{alphabet}" and take the highest.
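
As a rough illustration of the first option, the sketch below builds the prompt in the layout shown in the log earlier in this thread, with a flag to switch between the original instruction and the revised one. `build_prompt` and the constant names are hypothetical. The Korean revision is the one quoted above, while the shortened English sentence in the revised variant is an assumption, since only the Korean change was specified.

```python
from typing import Sequence

LETTERS = ("A", "B", "C", "D")

# Original instruction, copied from the logged request.
INSTRUCTION_SINGLE_LETTER = (
    " 주어진 질문을 천천히 읽고, 질문에 대한 적절한 정답을 A, B, C, D중에 골라 알파벳 하나로 답하시오.\n"
    " (Read the given Question, and choose the correct answer from options A, B, C, or D."
    " Respond with a single alphabet.)\n"
)

# Revised instruction: the Korean follows the suggestion above; dropping the
# last English sentence is an assumption made for consistency.
INSTRUCTION_REVISED = (
    " 주어진 질문을 천천히 읽고, 질문에 대한 적절한 정답을 A, B, C, D중에 골라 답하시오.\n"
    " (Read the given Question, and choose the correct answer from options A, B, C, or D.)\n"
)


def build_prompt(question: str, choices: Sequence[str], single_letter: bool = True) -> str:
    """Assemble the context string in the same layout as the logged request.

    The optional 'paragraph' field of the dataset is ignored here because it
    is empty in the logged example.
    """
    instruction = INSTRUCTION_SINGLE_LETTER if single_letter else INSTRUCTION_REVISED
    options = " ".join(f"{letter}: {choice}" for letter, choice in zip(LETTERS, choices))
    return (
        f"{instruction}"
        f" 질문(Question): {question}\n"
        f" 보기(Options): {options} \n"
        f" 정답(Answer): "
    )
```

Pairing `single_letter=True` with letter-only scoring, or `single_letter=False` with "{alphabet}. {option_string}" scoring, keeps the instruction consistent with what is actually being scored.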

rladmstn1714 / CLIcK

question about reproducing results #6