pkulcwmzx / knowledge-boundary

[ACL 2024] Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation
MIT License

Issues about reproducing the baseline and search experiments #2

Open BUGLI27 opened 2 weeks ago

BUGLI27 commented 2 weeks ago

There are two issues when reproducing the experiments.

baseline (P-zero)

The P-zero accuracy of LLaMA2 on KAssess is only 35.22%, much lower than the 50.00% reported in the paper. I used sampling_params = SamplingParams(temperature=0.0, top_p=0.1, max_tokens=10), and the result was {"success": 0.35215812827753123}.

search

I noticed that when qs contains multiple questions, the model outputs a new sentence instead of a continuation that answers the original question. For example:

qs
0 = 'Estelita Rodriguez was an accomplished'
1 = 'Estelita Rodriguez was known for being a'
2 = 'Estelita Rodriguez pursued a career as a'
3 = 'Estelita Rodriguez was a'

tokenizer.decode(curr_q_tensor)
0 = 'Estelita Rodriguez was an accomplished</s></s></s></s>'
1 = 'Estelita Rodriguez was known for being a</s></s>'
2 = 'Estelita Rodriguez pursued a career as a</s>'
3 = 'Estelita Rodriguez was a</s></s></s></s></s>'

tokenizer.decode(decode_toks)
0 = '<s> Estelita Rodriguez was an accomplished actress'
1 = '<s>\n Estelita Rodriguez was known for'
2 = '<s>\nEstelita Rodriguez was a Cub'
3 = '<s> She was born on August 23, '

However, when qs contains a single question, the output does answer that question. For example:

qs
0 = 'Estelita Rodriguez was an accomplished'

tokenizer.decode(curr_q_tensor)
0 = 'Estelita Rodriguez was an accomplished'

tokenizer.decode(decode_toks)
0 = ('actress who was best known for her roles in the TV')

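Is this due to the padding strategy? In the batched case the shorter prompts are right-padded with </s>, so generation starts after the padding tokens rather than directly after the prompt. Generating a new sentence instead of answering the original question would lower the success rate of answer matching, right?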
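To make the padding question concrete, here is a minimal sketch of left versus right padding for a batch of prompts. It uses plain Python lists and made-up token ids (2 stands in for </s>); pad_batch is a hypothetical helper and is not taken from the repository. With right padding the last position of a short prompt is a pad token, which matches the decoded curr_q_tensor above; with left padding the last position is the final prompt token, so a decoder-only model continues from the prompt itself.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    seqs: Sequence[Sequence[int]], pad_id: int, side: str = "left"
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-id sequences to a common length.

    Returns (padded_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding. Hypothetical helper for illustration only.
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    max_len = max((len(s) for s in seqs), default=0)
    padded, mask = [], []
    for s in seqs:
        n_pad = max_len - len(s)
        pads, zeros, ones = [pad_id] * n_pad, [0] * n_pad, [1] * len(s)
        if side == "left":
            padded.append(pads + list(s))
            mask.append(zeros + ones)
        else:
            padded.append(list(s) + pads)
            mask.append(ones + zeros)
    return padded, mask


# Made-up token ids for two prompts of different lengths.
batch = [[11, 12, 13], [21, 22, 23, 24, 25]]
left_ids, left_mask = pad_batch(batch, pad_id=2, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=2, side="right")

# Right padding: the short prompt ends in pad tokens.
# Left padding: the short prompt ends in its own last token.
print(right_ids[0], left_ids[0])
```

If the prompts are batched through a Hugging Face tokenizer (an assumption about this code path), the equivalent setting is tokenizer.padding_side = "left", together with passing the attention mask to generation.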

pkulcwmzx commented 1 week ago

The gap in the baseline result mainly comes from the length of the generated answer, which is controlled by max_tokens in the sampling parameters (line 191). Try a longer response length, such as max_tokens=50.
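A rough sketch of why the generation length can change the measured accuracy: if the answer only appears after the cutoff, a substring match on the truncated output fails. The example below assumes the metric is a case-insensitive substring match and uses whitespace-separated words as a stand-in for model tokens; both are assumptions, and the generation string is invented for illustration.

```python
def contains_answer(generation: str, answer: str, max_tokens: int) -> bool:
    """Check whether `answer` appears in the first `max_tokens` words.

    Words split on whitespace approximate model tokens here; the real
    tokenizer and matching rule in the repository may differ.
    """
    truncated = " ".join(generation.split()[:max_tokens])
    return answer.lower() in truncated.lower()


# Invented continuation in which the answer appears late (16th word).
generation = (
    "someone who, after many years of work on stage and screen, "
    "became known as an actress"
)

short_hit = contains_answer(generation, "actress", max_tokens=10)
long_hit = contains_answer(generation, "actress", max_tokens=50)
print(short_hit, long_hit)
```

In the setup from the issue, this corresponds to changing max_tokens=10 to max_tokens=50 in the SamplingParams call.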