There are two issues when reproducing the experiments.
baseline (P-zero)
The P-zero accuracy of LLaMA2 on KAssess is only 35.22%, much lower than the paper's 50.00%.
(sampling_params = SamplingParams(temperature=0.0, top_p=0.1, max_tokens=10))
{"success": 0.35215812827753123}
search
I notice that when qs contains multiple questions, the output is a new sentence rather than the answer to the original question. For example:
qs
0 = 'Estelita Rodriguez was an accomplished'
1 = 'Estelita Rodriguez was known for being a'
2 = 'Estelita Rodriguez pursued a career as a'
3 = 'Estelita Rodriguez was a'
tokenizer.decode(curr_q_tensor)
0 = 'Estelita Rodriguez was an accomplished</s></s></s></s>'
1 = 'Estelita Rodriguez was known for being a</s></s>'
2 = 'Estelita Rodriguez pursued a career as a</s>'
3 = 'Estelita Rodriguez was a</s></s></s></s></s>'
tokenizer.decode(decode_toks)
0 = '<s> Estelita Rodriguez was an accomplished actress'
1 = '<s>\n Estelita Rodriguez was known for'
2 = '<s>\nEstelita Rodriguez was a Cub'
3 = '<s> She was born on August 23, '
However, when qs contains only one question, the output is the answer to that question, for example:
qs
0 = 'Estelita Rodriguez was an accomplished'
tokenizer.decode(curr_q_tensor)
0 = 'Estelita Rodriguez was an accomplished'
tokenizer.decode(decode_toks)
0 = 'actress who was best known for her roles in the TV'
Is this due to the padding strategy?
Generating a new sentence instead of answering the original question would lower the answer-matching success rate, right?
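The decoded batch tensors end in `</s>` pad tokens, which suggests right padding: for the shorter prompts, generation starts after a run of `</s>`, so the model begins a fresh sentence instead of continuing the prompt. A minimal sketch of the difference, using pure Python with a hypothetical pad token id (2, standing in for `</s>`); in Hugging Face transformers the usual fix for decoder-only batched generation is `tokenizer.padding_side = "left"`:

```python
# Sketch of left- vs right-padding for batched decoder-only generation.
# Token ids are hypothetical; PAD_ID = 2 stands in for </s>.
PAD_ID = 2

def pad_batch(batch, side="right"):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        pad = [PAD_ID] * (max_len - len(seq))
        padded.append(seq + pad if side == "right" else pad + seq)
    return padded

prompts = [[5, 6, 7, 8], [5, 6]]  # two prompts of different lengths

# Right padding: the shorter prompt ends in pad/EOS tokens, so the next
# generated token follows </s> and the model starts a new sentence.
print(pad_batch(prompts, side="right")[1])  # [5, 6, 2, 2]

# Left padding: every prompt ends at its real last token, so generation
# continues the original question.
print(pad_batch(prompts, side="left")[1])   # [2, 2, 5, 6]
```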
The difference in the baseline result mainly comes from the length of the generated answer, which is controlled by max_tokens in the sampling parameters at line 191. Try a longer response length, e.g. max_tokens=50.