stanfordnlp / pyreft

ReFT: Representation Finetuning for Language Models
https://arxiv.org/abs/2404.03592
Apache License 2.0

[P1] If we set output_original_output to True in intervenable.generate, can we get the model performance without intervention? #111

Closed mrsempress closed 1 week ago

mrsempress commented 2 weeks ago

In examples/loreft/compute_metrics.py, line 211 reads ori_response, steered_response = intervenable.generate(**generation_args). If I set generation_args["output_original_output"] = True and treat ori_response as the un-intervened counterpart of steered_response, can I use it to measure the model's performance without intervention?
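For intuition, the return contract being asked about can be illustrated with a toy stand-in (this is an assumption based on the linked pyvene source, not pyvene itself; fake_generate is a hypothetical stub, and the real generate returns token sequences, not strings):

```python
# Toy stand-in for a (base_outputs, counterfactual_outputs) generate contract:
# the first slot is only populated when output_original_output=True.
def fake_generate(prompt, output_original_output=False):
    steered = prompt + " [steered]"
    original = prompt + " [original]" if output_original_output else None
    return original, steered

# Without the flag, the original slot is empty:
ori, steered = fake_generate("Q: 2+2?")
assert ori is None

# With the flag, ori holds the un-intervened generation:
ori, steered = fake_generate("Q: 2+2?", output_original_output=True)
assert ori == "Q: 2+2? [original]"
```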

I tested with the command

python examples/loreft/train.py -task gsm8k \
-model yahma/llama-7b-hf \
-seed 42 -l all -r 4 -p f7+l7 -e 12 -lr 9e-4 \
-type NodireftIntervention \
-gradient_accumulation_steps 4 \
-batch_size 8 \
-eval_batch_size 4 \
--dropout 0.05 \
--test_split validation \
--use_normalized_template \
--greedy_decoding \
--warmup_ratio 0.00 \
--weight_decay 0.06

and the accuracy is 5.0, while the reported accuracy of llama-7b on GSM8K is 11.0. I also tested llama2-7b, which gives 26.3 versus a reported 14.6 on GSM8K; llama3-8b gives 23.7, but I could not find a public GSM8K number for llama3-8b. Is something wrong, or do I misunderstand the meaning of output_original_output?

frankaging commented 1 week ago

@mrsempress Sorry about the late reply, but yes, I think that will be the original (un-intervened) outputs.

The code to handle this flag is in pyvene as in: https://github.com/stanfordnlp/pyvene/blob/main/pyvene/models/intervenable_base.py#L1623

But also feel free to use model.generate(), where model is the original Hugging Face model rather than the ReFT model.
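Either way, the only difference between the steered and original paths is whether the representation edit is applied to the hidden state. A minimal NumPy sketch of a LoReFT-style edit (following the h + R^T(Wh + b - Rh) form from the ReFT paper; the dimensions and random parameters here are illustrative, and a trained R would have orthonormal rows):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                        # hidden size and low rank (illustrative)
R = rng.standard_normal((r, d))    # low-rank projection
W = rng.standard_normal((r, d))    # learned linear map
b = rng.standard_normal(r)         # learned bias

def intervene(h):
    # LoReFT edit: move h's projection onto R's rowspace toward W h + b.
    return h + R.T @ (W @ h + b - R @ h)

h = rng.standard_normal(d)
steered = intervene(h)   # hidden state on the intervened path
original = h             # un-intervened path: the state is left untouched
assert steered.shape == original.shape
assert not np.allclose(steered, original)
```

So "original output" simply means generation from the untouched hidden states, which is why decoding with the plain base model gives the same baseline.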