I've run a number of experiments, and it looks like most of the performance gain comes from enabling pos_shift. The output generated by the following script also looks fine to me:

python examples/run_streaming_llama.py --enable_streaming --recent_size 128 --start_size 0

Am I doing something wrong? (Could the choice of model or dataset matter?) Is it fair to conclude that the major factor harming generation performance is incorrectly applied positional encoding?
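To make the pos_shift question concrete, here is a minimal, hypothetical sketch (not the repo's actual implementation) of the idea: after the streaming cache evicts tokens, positions are assigned by slot index inside the kept cache rather than by absolute position in the stream, so the model never sees positions beyond start_size + recent_size.

```python
# Hypothetical illustration of pos_shift, assuming the StreamingLLM-style
# cache layout: keep `start_size` initial "attention sink" tokens plus the
# last `recent_size` tokens, then re-index positions from 0 over what's kept.
def cache_positions(absolute_token_ids, start_size, recent_size):
    """Return (kept_token, rope_position) pairs after cache eviction."""
    n = len(absolute_token_ids)
    kept = (absolute_token_ids[:start_size]
            + absolute_token_ids[max(start_size, n - recent_size):])
    # pos_shift: positions restart from 0 over the kept tokens, instead of
    # using each token's absolute index in the (unbounded) stream.
    return [(tok, pos) for pos, tok in enumerate(kept)]

# 10-token stream with start_size=0, recent_size=4: only the last 4 tokens
# survive, re-indexed 0..3 instead of their absolute positions 6..9.
print(cache_positions(list(range(10)), start_size=0, recent_size=4))
```

Without this re-indexing, evicting tokens while keeping absolute positions would feed the model position values it never saw at those cache offsets during training, which would be consistent with "wrongly-used pos encoding" being the dominant factor.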