The LLaMA backbone ignored the `start_index` parameter when computing the rotary embeddings, which led to numerical issues during generation. This PR fixes that, and also fixes the reverse embedding layer in both Mistral and LLaMA: the reverse embedding stage now runs in `compute_dtype` instead of full precision. This is how HF does it, so it helps bring the numerics closer.
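For illustration, here is a minimal Keras 3 sketch of both changes. The names `ReversibleEmbedding` and `rotate_qk` are illustrative, not necessarily the exact identifiers this PR touches, and the rotary layer is assumed to accept a `start_index` argument in its call:

```python
import keras
from keras import ops


class ReversibleEmbedding(keras.layers.Embedding):
    """Embedding that can also project hidden states back to vocab logits."""

    def call(self, inputs, reverse=False):
        if reverse:
            # Run the reverse projection in compute_dtype (e.g. bfloat16)
            # rather than full float32, matching HF's numerics.
            kernel = ops.cast(self.embeddings, self.compute_dtype)
            return ops.matmul(inputs, ops.transpose(kernel))
        return super().call(inputs)


def rotate_qk(rotary_layer, query, key, start_index=0):
    # Thread start_index through to the rotary embedding so that, during
    # cached generation, each new token gets the angles for its actual
    # position instead of always starting from position 0.
    return (
        rotary_layer(query, start_index=start_index),
        rotary_layer(key, start_index=start_index),
    )
```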