Hi, thanks for the great work.
I noticed that you have FP8 checkpoints for llama3.1-405b. However, in Section 6.2 of the Llama 3 technical report, the team mentions that they needed workarounds to quantize the largest 405B model effectively: for example, they skip quantization in the attention layers and in the first and last Transformer blocks, quantizing only the feedforward projections of the interior blocks.
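For concreteness, the report's workaround amounts to a per-module skip list. Here is a minimal sketch of such a filter, assuming the standard Hugging Face Llama module naming and the 126 decoder layers of the 405B model; the `should_quantize` helper is illustrative, not a claim about your actual implementation:

```python
# Minimal sketch of the skip list described in the Llama 3 report (Sec. 6.2),
# assuming Hugging Face-style module names; helper and names are illustrative.
import re

def should_quantize(name: str, num_layers: int = 126) -> bool:
    """Return True if the named Linear module should be FP8-quantized."""
    m = re.match(r"model\.layers\.(\d+)\.", name)
    if m is None:
        return False  # embeddings, lm_head, etc. stay in higher precision
    layer_idx = int(m.group(1))
    if layer_idx in (0, num_layers - 1):
        return False  # skip the first and last Transformer blocks
    if ".self_attn." in name:
        return False  # skip attention projections; quantize only the MLP
    return True

# Example: only interior-block MLP projections pass the filter.
for name in [
    "model.layers.0.mlp.gate_proj",      # first block  -> skipped
    "model.layers.63.self_attn.q_proj",  # attention    -> skipped
    "model.layers.63.mlp.down_proj",     # interior MLP -> quantized
]:
    print(name, should_quantize(name))
```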
I wonder if you have seen similar issues when quantizing llama3.1-405b.