Hi, thanks for the great work.
I noticed that you have FP8 checkpoints for llama3.1-405b. However, in Section 6.2 of the Llama 3 technical report, the team mentions that they needed workarounds to quantize the largest 405B model effectively: for example, they skip quantization in the attention layers and in the first and last Transformer blocks, quantizing only the feedforward projections of the interior blocks.
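For concreteness, the report's workaround amounts to a per-module skip list. Here is a minimal sketch of such a filter, assuming the standard Hugging Face Llama module naming and the 126 decoder layers of the 405B model; the `should_quantize` helper is illustrative, not a claim about your actual implementation:

```python
# Minimal sketch of the skip list described in the Llama 3 report (Sec. 6.2),
# assuming Hugging Face-style module names; helper and names are illustrative.
import re

def should_quantize(name: str, num_layers: int = 126) -> bool:
    """Return True if the named Linear module should be FP8-quantized."""
    m = re.match(r"model\.layers\.(\d+)\.", name)
    if m is None:
        return False  # embeddings, lm_head, etc. stay in higher precision
    layer_idx = int(m.group(1))
    if layer_idx in (0, num_layers - 1):
        return False  # skip the first and last Transformer blocks
    if ".self_attn." in name:
        return False  # skip attention projections; quantize only the MLP
    return True

# Example: only interior-block MLP projections pass the filter.
for name in [
    "model.layers.0.mlp.gate_proj",      # first block  -> skipped
    "model.layers.63.self_attn.q_proj",  # attention    -> skipped
    "model.layers.63.mlp.down_proj",     # interior MLP -> quantized
]:
    print(name, should_quantize(name))
```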
I wonder if you have seen similar issues when quantizing llama3.1-405b.