Open phil71x opened 4 months ago
Thanks for flagging @phil71x! Note that the table is not wrong, but rather reflects the behaviour of the `memory_extreme` preset: `memory_extreme` will use all available GPU memory, and offload weights to the CPU if the entire model doesn't fit. Since the GPU used to gather the numbers (A100 80GB) fits the 9b model entirely, all of the weights are on the GPU, and thus there is no CPU offload. In this case, `memory_extreme` collapses to `memory`. Only if the model weights didn't fit on the GPU would we see a difference here.
See this section for details: https://github.com/huggingface/local-gemma#preset-details
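To make the collapse concrete, here is a minimal sketch of that decision rule. This is a hypothetical helper for illustration only (`resolve_preset`, the sizes, and the return strings are assumptions, not local-gemma's actual code): CPU offload kicks in only when the whole model doesn't fit in free GPU memory, so on a big enough GPU `memory_extreme` behaves exactly like `memory`.

```python
def resolve_preset(model_size_gb: float, gpu_free_gb: float) -> str:
    """Hypothetical illustration of the memory_extreme behaviour:
    offload weights to the CPU only when the entire model does not
    fit on the GPU."""
    if model_size_gb <= gpu_free_gb:
        # Model fits entirely on the GPU -> no CPU offload;
        # memory_extreme collapses to memory.
        return "memory"
    # Model too large for the GPU -> spill the remainder to CPU RAM.
    return "memory_extreme"

# An A100 80GB comfortably fits the 9b model's weights (assumed
# ~18GB in bf16), so no offload happens in the benchmark setup.
print(resolve_preset(18.0, 80.0))
```

On a smaller GPU (say 16GB free), the same call would return `"memory_extreme"` and the two presets would diverge in the table.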
The 3rd line is a duplicate of the 2nd line.