tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

Reduce RAM consumption on loading #3

Closed · pamparamm closed this 1 year ago

pamparamm commented 1 year ago

This change reduces RAM consumption during checkpoint loading. Tested on the 7B and 13B models.
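The before/after numbers below are consistent with copying checkpoint tensors into the pre-allocated model one at a time and freeing each checkpoint entry as soon as it has been copied, so the full checkpoint and the full model never coexist in RAM. Here is a minimal sketch of that technique, assuming a standard PyTorch state dict; the function name and `ckpt_path` argument are illustrative, not this repository's actual code:

```python
# Sketch: load a checkpoint without holding two full copies of the
# weights in RAM at once. Assumes `model` is already allocated on the
# host. Names here are illustrative, not the repo's real identifiers.
import gc
import torch

def load_checkpoint_low_ram(model: torch.nn.Module, ckpt_path: str) -> None:
    # Load the checkpoint onto the CPU rather than a device.
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    own_state = model.state_dict()
    for name in list(checkpoint.keys()):
        if name in own_state:
            # Copy the tensor into the pre-allocated parameter in place...
            own_state[name].copy_(checkpoint[name])
        # ...then immediately drop the checkpoint's copy, so peak RAM
        # stays near one model's worth of weights instead of two.
        del checkpoint[name]
    del checkpoint
    gc.collect()
```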

Without this fix (7B model):

Allocating transformer on host
Loading checkpoint 0
Max RAM during loading: 25310.46875 MiB
Loaded in 15.42 seconds with 7.87 GiB

With this fix:

Allocating transformer on host
Loading checkpoint 0
Max RAM during loading: 13405.296875 MiB
Loaded in 13.01 seconds with 7.87 GiB
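For anyone wanting to reproduce the "Max RAM during loading" figure, one way to read the process's peak resident set size is via the standard-library `resource` module; this is a sketch of a plausible measurement, not necessarily how the numbers above were produced:

```python
# Illustrative peak-RSS measurement around checkpoint loading.
import resource

def peak_rss_mib() -> float:
    # ru_maxrss is reported in KiB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# ... load the checkpoint here ...
print(f"Max RAM during loading: {peak_rss_mib()} MiB")
```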