tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

Reduce RAM consumption on loading #3

Closed · pamparamm closed this 1 year ago

pamparamm commented 1 year ago

This change reduces RAM consumption during checkpoint loading. Tested on the 7B and 13B models.
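The before/after numbers below are consistent with copying checkpoint tensors into the pre-allocated model one at a time and freeing each checkpoint entry as soon as it has been copied, so the full checkpoint and the full model never coexist in RAM. Here is a minimal sketch of that technique, assuming a standard PyTorch state dict; the function name and `ckpt_path` argument are illustrative, not this repository's actual code:

```python
# Sketch: load a checkpoint without holding two full copies of the
# weights in RAM at once. Assumes `model` is already allocated on the
# host. Names here are illustrative, not the repo's real identifiers.
import gc
import torch

def load_checkpoint_low_ram(model: torch.nn.Module, ckpt_path: str) -> None:
    # Load the checkpoint onto the CPU rather than a device.
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    own_state = model.state_dict()
    for name in list(checkpoint.keys()):
        if name in own_state:
            # Copy the tensor into the pre-allocated parameter in place...
            own_state[name].copy_(checkpoint[name])
        # ...then immediately drop the checkpoint's copy, so peak RAM
        # stays near one model's worth of weights instead of two.
        del checkpoint[name]
    del checkpoint
    gc.collect()
```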

Without this fix (7B model):

Allocating transformer on host
Loading checkpoint 0
Max RAM during loading: 25310.46875 MiB
Loaded in 15.42 seconds with 7.87 GiB

With this fix:

Allocating transformer on host
Loading checkpoint 0
Max RAM during loading: 13405.296875 MiB
Loaded in 13.01 seconds with 7.87 GiB
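For anyone wanting to reproduce the "Max RAM during loading" figure, one way to read the process's peak resident set size is via the standard-library `resource` module; this is a sketch of a plausible measurement, not necessarily how the numbers above were produced:

```python
# Illustrative peak-RSS measurement around checkpoint loading.
import resource

def peak_rss_mib() -> float:
    # ru_maxrss is reported in KiB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# ... load the checkpoint here ...
print(f"Max RAM during loading: {peak_rss_mib()} MiB")
```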