tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

On branch add_save_load #1

Closed calhounpaul closed 1 year ago

calhounpaul commented 1 year ago

Changes to be committed:

- modified: README.md
- modified: example.py
- modified: requirements.txt

calhounpaul commented 1 year ago

This works on a 3090 with NVIDIA's CUDA 11.7 Docker image. I just had to compile bitsandbytes (bnb) from Tim Dettmers' repository rather than installing it straight from pip.
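(A quick way to confirm a source-built bitsandbytes actually works before running the example script: a minimal sanity-check sketch, not part of this PR, assuming a visible CUDA device.)

```python
# Sanity check for a source-built bitsandbytes install (not part of this PR):
# push a tiny Linear8bitLt layer through the GPU and make sure the int8 matmul runs.
import torch
import bitsandbytes as bnb

print("bitsandbytes", bnb.__version__)

layer = bnb.nn.Linear8bitLt(8, 8, bias=False, has_fp16_weights=False, threshold=6.0).cuda()
x = torch.randn(4, 8, dtype=torch.float16, device="cuda")
print(layer(x).shape)  # torch.Size([4, 8]) if the CUDA kernels loaded correctly
```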

tloen commented 1 year ago

I've been thinking about adding something like this, but I'm not quite sure this code does what it says it does. The weights are only quantized at the model.cuda() step, so when I test this on my local machine I don't actually get any integer weights in the zipfile.
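(A minimal sketch of the behavior described above, assuming the int8 path wraps linear layers with bitsandbytes' Linear8bitLt: the fp16/fp32 weights are converted to int8 only when the module is moved to the GPU, so a state_dict saved before the .cuda() call still contains full-precision tensors.)

```python
# Illustration only: bitsandbytes' Int8Params quantize during the .cuda() call,
# so saving before that point serializes unquantized weights.
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(4096, 4096, bias=False, has_fp16_weights=False)
print(layer.weight.dtype)   # still a floating-point dtype on CPU -- not yet quantized

layer = layer.cuda()        # Int8Params performs the int8 conversion here
print(layer.weight.dtype)   # torch.int8 once the module is on the GPU

# Only a state_dict taken after the .cuda() step contains the int8 tensors.
torch.save(layer.state_dict(), "layer_int8.pt")
```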

calhounpaul commented 1 year ago

Oh, thanks. I moved some things around and it seems to work now. The output weights are nearly the same size, though.
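(One way to check what actually landed in the saved checkpoint is to tally bytes per dtype in the state_dict; if most of the volume is still fp16/fp32, the int8 tensors were not serialized. This is a hypothetical helper, and the file name is a placeholder, not something defined in this PR.)

```python
# Tally checkpoint size by dtype to see whether int8 tensors were actually saved.
import collections
import torch

state = torch.load("quantized_model.pt", map_location="cpu")  # placeholder path
bytes_per_dtype = collections.Counter()
for name, tensor in state.items():
    bytes_per_dtype[tensor.dtype] += tensor.numel() * tensor.element_size()

for dtype, nbytes in bytes_per_dtype.items():
    print(f"{dtype}: {nbytes / 1e9:.2f} GB")
```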