The 2:4 sparse quantized run is waiting on the GPTQ UX changes to merge; currently the original sparsity is not respected by GPTQ. So far I've only confirmed that the script completes with a 1.1b model.
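For context, a minimal sketch of the property the GPTQ fix needs to preserve, in plain PyTorch (`is_24_sparse` is a hypothetical helper for illustration, not part of the codebase): every contiguous group of four weights should keep at least two zeros after quantization.

```python
import torch

def is_24_sparse(weight: torch.Tensor) -> bool:
    """Return True if every contiguous group of 4 values along the last
    dimension contains at least 2 zeros (the 2:4 sparsity pattern)."""
    groups = weight.reshape(-1, 4)          # last dim must be divisible by 4
    zeros_per_group = (groups == 0).sum(dim=1)
    return bool((zeros_per_group >= 2).all())

# prune a random weight to 2:4 by keeping the two largest-magnitude
# entries in each group of four
w = torch.randn(8, 16)
mask = torch.zeros_like(w, dtype=torch.bool).reshape(-1, 4)
idx = w.abs().reshape(-1, 4).topk(2, dim=1).indices
mask.scatter_(1, idx, True)
w_pruned = w * mask.reshape(w.shape)

assert is_24_sparse(w_pruned)  # holds before GPTQ; should still hold after
```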
Waiting on the group quantization correctness fixes before validating a 7b grouped example.
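For reference, a minimal sketch of what group quantization does, written as symmetric group-wise fake-quantization in plain PyTorch; the function, the group size of 128, and 4-bit width are illustrative assumptions, not the library's implementation:

```python
import torch

def quantize_grouped(w: torch.Tensor, group_size: int = 128, num_bits: int = 4):
    """Symmetric group-wise fake-quantization: one scale per group of
    `group_size` consecutive weights along the input dimension."""
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)        # requires numel % group_size == 0
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4-bit
    scales = groups.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax)
    return (q * scales).reshape(orig_shape), scales

w = torch.randn(4096, 4096)
w_dq, scales = quantize_grouped(w)       # dequantized weights, per-group scales
print((w - w_dq).abs().max().item())     # worst-case elementwise error
```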
Creating a new examples folder with initial examples for llama7b using ultrachat200k.
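As a rough sketch of what such an example could start from (the checkpoint id and the naive preprocessing below are placeholder assumptions, not the final example contents):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder checkpoint; the real examples use the project's own models
model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# ultrachat200k as published on the Hugging Face Hub
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def preprocess(sample):
    # naive flattening of the chat turns into one string; a real example
    # would apply a proper chat template instead
    text = "\n".join(m["content"] for m in sample["messages"])
    return tokenizer(text, truncation=True, max_length=2048)

ds = ds.map(preprocess, remove_columns=ds.column_names)
```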
Results
Models
Storing the model outputs on the network under:
/network/sadkins
Eval Results
zoo:llama2-7b-ultrachat200k_llama2_pretrain-base (baseline): 10.10
llama7b_w4a16_channel_compressed: 11.09
llama7b_w8a8_channel_dynamic_compressed: 10.17
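Assuming the numbers above are perplexity scores (lower is better), a minimal sketch of how a comparable eval could be computed with plain transformers; the non-overlapping window approach and window size are assumptions, not the exact eval harness used:

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str, window: int = 2048) -> float:
    """Perplexity of a causal LM over `text`, scored in fixed windows."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    total_nll, n_tokens = 0.0, 0
    for start in range(0, ids.size(1), window):
        chunk = ids[:, start:start + window]
        if chunk.size(1) < 2:                 # need at least one prediction
            break
        out = model(chunk, labels=chunk)      # HF shifts labels internally
        total_nll += out.loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1
    return math.exp(total_nll / n_tokens)
```

Running this over the same held-out text for each checkpoint would reproduce a comparison like the list above.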
Missing