pytorch-labs / gpt-fast

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

[quant] Add int8 per token dynamic quant + int4 per group quant for ExecuTorch #102

Closed · jerryzh168 closed this 2 months ago

jerryzh168 commented 7 months ago

Stack from ghstack (oldest at bottom):

Summary: as titled; this adds int8 per-token dynamic activation quantization combined with int4 per-group weight quantization (the 8da4w-gptq mode) for ExecuTorch.

We're adding this for accuracy evaluation. The same code was also added in the ExecuTorch repo; we'll deduplicate later.
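
For context, here is a minimal sketch of what the 8da4w scheme computes, assuming symmetric quantization on both sides. The helper names are illustrative, not gpt-fast's or ExecuTorch's actual API:

```python
import torch

def quantize_activation_per_token_int8(x: torch.Tensor):
    # Dynamic quantization: one scale per token (per row), computed at
    # runtime from the activation itself; symmetric int8 in [-127, 127].
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
    x_q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return x_q, scale

def quantize_weight_per_group_int4(w: torch.Tensor, group_size: int = 32):
    # Weight quantization: one scale per contiguous group of `group_size`
    # input channels; symmetric int4 in [-8, 7].
    # Assumes in_features % group_size == 0.
    out_features, in_features = w.shape
    w_grouped = w.reshape(out_features, in_features // group_size, group_size)
    scale = w_grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 7.0
    w_q = torch.clamp(torch.round(w_grouped / scale), -8, 7).to(torch.int8)
    return w_q, scale  # int4 values held in an int8 container

def linear_8da4w_reference(x: torch.Tensor, w_q: torch.Tensor, w_scale: torch.Tensor):
    # Reference path: quantize activations on the fly, dequantize both
    # operands, then matmul. Real kernels fuse these steps.
    x_q, x_scale = quantize_activation_per_token_int8(x)
    x_dq = x_q.to(torch.float32) * x_scale
    w_dq = (w_q.to(torch.float32) * w_scale).reshape(w_q.shape[0], -1)
    return x_dq @ w_dq.t()
```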

Test Plan:

quantization:

python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode 8da4w-gptq --calibration_tasks wikitext --calibration_limit 5

This finished in 20+ minutes on my machine. If you change --calibration_limit to 1, it finishes in 10+ minutes, but expect worse quality since we do less calibration (use this setting when debugging a new quantization experiment).
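
Roughly why --calibration_limit trades quality for speed: GPTQ uses calibration activations to estimate per-layer second-order statistics that guide the int4 rounding, so fewer samples mean a noisier estimate. A hedged sketch with illustrative names, not the actual GPTQ implementation in this repo:

```python
from typing import Optional
import torch

def accumulate_hessian(H: Optional[torch.Tensor], x: torch.Tensor) -> torch.Tensor:
    # GPTQ-style statistic: accumulate x^T x over calibration batches for
    # one linear layer; x is [tokens, in_features]. The accumulated H
    # guides which int4 rounding choices minimize output error.
    x = x.float()
    return x.t() @ x if H is None else H + x.t() @ x

# --calibration_limit bounds how many samples feed this accumulation:
# more samples give a better estimate (and better rounding decisions)
# at the cost of a longer quantization run.
```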

evaluation:

python eval.py --checkpoint_path checkpoints/$MODEL_REPO/model_8da4w-gptq.g32.pth --tasks wikitext

This should be fast; the result I'm getting is:

wikitext: {'word_perplexity,none': 10.15655335078972, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.5726497149737177, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6531973670369153, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
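
As a quick sanity check on the reported numbers (independent of this PR's code), byte_perplexity and bits_per_byte are two encodings of the same quantity:

```python
import math

bits_per_byte = 0.6531973670369153    # from the eval output above
byte_perplexity = 2 ** bits_per_byte  # == exp(bits_per_byte * ln 2)
assert math.isclose(byte_perplexity, 1.5726497149737177, rel_tol=1e-9)
```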

jerryzh168 commented 6 months ago

We're going to add this to torchao instead.