pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

fixing over padding and GPTQ padding bug #100

Closed: HDCharles closed 4 months ago

HDCharles commented 4 months ago

Stack from ghstack (oldest at bottom):

Summary: We don't always need to pad to 1024; the inner dim only needs to be divisible by groupsize and by inner_k_tiles * 16. Removed the padding flag from the QuantizedLinear module, since that module should always pad when needed, and renamed padding to padding_allowed in QuantHandler for clarity, since padding was previously doing two jobs (whether padding is allowed vs. whether this module is padded).
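A minimal sketch of the relaxed padding rule described above (the helper names find_multiple, can_use_int4_kernel, and padded_in_features are illustrative, not necessarily the exact functions in quantize.py):

```python
import math

def find_multiple(n: int, k: int) -> int:
    """Smallest multiple of k that is >= n."""
    return n if n % k == 0 else n + k - (n % k)

def can_use_int4_kernel(in_features: int, groupsize: int, inner_k_tiles: int) -> bool:
    # The int4 kernel only needs the inner dim to divide evenly by both
    # groupsize and inner_k_tiles * 16 -- not to be a multiple of 1024.
    return in_features % groupsize == 0 and in_features % (inner_k_tiles * 16) == 0

def padded_in_features(in_features: int, groupsize: int, inner_k_tiles: int) -> int:
    if can_use_int4_kernel(in_features, groupsize, inner_k_tiles):
        return in_features  # already fine, no padding needed
    # Pad only as much as needed to satisfy both divisibility constraints.
    return find_multiple(in_features, math.lcm(groupsize, inner_k_tiles * 16))
```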

Test Plan:

python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4
python eval.py --checkpoint_path checkpoints/$MODEL_REPO/model_int4.g32.pth --tasks wikitext --limit 5

python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4-gptq --calibration_tasks wikitext --calibration_limit 5
python eval.py --checkpoint_path checkpoints/$MODEL_REPO/model_int4-gptq.g32.pth --tasks wikitext --limit 5

wikitext: {'word_perplexity,none': 11.49343838017535, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6110947678444059, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6880413587732067, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}

wikitext: {'word_perplexity,none': 11.232339081135366, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6038800882234914, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6815662848152432, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}

Reviewers:

Subscribers:

Tasks:

Tags: