Summary: we don't always need to pad to 1024; the inner dim only needs to be divisible by groupsize and by inner_k_tiles * 16. This PR removes the padding flag from the QuantizedLinear module, since that module should always pad, and renames padding to padding_allowed in QuantHandler for clarity, since padding previously did two jobs (is padding allowed vs. is this module padded).
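The divisibility rule above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `find_multiple` and `maybe_pad_inner_dim` are hypothetical helpers, and the `padding_allowed` flag mirrors the renamed QuantHandler argument.

```python
import math

def find_multiple(n: int, *ks: int) -> int:
    # Smallest multiple of every k in ks that is >= n.
    lcm = math.lcm(*ks)
    return ((n + lcm - 1) // lcm) * lcm

def maybe_pad_inner_dim(in_features: int, groupsize: int,
                        inner_k_tiles: int, padding_allowed: bool) -> int:
    # The inner dim only needs to be divisible by groupsize and by
    # inner_k_tiles * 16 -- it does not always need padding to 1024.
    if in_features % groupsize == 0 and in_features % (inner_k_tiles * 16) == 0:
        return in_features  # already compatible, no padding needed
    if not padding_allowed:
        raise ValueError(
            f"in_features={in_features} is incompatible with "
            f"groupsize={groupsize}, inner_k_tiles={inner_k_tiles}, "
            "and padding is not allowed")
    return find_multiple(in_features, groupsize, inner_k_tiles * 16)
```

For example, with groupsize=32 and inner_k_tiles=8, an inner dim of 4096 needs no padding, while 1000 would be padded up to the next multiple of 128.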
Stack from ghstack (oldest at bottom):
* #83
* #97
Test Plan:
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4
python eval.py --checkpoint_path checkpoints/$MODEL_REPO/model_int4.g32.pth --tasks wikitext --limit 5
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4-gptq --calibration_tasks wikitext --calibration_limit 5
python eval.py --checkpoint_path checkpoints/$MODEL_REPO/model_int4-gptq.g32.pth --tasks wikitext --limit 5
wikitext: {'word_perplexity,none': 11.49343838017535, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6110947678444059, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6880413587732067, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
wikitext: {'word_perplexity,none': 11.232339081135366, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6038800882234914, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6815662848152432, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
Reviewers:
Subscribers:
Tasks:
Tags: