msaroufim opened this issue 5 months ago

8-bit Adam from bitsandbytes. Resources for reference:
@msaroufim Would love to work on this. Do you envision building it out of `torch.compile` and the rest of the torchao primitives? Or are you thinking of integrating the original 8-bit version as a custom CUDA op?
Are you thinking of training-time (static) methods (e.g. the character.ai blog) or post-training (dynamic) methods? For the latter, I think some of the major buckets are KV cache offloading, compression / quantization, and eviction (i.e. token pruning). We could further categorize by methods that compress at the layer, head, token, and hidden-dim level.
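Concretely, the compression / quantization bucket could look something like the sketch below: int8 KV cache entries with one scale per token, quantized on append and dequantized just before attention. The function names and shapes are illustrative, not torchao APIs.

```python
import torch

def quantize_kv(x: torch.Tensor):
    # x: [batch, heads, seq, head_dim]; one absmax scale per (batch, head, token).
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-6) / 127
    return torch.round(x / scale).to(torch.int8), scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor, dtype=torch.float16):
    return q.to(dtype) * scale.to(dtype)

# Quantize on append, dequantize just before the attention matmul.
k = torch.randn(1, 8, 128, 64, dtype=torch.float16)
k_q, k_scale = quantize_kv(k)
k_for_attn = dequantize_kv(k_q, k_scale)
```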
For profiling, I was planning to start from `torch.profiler` as well as extending it for even more fine-grained metrics.

You could start with `int8_weight_only` or the dynamically quantized version. So generally I want us to follow the heuristic of: first try `compile()`, and if that doesn't work then Triton, and if that doesn't work then integrate the original as a custom op. I'd be fine if you need to integrate the original bnb kernel as a custom op if it makes testing in CI easier.
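To make the compile-first route concrete, here's a rough sketch of an 8-bit Adam step written in plain PyTorch and wrapped in `torch.compile`. Everything here is illustrative: linear per-block absmax quantization stands in for bitsandbytes' dynamic quantization map, fp32 params are assumed, and none of the names are existing torchao APIs.

```python
import torch

BLOCK = 256  # per-block group size; assumes numel is a multiple of BLOCK

def quantize_8bit(x: torch.Tensor):
    # Per-block absmax scaling to int8 (stand-in for bnb's dynamic quantization map).
    blocks = x.reshape(-1, BLOCK)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / 127
    return torch.round(blocks / scale).to(torch.int8), scale

def dequantize_8bit(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.float() * scale).reshape(shape)

@torch.compile(fullgraph=True)
def adam8bit_step(p, g, m_q, m_s, v_q, v_s, step,
                  lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Dequantize state, run the usual Adam math in fp32, requantize on the way out.
    m = dequantize_8bit(m_q, m_s, p.shape).lerp_(g, 1 - beta1)
    v = dequantize_8bit(v_q, v_s, p.shape).lerp_(g * g, 1 - beta2)
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    p -= lr * m_hat / (v_hat.sqrt() + eps)
    return quantize_8bit(m), quantize_8bit(v)

# Usage: state lives as (int8 codes, scales); params update in place.
p, g = torch.randn(1024), torch.randn(1024)
m_q, m_s = quantize_8bit(torch.zeros_like(p))
v_q, v_s = quantize_8bit(torch.zeros_like(p))
(m_q, m_s), (v_q, v_s) = adam8bit_step(p, g, m_q, m_s, v_q, v_s, step=1)
```

If this hits graph breaks or poor codegen, that's the signal to drop down to Triton per the heuristic above.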
Having some shared benchmarking helpers in `torchao.utils` or `torchao.benchmark` would do wonders.

@msaroufim RE: profiling:
For metrics, the most important ones are memory bandwidth and flop utilization. A good representative workload for now is probably llama2 and llama3 (https://github.com/pytorch/ao/blob/main/torchao/_models/llama/generate.py), and this script has good metric instrumentation already, so extending it feels natural.
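For a back-of-envelope version of those two metrics (in the spirit of the instrumentation in generate.py, not its actual code; the peak numbers are hardware-specific inputs):

```python
import torch

def model_size_bytes(model: torch.nn.Module) -> int:
    return sum(p.numel() * p.element_size() for p in model.parameters())

def decode_metrics(model, tokens_per_sec, peak_bw_gb_s, peak_flops):
    size_gb = model_size_bytes(model) / 1e9
    n_params = sum(p.numel() for p in model.parameters())
    # Decoding is memory-bound: every weight is read once per token.
    achieved_bw = tokens_per_sec * size_gb          # GB/s
    # ~2 FLOPs per parameter per decoded token for the matmuls.
    mfu = tokens_per_sec * 2 * n_params / peak_flops
    return {
        "bandwidth_gb_s": achieved_bw,
        "bandwidth_util": achieved_bw / peak_bw_gb_s,
        "flop_util": mfu,
    }
```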
And for specific algorithms, I'd be most curious about testing out:

From our README.md:
And so far we've done a good job building out the primitive data types along with their corresponding transformed Linear layers, so given a new `ExoticDtype()` we have a playbook to create `ExoticDtypeLinear()`, and indeed for weight-only transformations this is a perfectly fine workflow and how the majority of quantization libraries operate. For example:
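A minimal sketch of that playbook, with `ExoticDtype` standing in for any new format (plain per-channel int8 weight-only here; all names are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def to_exotic_dtype(w: torch.Tensor):
    # Per-output-channel absmax scaling to int8.
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / 127
    return torch.round(w / scale).to(torch.int8), scale

class ExoticDtypeLinear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        q, scale = to_exotic_dtype(linear.weight.detach())
        self.register_buffer("q_weight", q)
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # Weight-only: dequantize on the fly, compute in the activation dtype.
        w = self.q_weight.to(x.dtype) * self.scale.to(x.dtype)
        return F.linear(x, w, self.bias)

def swap_linears(module: nn.Module):
    # The "playbook": walk the model and swap every nn.Linear for its Exotic twin.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, ExoticDtypeLinear(child))
        else:
            swap_linears(child)
    return module
```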
We can make the above shine with more accessible blogs, performance benchmarks, and integrations with more partners.
However, this is doing somewhat of a disservice to explaining the ao value proposition: we're a dtype library, not a dtype-Linear library, so given a dtype it should be easy for us to do a lot more. So some examples I'd like to see next are:
None of the above is "research"; this is very much the way engineering is moving for inference: https://blog.character.ai/optimizing-ai-inference-at-character-ai/
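One sketch of what "a dtype, not a dtype-Linear" could mean in practice: a wrapper tensor subclass that dequantizes on any op via `__torch_dispatch__`, so the same format can back weights, KV caches, or optimizer state without a bespoke Linear. Note this leans on private PyTorch APIs (`_make_wrapper_subclass`, `torch.utils._pytree`) and every name is illustrative.

```python
import torch
from torch.utils._pytree import tree_map

class ExoticTensor(torch.Tensor):
    # Wrapper subclass: carries an int8 payload + scale, advertises a float dtype.
    @staticmethod
    def __new__(cls, q, scale, dtype):
        return torch.Tensor._make_wrapper_subclass(cls, q.shape, dtype=dtype, device=q.device)

    def __init__(self, q, scale, dtype):
        self.q, self.scale = q, scale

    @classmethod
    def from_float(cls, x: torch.Tensor):
        scale = x.abs().amax().clamp_min(1e-12) / 127
        return cls(torch.round(x / scale).to(torch.int8), scale, x.dtype)

    def dequantize(self) -> torch.Tensor:
        return self.q.to(self.dtype) * self.scale.to(self.dtype)

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Generic fallback: dequantize any ExoticTensor argument and run the op
        # in plain floating point, returning a regular tensor.
        unwrap = lambda t: t.dequantize() if isinstance(t, ExoticTensor) else t
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))

# The same object now works anywhere a tensor does, not just inside a Linear:
t = ExoticTensor.from_float(torch.randn(4, 4))
out = torch.matmul(t, t.dequantize().t())  # a matmul
cat = torch.cat([t, t])                    # shape ops too
```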
Also, given an exotic quantization scheme, I'd like to be more proactive in helping people benchmark their models, so this should include: