Open YJYJLee opened 2 months ago
I'm actually pretty happy to see this bug report lol, we were hoping to enable torchao in gpt-fast and delete the broken quantization flows. We do need a local fork of gpt-fast because we are making model changes, and unfortunately there isn't a good solution for us beyond occasionally syncing with upstream.
So an action item for @HDCharles is to fix the existing code here in AO, but @YJYJLee, would you be open to contributing your patch to gpt-fast directly as well? We're doing a big launch on Sep 21 at the CUDA MODE IRL conference and were hoping to feature an integration with gpt-fast by then. Granted, I would highly recommend you try out the PyTorch nightlies first.
I can look at it, but in reality these are updates to TorchAO, not gpt-fast — i.e., TorchAO's model/generate code is more up to date than gpt-fast's, rather than vice versa.
Thanks for the great work! I tried to enable AutoQuant on top of the latest gpt-fast repository, since the gpt-fast version that the ao repo provides as an example is outdated.
Here is the diff of enabling AutoQuant on top of the latest gpt-fast codebase.
But I'm getting an error: "CUDA generator expects graph capture to be underway, but the current stream is not capturing".
Attaching the env info here
Thanks for the help in advance!
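For reference, the torchao README's documented entry point is `torchao.autoquant(torch.compile(model, mode='max-autotune'))`, applied before the first forward pass. A patch along these lines is a minimal sketch only — the surrounding context lines are stand-ins, not the actual gpt-fast `generate.py` source, whose layout changes between revisions:

```diff
 # generate.py (hypothetical hunk; context lines are placeholders)
+import torchao
 ...
 model = _load_model(checkpoint_path, device, precision)
+# Wrap the compiled model; autoquant selects per-layer quantized kernels
+# by benchmarking on the first forward call.
+model = torchao.autoquant(torch.compile(model, mode="max-autotune"))
```

One caveat, hedged since I haven't reproduced the error: if gpt-fast's decode path captures CUDA graphs, autoquant's calibration forward likely needs to run before graph capture begins, which would be consistent with the "graph capture to be underway" error above.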