Closed: cfifty closed this 1 month ago
@cfifty hi Chris, thank you for this pull request and for the kind words
I just ran your STE variant with rotation through the fashionmnist example and saw codebook utilization go from ~25% (without rotation) to 100% ... congratulations on this finding and for sharing this paper!
@cfifty just when i was about to turn to scalar quantization and not look back :rofl:
@cfifty also, go big red :smile:
Yooo! @lucidrains I had no idea! Always great to meet / interact w/ another Cornellian in Silicon Valley :) Surprisingly high concentration of us out here (Chris Ré, last author on that paper and a Stanford prof, was also a math undergrad at Cornell)
haha, i've known of Chris Ré ever since the flash attention paper, but didn't know he is a fellow alum!
must be confusing to have an advisor with the same first name lol (just noticed that)
congrats again on this paper! could be significant!
Thanks Phil :)
The rotation trick is a new way to propagate gradients through vector quantization layers, different from the straight-through estimator (STE).
See https://arxiv.org/abs/2410.06424
As an aside, this repository was quite helpful for the experiments in that paper; thank you.
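For anyone skimming this thread, here is a minimal PyTorch sketch of the idea, assuming the construction described in the paper (it is not the exact code in this PR): the encoder output `e` is rotated and rescaled onto its nearest codebook vector `q`, with the rotation and scale detached so they act as constants in the backward pass. The forward output still equals `q`, but gradients reach the encoder through the rotation rather than being copied straight through. The rotation is applied implicitly as two Householder reflections, so the full matrix is never materialized. The function name and `eps` handling below are illustrative, not from this PR.

```python
import torch
import torch.nn.functional as F

def rotation_trick(e: torch.Tensor, q: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Rotate-and-rescale e onto its quantized vector q (hypothetical helper).

    e: encoder outputs, shape (batch, dim)
    q: nearest codebook vectors for e, shape (batch, dim)

    The forward pass returns q exactly; the backward pass sees the constant
    linear map (||q|| / ||e||) * R, where R rotates e/||e|| onto q/||q||.
    """
    e_norm = e.norm(dim=-1, keepdim=True).clamp(min=eps)
    q_norm = q.norm(dim=-1, keepdim=True).clamp(min=eps)

    # unit vectors; detached so the rotation is constant w.r.t. gradients
    e_hat = (e / e_norm).detach()
    q_hat = (q / q_norm).detach()

    # R = I - 2 r r^T + 2 q_hat e_hat^T with r = (e_hat + q_hat) / ||e_hat + q_hat||,
    # the composition of two Householder reflections taking e_hat to q_hat
    # (r inherits the detach from e_hat and q_hat)
    r = F.normalize(e_hat + q_hat, dim=-1)

    # apply R to e without materializing the d x d matrix; e enters linearly,
    # so gradients flow through it while r, e_hat, q_hat stay constant
    rotated = (
        e
        - 2 * (e * r).sum(dim=-1, keepdim=True) * r
        + 2 * (e * e_hat).sum(dim=-1, keepdim=True) * q_hat
    )

    # rescale so the output has the codebook vector's norm (scale also detached)
    return (q_norm / e_norm).detach() * rotated
```

Where the STE would return `e + (q - e).detach()`, this drops in at the same point: the values passed downstream are identical to `q`, only the gradient path back to the encoder changes.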