huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

4bit Adam #30172

Open NicolasMejiaPetit opened 4 months ago

NicolasMejiaPetit commented 4 months ago

Feature request

Is there any chance we could get this 4-bit Adam optimizer added to transformers? It has nearly the same performance as 32-bit Adam with a significant drop in VRAM overhead. repo Paper

With this added, QLoRA would be even more memory efficient, and theoretically you should be able to do a full fine-tune (FFT) of a 7B model on a 24 GB card.
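
For reference, Trainer already accepts any torch.optim.Optimizer through its optimizers argument, so a 4-bit Adam class could be dropped in the same way bitsandbytes' 8-bit Adam is used today, without core Trainer changes. A minimal sketch of that wiring (the 4-bit import path is a guess at what the linked repo exposes, not an existing transformers or bitsandbytes API):

```python
# Minimal sketch: handing a low-bit optimizer to the existing Trainer API.
import bitsandbytes as bnb
from datasets import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")  # tiny model just for illustration

# Works today with 8-bit AdamW from bitsandbytes:
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)

# A 4-bit Adam from the linked repo would slot into the same spot, e.g.:
# from lpmm.optim import AdamW as AdamW4bit   # hypothetical import path
# optimizer = AdamW4bit(model.parameters(), lr=2e-5)

# Dummy dataset so the snippet is self-contained.
train_dataset = Dataset.from_dict(
    {"input_ids": [[1, 2, 3, 4]] * 8, "labels": [[1, 2, 3, 4]] * 8}
)

args = TrainingArguments(output_dir="out", per_device_train_batch_size=1, max_steps=5)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    optimizers=(optimizer, None),  # Trainer builds the LR scheduler around the given optimizer
)
trainer.train()
```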

Motivation

The GitHub repo has a paper which shows a negligible difference between 32-bit and 4-bit Adam, and they have the code for the Adam optimizer here: 4bit code
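
To make the memory argument concrete, here is a rough back-of-envelope on the optimizer-state memory alone for a 7B model (my own arithmetic, not numbers from the paper, and ignoring weights, gradients, activations, and quantization metadata):

```python
# Rough estimate of Adam optimizer-state memory for a 7B-parameter model.
# Adam keeps two state tensors per parameter (exp_avg and exp_avg_sq).
params = 7e9
states_per_param = 2

fp32_bytes = params * states_per_param * 4    # 4 bytes per 32-bit value
int4_bytes = params * states_per_param * 0.5  # 0.5 bytes per 4-bit value

print(f"32-bit Adam states: {fp32_bytes / 2**30:.1f} GiB")  # ~52.2 GiB
print(f" 4-bit Adam states: {int4_bytes / 2**30:.1f} GiB")  # ~6.5 GiB
```

That swing of roughly 45 GiB in optimizer state is the whole point: at 32-bit the Adam states alone don't fit on a 24 GB card, while at 4-bit they become a small slice of it.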

Alternatively, there is 1-bit AdamW from DeepSpeed. I only didn't recommend it because, digging through the DeepSpeed code, it isn't laid out as cleanly as this repo, which has a whole dedicated script. You can always use DeepSpeed with transformers, but DeepSpeed comes with its own set of drawbacks, such as Windows compatibility and Unsloth compatibility. Either way, a 1-bit AdamW would be awesome. Aside from that, there aren't too many other ways to save memory for QLoRA. Theoretically, if someone had the will, they could make a BitNet-style AdamW that runs on the CPU: since BitNet doesn't need matmuls, the entire optimizer computation could be offloaded to the CPU and still be fast, so training wouldn't get bogged down.
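
For completeness, DeepSpeed's 1-bit Adam is enabled through the DeepSpeed config rather than a standalone optimizer class, which is part of why it feels less self-contained. A sketch of how it would be wired through the existing transformers integration, with placeholder values (check the DeepSpeed docs for the exact OneBitAdam parameters before relying on this):

```python
# Sketch: enabling DeepSpeed's OneBitAdam via the existing transformers DeepSpeed integration.
# Parameter values below are placeholders, not recommendations.
from transformers import TrainingArguments

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 2e-5,
            "weight_decay": 0.0,
            "freeze_step": 400,          # full-precision warmup steps before compression kicks in
            "cuda_aware": False,
            "comm_backend_name": "nccl",
        },
    },
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    deepspeed=ds_config,  # accepts a config dict or a path to a JSON file
)
```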

Your contribution

Submit feature request

amyeroberts commented 4 months ago

cc @younesbelkada

gau-nernst commented 1 month ago

#31865 👀