Hello.
First of all, thanks for sharing the BitNet training code.
I have a question about GPU memory usage.
As I understand it, BitNet should reduce VRAM usage compared to fp16/bf16 precision.
However, when I comment out this line in train_bitnet.py:

model = apply_bitlinear(model, target_layers=target_layers) # comment this to train og llama

memory usage actually drops by about 2 GB (13 GB with the BitNet layers vs. 11 GB without).
Shouldn't using BitNet result in lower memory usage, not higher?
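
In case it's relevant, here is roughly how I'm measuring peak VRAM. This is a minimal sketch using PyTorch's CUDA memory stats; the model setup and training loop are elided, and apply_bitlinear / target_layers are the ones from train_bitnet.py:

import torch

# reset the peak-memory counter before each run
torch.cuda.reset_peak_memory_stats()

# model = apply_bitlinear(model, target_layers=target_layers)  # toggled between the two runs

# ... a few training steps with the same batch size and sequence length ...

# report the peak allocation for this run
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM: {peak_gb:.1f} GB")

Both runs use identical settings apart from the apply_bitlinear line.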
Thanks.