radna0 opened 3 months ago
Hi, we have highlighted in the paper that we use BitBLAS for those experiments. However, BitBLAS can be challenging to install and is only compatible with NVIDIA GPUs; in fact, we even had to recompile it during our installation process. For those reasons, we haven't merged it into this repo yet. Additionally, because of the different way FuseBitLinear stores weights, there is still some compatibility work that needs to be completed.
We are also working on merging MatmulFreeLLM into the BitBLAS examples. In the meantime, you can try Bitnet's example to achieve a similar level of VRAM reduction, which should be comparable to our model.
I see, so we would still have to wait for the repo to fully work with BitBLAS before we can reproduce the results from the paper or do training, right?
For training it is okay, since we have integrated Triton into our current repo, so you can still enjoy the accelerated training; for inference, maybe not…
Wait, so you could still train a model and get faster training plus the VRAM reduction? It just doesn't work for inference? I might be wrong here, but how would we evaluate the model during and after training for the losses and outputs?
A little bit of context: I want to train a video generative model.
You can refer to (a) and (b); these two figures show how our FuseBitLinear helps reduce memory usage and improve training speed (in a pure MLP setting).
I tried running the following code, with just the `ridger/MMfreeLM-1.3B` model initialized:
With another terminal open running `watch rocm-smi`, it showed 68% VRAM usage, meaning about 5.5 GB.
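As a quick sanity check on those numbers (a back-of-the-envelope sketch, assuming `rocm-smi` reports utilization as a fraction of total VRAM; the card's capacity is not stated above):

```python
# If 68% utilization corresponds to ~5.5 GB used, the implied total VRAM is
# 5.5 / 0.68 ≈ 8.1 GB, i.e. consistent with an 8 GB GPU.
used_gb = 5.5        # approximate usage reported alongside rocm-smi
utilization = 0.68   # 68% VRAM utilization shown by rocm-smi
total_gb = used_gb / utilization
print(f"Implied total VRAM: {total_gb:.1f} GB")
```

Note that ~5.5 GB for just initializing a 1.3B-parameter model would suggest the weights are being held in a full-precision format rather than a packed ternary one.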
Doesn't that contradict what was said in the paper?