microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
MIT License

Loading BitBlasLinear takes a lot of Time #152

Open MekkCyber opened 2 months ago

MekkCyber commented 2 months ago

Hello @LeiWang1999

I am trying to use the BitNet modeling code in another project so I can use the BitBLAS kernels. When I load the model and replace its linear layers with BitBlas Linear layers, the _get_or_create_bitblas_operator function takes a long time to compile kernels for each weight shape: for a model with 32 layers, a hidden size of 4096, and an intermediate size of 14336, it takes ~8 min. Is this the intended behaviour? Thank you for your help.
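
For reference, a minimal sketch of the replacement step being described, assuming the bitblas.Linear module API (the constructor argument names and dtypes here are illustrative and may differ between BitBLAS versions):

```python
import torch.nn as nn
import bitblas

def replace_linear_with_bitblas(model: nn.Module) -> nn.Module:
    """Recursively swap nn.Linear layers for bitblas.Linear layers."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            # Each new weight shape / dtype combination triggers kernel
            # compilation and tuning the first time it is seen, which is
            # where the reported ~8 min is spent.
            bitblas_layer = bitblas.Linear(
                in_features=module.in_features,
                out_features=module.out_features,
                bias=module.bias is not None,
                A_dtype="float16",  # activation dtype (assumed)
                W_dtype="int2",     # low-bit weight dtype for BitNet (assumed)
            )
            setattr(model, name, bitblas_layer)
        else:
            replace_linear_with_bitblas(module)
    return model
```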

LeiWang1999 commented 2 months ago

Hi @MekkCyber, yeah, when bitblas encounters a kernel configuration for the first time, it performs the compilation and stores the result in a database, which is located by default at ~/.cache/bitblas. The next time it encounters the same configuration, it retrieves the precompiled library directly from the database, bypassing the tuning process.

As a result, tuning only occurs the first time a specific model and its layer configurations are encountered :)
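
A minimal sketch of this lookup/compile/store flow, modeled on the _get_or_create_bitblas_operator pattern; the cache helpers shown here (global_operator_cache, get_database_path, auto_detect_nvidia_target) follow the BitBLAS integration examples and may differ between versions:

```python
from bitblas import Matmul, MatmulConfig, auto_detect_nvidia_target
from bitblas.cache import global_operator_cache, get_database_path

BITBLAS_TARGET = auto_detect_nvidia_target()
BITBLAS_DATABASE_PATH = get_database_path()  # defaults under ~/.cache/bitblas

def get_or_create_bitblas_operator(config: MatmulConfig) -> Matmul:
    # Populate the in-memory cache from the on-disk database once.
    if global_operator_cache.size() == 0:
        global_operator_cache.load_from_database(BITBLAS_DATABASE_PATH, BITBLAS_TARGET)

    matmul = global_operator_cache.get(config)
    if matmul is None:
        # Cache miss: compile and tune the kernel (slow, first run only),
        # then persist it so later runs skip tuning entirely.
        matmul = Matmul(config, target=BITBLAS_TARGET, enable_tuning=True)
        global_operator_cache.add(config, matmul)
        global_operator_cache.save_into_database(BITBLAS_DATABASE_PATH, BITBLAS_TARGET)
    return matmul
```

With this flow, the ~8 min cost is paid only on the first run for a given set of kernel configurations; subsequent loads on the same machine hit the database and return almost immediately.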

LeiWang1999 commented 2 months ago

We’re also considering shipping pre-saved compilation results for different hardware setups so tuning can be bypassed entirely, but this is challenging and may take some time to design and implement :)

MekkCyber commented 2 months ago

Thanks a lot @LeiWang1999, much clearer now.