hayabhay / frogbase

Transform audio-visual content into navigable knowledge.
https://frogbase.dev
MIT License

CPU Dynamic Quantization #20

Closed MiscellaneousStuff closed 1 year ago

MiscellaneousStuff commented 1 year ago

Would it be possible for you guys to add an option to enable dynamic quantization of the model when it's being run on a CPU? This would greatly improve the runtime speed of the OpenAI Whisper model (CPU-only) with minimal to no loss in transcription accuracy.

The benchmarks for this are available here.

The implementation only requires adding a few lines of code using features which are already built into PyTorch.

Implementation

Quantization of the Whisper model requires changing the custom Linear() layers within the model to plain nn.Linear(). This is because you need to specify which layer types to dynamically quantize, such as:

quantized_model = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

However, the Whisper model is designed to be adaptable, i.e. it can run at different precisions, so the Linear() layer contains custom code to account for this. That custom code is not required for the quantized model. You can either change the Linear() layers in "/whisper/whisper/model.py" yourself (i.e. create a fork of OpenAI-Whisper which would be compatible with future merges), or you can use mine from here.
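
For reference, end-to-end usage would look something like the minimal sketch below. It assumes the patched fork is installed (so the model's Linear() layers are plain nn.Linear()), and uses the base model and a hypothetical audio.mp3 as input:

import torch
import whisper

# Load on CPU; PyTorch dynamic quantization only targets CPU execution.
model = whisper.load_model("base", device="cpu")

# Store nn.Linear weights as int8 and quantize activations on the fly;
# everything else in the model stays in fp32.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

result = quantized_model.transcribe("audio.mp3")  # hypothetical input file
print(result["text"])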

hayabhay commented 1 year ago

Could this be done by swapping the whisper packages underneath?

- pip install openai-whisper
+ pip install git+https://github.com/MiscellaneousStuff/whisper.git

MiscellaneousStuff commented 1 year ago

Yep. That submodule is exactly the same as the original but swaps the Linear() layer for nn.Linear(). However, it also means that anyone wanting to run the model at half precision on GPU won't be able to, so that custom whisper module should only be used for dynamic quantisation on CPU.
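
For context, upstream whisper/model.py defines roughly the subclass below, which casts the weights to the input's dtype so the model can run at half precision. The fork replaces it with plain nn.Linear, since quantize_dynamic swaps modules by exact type lookup and would skip the subclass:

from torch import Tensor, nn
import torch.nn.functional as F

# Roughly what upstream whisper/model.py does: cast the fp32 weights
# (and bias) to the input dtype so fp16 inputs work on GPU.
class Linear(nn.Linear):
    def forward(self, x: Tensor) -> Tensor:
        return F.linear(
            x,
            self.weight.to(x.dtype),
            None if self.bias is None else self.bias.to(x.dtype),
        )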

hayabhay commented 1 year ago

Great! In that case, I'll add a note to the Readme to swap out whisper for your fork for anyone intending to run it on a CPU-only machine. Thanks!

hayabhay commented 1 year ago

Updated Readme here: 0431dee2eedac62c6ddae96c2145d801ffee3c15

menelic commented 1 year ago

Doing what is recommended in the Readme does not work:

Note: If you're using a CPU-only machine, your runtime can be sped up by using the quantization implemented by @MiscellaneousStuff: swap out pip install openai-whisper from requirements.txt and replace it with their fork pip install git+https://github.com/MiscellaneousStuff/whisper.git (See related discussion here - https://github.com/hayabhay/whisper-ui/issues/20)

What exactly has to be put in requirements.txt?
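
For reference, pip requirements files accept VCS URLs directly, so the intended change is presumably to replace the openai-whisper line in requirements.txt with:

git+https://github.com/MiscellaneousStuff/whisper.git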