Closed RazeBerry closed 2 months ago
Small update on the issue: when I use DQ and it runs without error, transcription ironically becomes ~40% slower. The medium model on an M2 Pro CPU transcribes at about 1.7 sec/s without DQ and about 1 sec/s with DQ.
There are many factors that can cause the model to run slower with DQ. The quantization overhead can outweigh the gains (e.g. DQ only ran faster for the large models in my testing). The slower start-up can also make it seem slower for short audio tracks. But DQ should run with a smaller memory footprint than normal.
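To tell start-up cost apart from actual transcription speed, the two can be timed separately. The following is only a rough sketch: `load_model` and `transcribe` are hypothetical callables standing in for whatever loads the model (with or without DQ) and runs the transcription, and `audio_seconds` is the length of the test clip.

```python
import time


def benchmark(load_model, transcribe, audio_seconds):
    """Time model start-up and transcription separately.

    load_model and transcribe are placeholder callables (assumptions),
    not part of any library API.
    """
    t0 = time.perf_counter()
    model = load_model()          # start-up: loading plus any quantization
    t1 = time.perf_counter()
    transcribe(model)             # the transcription run itself
    t2 = time.perf_counter()

    run_s = t2 - t1
    return {
        "load_s": t1 - t0,
        "run_s": run_s,
        # seconds of audio processed per wall-clock second (higher is faster)
        "audio_s_per_s": audio_seconds / run_s if run_s > 0 else float("inf"),
    }
```

Running it once with DQ and once without, on both a short and a long clip, should show whether the slowdown comes from start-up or from the per-second throughput.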
@jianfch I have submitted a pull request which fixes it, please review: https://github.com/jianfch/stable-ts/pull/341
When I use the DQ parameter in the latest version of stable-ts, it invariably raises: "RuntimeError: Didn't find engine for operation quantized::linear_prepack NoQEngine". I have since asked on the PyTorch GitHub issues, and apparently this happens because, in an ARM environment, any code that uses quantized linear layers must specify the quantized engine:
The first line must be included along with the second line; otherwise it always throws the RuntimeError. I hope this can be fixed. Thank you! Also, I am not sure whether this is a standalone fix or part of a bigger problem with PyTorch.
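For reference, here is a minimal sketch of that kind of workaround. It is not the code from the pull request. It assumes the usual PyTorch dynamic quantization call (`torch.quantization.quantize_dynamic`) and selects a backend through `torch.backends.quantized.engine` before quantizing. The helper names `pick_quantized_engine` and `quantize_linear_layers` are made up for illustration.

```python
import platform


def pick_quantized_engine(supported_engines, machine=None):
    """Pick a quantized backend name, or None if none is usable.

    Hypothetical helper: prefers qnnpack on ARM and fbgemm elsewhere.
    """
    machine = (machine or platform.machine()).lower()
    is_arm = machine.startswith(("arm", "aarch64"))
    preferred = ["qnnpack", "fbgemm"] if is_arm else ["fbgemm", "qnnpack"]
    for name in preferred:
        if name in supported_engines:
            return name
    return None


def quantize_linear_layers(model):
    """Dynamically quantize the Linear layers of a PyTorch model."""
    import torch  # imported here so the helper above works without PyTorch

    engine = pick_quantized_engine(torch.backends.quantized.supported_engines)
    if engine is None:
        raise RuntimeError("no quantized engine available in this PyTorch build")

    # The engine must be set before quantizing; leaving it unset on ARM is
    # what appears to trigger the NoQEngine error described above.
    torch.backends.quantized.engine = engine
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```

On an Apple Silicon machine `platform.machine()` reports `arm64`, so the sketch would select `qnnpack` whenever the installed PyTorch build lists it as supported.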