I would like to use this library for in-browser ML inference because, with the upcoming CPU support, it is better than:

- ggml (llama.cpp/whisper.cpp): Ratchet supports both CPU and GPU, so it can use the GPU on devices where WebGPU is available, giving better performance
- web-llm (which is WebGPU-only): Ratchet (will) have a CPU backend, allowing inference on devices where WebGPU is not supported (many Android browsers)
- ONNX Runtime Web: Ratchet is lighter than ONNX Runtime Web
However, all three of them support 4-bit quantization, whereas (apparently) Ratchet only supports 8-bit quantization. 4-bit quantization is essential: without it, it is impossible to run whisper-v3-turbo and llama-3.2-1b in a browser with limited RAM. So, please support 4-bit quantization soon.
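For a rough sense of why this matters, here is a back-of-the-envelope sketch of weight memory at 8-bit vs 4-bit (parameter counts are approximate, and this ignores KV cache, activations, and the per-block scale/zero-point overhead that real quantization formats add):

```rust
/// Rough weight-memory estimate for a model at a given quantization bit width.
/// Parameter counts are approximate; weights only, no KV cache or activations.
fn weight_bytes(params: f64, bits: f64) -> f64 {
    params * bits / 8.0
}

fn main() {
    let models = [("llama-3.2-1b", 1.24e9), ("whisper-v3-turbo", 0.81e9)];
    for (name, params) in models {
        let q8 = weight_bytes(params, 8.0) / 1e9;
        let q4 = weight_bytes(params, 4.0) / 1e9;
        println!("{name}: ~{q8:.2} GB at 8-bit vs ~{q4:.2} GB at 4-bit");
    }
}
```

Roughly halving the weight footprint is the difference between these models fitting or not fitting in the memory a mobile browser tab is allowed to use.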