Closed · mdegans closed this 1 month ago
There doesn't seem to be a way to check programmatically. In ggml-metal.m there's a gigantic switch statement whose default case is to panic. This could be moved to a helper function and added to the public API. I'd rather not add the code to drama_llama, since every time a kernel gets added I'd have to make a change. As it is, updating the bindings is just a matter of updating the submodule, building, and fixing the odd breaking API change.
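To illustrate the idea, here is a minimal sketch in Rust of what such a helper could look like. The op enum and variants below are hypothetical stand-ins, not ggml's actual types; the point is only the pattern of returning a bool from the dispatch switch instead of panicking in its default case:

```rust
// Hypothetical op kinds standing in for ggml's real op enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    MatMul,
    SoftMax,
    Bf16MatMul, // not implemented on this hypothetical backend
}

/// Instead of panicking in the kernel-dispatch switch, a helper like
/// this reports support up front so callers can fall back gracefully.
fn backend_supports_op(op: Op) -> bool {
    match op {
        Op::MatMul | Op::SoftMax => true,
        Op::Bf16MatMul => false,
    }
}

fn main() {
    assert!(backend_supports_op(Op::MatMul));
    assert!(!backend_supports_op(Op::Bf16MatMul));
    println!("ok");
}
```

Exposed through the public API, a check like this would let bindings query support without tracking each new kernel.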
This is now fixed in llama.cpp by this PR.
It turns out there was a static function to check whether an op is supported, but it was returning true for bf16 on Metal even though bf16 is not implemented there. As a result it hit the assert. The behavior now is for unsupported layers to run on the CPU. This is slower, but it should be faster if we increase the number of threads the drama_llama::Engine can use. Right now I think it defaults to 1. We could change that default to the number of virtual CPUs, or perhaps the number of performance cores if that can be determined easily.
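A minimal sketch of how that default could be computed with the Rust standard library (the actual Engine field and where this default would be wired in are not shown here; `available_parallelism` reports logical CPUs, not performance cores specifically):

```rust
use std::thread;

/// Default thread count: the number of logical CPUs reported by the OS,
/// falling back to 1 if it cannot be determined.
fn default_thread_count() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

fn main() {
    let n = default_thread_count();
    assert!(n >= 1);
    println!("threads: {}", n);
}
```

Distinguishing performance cores from efficiency cores would need a platform-specific query (e.g. a sysctl on macOS), which is why the logical-CPU count is the easier default.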
Fixed with version 0.0.3
When a model is unsupported (at least on Metal), an assert in the
llama.cpp/ggml-metal.m
code causes a crash. To fix this, we need a way to check whether a given backend supports a model and fail gracefully. This is a good candidate to add to llama.cpp
itself, since the issue is in the library and we can't handle it in our code without duplicating the potential fix.
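From the binding side, "fail gracefully" could look something like the sketch below: a hypothetical support query (standing in for whatever check llama.cpp ends up exposing) gated behind a Result, so an unsupported model surfaces as an error instead of an assert-driven crash:

```rust
use std::fmt;

/// Error returned when the backend cannot run a model.
#[derive(Debug)]
struct UnsupportedModel(String);

impl fmt::Display for UnsupportedModel {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "backend does not support model: {}", self.0)
    }
}

impl std::error::Error for UnsupportedModel {}

// Hypothetical support query; a real implementation would call into
// whatever check llama.cpp exposes for the active backend.
fn backend_supports(model_type: &str) -> bool {
    model_type != "bf16"
}

/// Return an error instead of letting the backend assert and crash.
fn load_model(model_type: &str) -> Result<(), UnsupportedModel> {
    if !backend_supports(model_type) {
        return Err(UnsupportedModel(model_type.to_string()));
    }
    // ... actual loading would happen here ...
    Ok(())
}

fn main() {
    assert!(load_model("f16").is_ok());
    assert!(load_model("bf16").is_err());
}
```

The names here are illustrative only; the real fix belongs in llama.cpp, with the binding just propagating the error.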