Closed · chen-rn closed this issue 8 months ago
I don't have performance benchmarks for each model on different phones, but my friend has an iPhone 12 mini and I was able to run these models on it:
https://huggingface.co/guinmoon/SantaCoder-1B-GGUF/resolve/main/SantaCoder-1B-Q5_K.gguf
https://huggingface.co/guinmoon/Cerebras-1.3B-ggml/resolve/main/Cerebras-1.3B-ggjtv3-q5_1.bin
I'll give those a try and see how fast they run!
Do you have a more performant phone?
~~I'm running the project out-of-the-box on an iPhone 15 Pro, and am seeing 4 seconds per token on the 7B parameter model "yarn-mistral-7b-128k.Q4_K_S" using the default settings.
This is probably not fast enough to be practical, but perhaps there are some tuning options that could help. A smaller model would of course be faster, or different quantization.~~
(see correction in reply)
Another point of reference on the iPhone 15 Pro using "SantaCoder-1B-Q5_K" generated the following code snippet in 7.2 seconds:
```
#include <vector> // vector is a collection of generic types
class AFactory<T> : public Factory<A<T>> {
protected:
    // the constructor, takes no arguments
    AFactory() : super() {}
public:
    // return a new object instance based on this factory
    virtual ~> A<T> newInst() const; // this is the only difference from the parent
    // create a new child class instance. This should be equivalent to:
    // protected virtual ~> C<T> newChildClass(A<T>) const { return new C<T>(this, arg);const
}
*/
public class AFactory{
    /// The factory instance used in the creation of this object.
    private static AFactory<String> factory;
    public static synchronized AFactory<String> getFactory()
            throws InstantiationException, IllegalAccessException {
        if (factory == null)
            factory = new AFactory();
        return factory;
    }
}
```
I would call this very good performance, and it could be quite useful.
> I'm running the project out-of-the-box on an iPhone 15 Pro, and am seeing 4 seconds per token on the 7B parameter model "yarn-mistral-7b-128k.Q4_K_S" using the default settings.
> This is probably not fast enough to be practical, but perhaps there are some tuning options that could help. A smaller model would of course be faster, or different quantization.
I ran the same test again at a later time and saw a complete reversal: token generation was more like eight tokens per second, much better than the first test. I'm not sure what caused the performance difference; it's possible there was some memory contention originally, and it might be worth playing around with the mlock setting. What it does show is that good performance from a 7B model can still be had on mobile.
Oh whoa! Those are pretty good numbers. Is there a model that comes to mind when it comes to having a great performance/speed ratio?
> Oh whoa! Those are pretty good numbers. Is there a model that comes to mind when it comes to having a great performance/speed ratio?
I think for iPhone Pro models it's Orca 3B or Marx-3B-V2.
> ~~I'm running the project out-of-the-box on an iPhone 15 Pro, and am seeing 4 seconds per token on the 7B parameter model "yarn-mistral-7b-128k.Q4_K_S" using the default settings.
> This is probably not fast enough to be practical, but perhaps there are some tuning options that could help. A smaller model would of course be faster, or different quantization.~~
> (see correction in reply)
7B models quantized above Q3_K_M use extended memory and are extremely slow on iPhones.
And for anyone seeing this now: support was added for the newer Stability 3B models. Rocket 3B seems very good. I also tried the Zephyr-tuned one recently released by Stability and it works well too, but it has no system prompt and emits a lot of extraneous lm_end-type tokens in its responses.
The llama.cpp project has now published performance numbers for the Apple A series:
Hey, first of all, great repo! Especially the TestFlight part, so we can test it without having to set up the dev environment. For instance, I downloaded the app and had Llama 2 7B running in a minute! Kudos 👏
My daily driver is quite weak (iPhone 12 mini), so the inference speed is something like 5–10 seconds per word.
I'm curious whether you have a performance benchmark for each model on the different phones. It could be as simple as a video demonstration.
That would greatly help developers and designers build an intuition for what products are feasible at the moment!