guinmoon / LLMFarm

Run llama and other large language models offline on iOS and macOS using the GGML library.
https://llmfarm.tech
MIT License
1.24k stars · 76 forks

Performance/benchmarks #14

Closed · chen-rn closed this issue 8 months ago

chen-rn commented 11 months ago

Hey, first of all, great repo! Especially the TestFlight part, so we can test it without having to set up the dev environment. I.e., I downloaded the app and had Llama 2 7B running in a minute! Kudos 👏

My daily driver is quite weak (iPhone 12 mini), so inference speed is something like 5-10 seconds per word.

I'm curious whether you have performance benchmarks for each model on different phones. It could be as simple as a video demonstration.

That would greatly help developers and designers build an intuition for what products are feasible at the moment!
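For anyone who wants to gather such numbers themselves, a rough tokens-per-second measurement only needs a timer around the generation loop. Below is a minimal C++ sketch; `generate_next_token()` is a hypothetical stand-in for whatever decode call your runtime exposes, not an actual LLMFarm or llama.cpp API:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical placeholder: replace with the actual per-token decode
// call of your inference runtime.
int generate_next_token() {
    return 0;
}

int main() {
    const int n_tokens = 64;  // sample size for the benchmark
    const auto t_start = std::chrono::steady_clock::now();

    for (int i = 0; i < n_tokens; ++i) {
        generate_next_token();  // one decode step per token
    }

    const auto t_end = std::chrono::steady_clock::now();
    const double seconds =
        std::chrono::duration<double>(t_end - t_start).count();

    // Report both directions, since this thread quotes each form.
    std::printf("%.2f tokens/s (%.2f s/token)\n",
                n_tokens / seconds, seconds / n_tokens);
    return 0;
}
```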

guinmoon commented 11 months ago

I don't have performance benchmarks for each model on different phones, but my friend has an iPhone 12 mini and I was able to run these models on it: https://huggingface.co/guinmoon/SantaCoder-1B-GGUF/resolve/main/SantaCoder-1B-Q5_K.gguf

https://huggingface.co/guinmoon/LLMFarm_Models/resolve/main/q4_1-RWKV-4-Raven-1B5-v12-Eng98%25-Other2%25-20230520-ctx4096.bin

https://huggingface.co/guinmoon/Cerebras-1.3B-ggml/resolve/main/Cerebras-1.3B-ggjtv3-q5_1.bin

chen-rn commented 11 months ago

I'll give those a try and see how fast they run!

Do you have a more performant phone?

TrajansRow commented 11 months ago

~~I'm running the project out-of-the-box on an iPhone 15 Pro, and am seeing 4 seconds per token on the 7B parameter model "yarn-mistral-7b-128k.Q4_K_S" using the default settings.

This is probably not fast enough to be practical, but perhaps there are some tuning options that could help. A smaller model would of course be faster, or different quantization.~~

(see correction in reply)

TrajansRow commented 11 months ago

Another point of reference: on the iPhone 15 Pro, "SantaCoder-1B-Q5_K" generated the following code snippet in 7.2 seconds:

```
#include <vector> // vector is a collection of generic types
class AFactory<T> : public Factory<A<T>> {

    protected:
    // the constructor, takes no arguments
    AFactory() : super() {}

    public:
    // return a new object instance based on this factory
    virtual ~> A<T> newInst() const; // this is the only difference from the parent

    // create a new child class instance. This should be equivalent to:
    // protected virtual ~> C<T> newChildClass(A<T>) const { return new C<T>(this, arg);const
}
*/
public class AFactory{
  /// The factory instance used in the creation of this object.
  private static AFactory<String> factory;

  public static synchronized AFactory<String> getFactory()
      throws InstantiationException, IllegalAccessException {
    if (factory == null)
        factory = new AFactory();

    return factory;
  }
}
```

I would call this very good performance; it could be quite useful.

TrajansRow commented 11 months ago

> I'm running the project out-of-the-box on an iPhone 15 Pro, and am seeing 4 seconds per token on the 7B parameter model "yarn-mistral-7b-128k.Q4_K_S" using the default settings.
>
> This is probably not fast enough to be practical, but perhaps there are some tuning options that could help. A smaller model would of course be faster, or different quantization.

I ran the same test again at a later time and saw a complete reversal of performance: token generation was more like eight tokens per second. This is much better than the first test, and I'm not sure what explains the difference. It's possible that there was some memory contention originally; it might be worth playing around with the mlock setting. What it does show is that good performance from a 7B model can still be had on mobile.
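For context on that setting: mlock asks the OS to pin the model weights in RAM so they are never paged out, which avoids exactly the kind of run-to-run variance described above. A minimal C++ sketch against llama.cpp's C API (the model path is a placeholder; names match llama.h from around the time of this thread, so check your build):

```cpp
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    mparams.use_mlock = true;  // pin weights in RAM so they are never paged out
    mparams.use_mmap  = true;  // still mmap the file; mlock keeps the pages resident

    // Placeholder path for illustration.
    llama_model * model =
        llama_load_model_from_file("yarn-mistral-7b-128k.Q4_K_S.gguf", mparams);
    if (model == nullptr) {
        return 1;  // load failed (bad path, out of memory, ...)
    }

    // ... create a context and generate as usual ...

    llama_free_model(model);
    return 0;
}
```

Note that the OS may refuse to lock a region this large; llama.cpp generally treats a failed mlock as a warning rather than an error, so the model still loads, just without the guarantee.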

chen-rn commented 11 months ago

Oh whoa! Those are pretty good numbers. Is there a model that comes to mind when it comes to having a great performance/speed ratio?

guinmoon commented 11 months ago

> Oh whoa! Those are pretty good numbers. Is there a model that comes to mind when it comes to having a great performance/speed ratio?

I think for iPhone Pro models it's orca 3B or Marx-3B-V2.

guinmoon commented 11 months ago

> ~~I'm running the project out-of-the-box on an iPhone 15 Pro, and am seeing 4 seconds per token on the 7B parameter model "yarn-mistral-7b-128k.Q4_K_S" using the default settings.
>
> This is probably not fast enough to be practical, but perhaps there are some tuning options that could help. A smaller model would of course be faster, or different quantization.~~
>
> (see correction in reply)

7B models larger than q3_K_M use extended memory and are extremely slow on iPhones.
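The arithmetic behind that cutoff is simple: bits-per-weight times parameter count gives a rough RAM footprint, and iOS kills apps that exceed a per-app memory ceiling well below the phone's total RAM. A back-of-the-envelope sketch (the bits-per-weight figures are ballpark values derived from typical GGUF file sizes, not official constants):

```cpp
#include <cstdio>

int main() {
    // Llama-2 "7B" actually has about 6.74e9 parameters.
    const double n_params = 6.74e9;

    // Approximate bits per weight for llama.cpp k-quants (assumed
    // ballpark figures for illustration).
    struct Quant { const char * name; double bpw; };
    const Quant quants[] = {
        { "q3_K_M", 3.91 },
        { "q4_K_S", 4.50 },
        { "q4_K_M", 4.85 },
    };

    const double gib = 1024.0 * 1024.0 * 1024.0;
    for (const Quant & q : quants) {
        const double size = n_params * q.bpw / 8.0 / gib;
        std::printf("%s: ~%.2f GiB of weights\n", q.name, size);
    }
    // On most iPhones the usable per-app ceiling is somewhere around
    // 3-4 GiB (an assumption; Apple does not publish exact limits), so
    // q3_K_M sits near the edge and larger quants spill into paging,
    // which matches the slowdown described above.
    return 0;
}
```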

ShawnFumo commented 9 months ago

> Oh whoa! Those are pretty good numbers. Is there a model that comes to mind when it comes to having a great performance/speed ratio?
>
> I think for iPhone Pro models it's orca 3B or Marx-3B-V2.

And for anyone seeing this now: support was added for the newer Stability 3B models. Rocket 3B seems very good. I also tried the Zephyr-tuned one recently released by Stability and it works well too, but it has no system prompt and its responses contain a lot of extraneous lm_end-type tokens.
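If anyone hits those stray tokens, a crude client-side workaround is to strip the known special markers from the decoded text. A small C++ sketch, assuming the offenders are ChatML-style markers such as `<|im_end|>` (the exact strings depend on the model's tokenizer config, so treat the list below as a placeholder):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Remove every occurrence of each marker string from the text.
std::string strip_markers(std::string text,
                          const std::vector<std::string> & markers) {
    for (const std::string & m : markers) {
        std::string::size_type pos;
        while ((pos = text.find(m)) != std::string::npos) {
            text.erase(pos, m.size());
        }
    }
    return text;
}

int main() {
    // Assumed marker strings for illustration; check the model's
    // tokenizer config for the real special tokens.
    const std::vector<std::string> markers = {
        "<|im_end|>", "<|endoftext|>"
    };
    std::cout << strip_markers("Hello!<|im_end|>", markers) << "\n";
    return 0;
}
```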

TrajansRow commented 9 months ago

The llama.cpp project has now published performance numbers for Apple A-series chips:

https://github.com/ggerganov/llama.cpp/discussions/4508