google-ai-edge / ai-edge-torch

Supporting PyTorch models with the Google AI Edge TFLite runtime.
Apache License 2.0

Tuning text_generator_main.cc performance #155

Open nigelzzz opened 3 weeks ago

nigelzzz commented 3 weeks ago

Description of the bug:

Hi, when I use text_generator_main.cc for inference, I find the performance is slower than ollama or other inference engines. Do you have any suggestions for tuning the model or improving it in text_generator_main.cc? I am using the TinyLlama model.

Actual vs expected behavior:

No response

Any other information you'd like to share?

No response

pkgoogle commented 3 weeks ago

Hi @nigelzzz, can you provide some profiling data for both of these cases? I think we care more about using the model and measuring performance in a real setting (i.e. in a mobile app). So I would say go ahead and use ollama if you feel that suits your needs better for now.

nigelzzz commented 3 weeks ago

Hi @pkgoogle, thanks, I got it. I think TensorFlow is a good framework for embedded systems, so I guess the bottleneck may be the model. Can I ask whether the team has any plans for computation graph optimization?

pkgoogle commented 3 weeks ago

Hi @nigelzzz, "computer graph optimization" is very general and I think the entire point for compiling with the current graph computation paradigm ... so we are always doing computer graph optimization... continuously, unless you meant something different or if you had a more specific question?

nigelzzz commented 3 weeks ago

Hi @pkgoogle, thanks for your response again!! Since I am not familiar with deep learning, if I want to learn computation graph optimization, how can I get more information through ai-edge-torch, e.g., how it is implemented in the ai-edge-torch source code? I think learning it via ai-edge-torch would enhance my knowledge of optimization in the deep learning field.

pkgoogle commented 2 weeks ago

Oh I see, no worries. If you are in school, I highly recommend first checking what internal resources you have available; otherwise, look at the numerous MOOCs/YouTube videos available for deep learning. If you are on Windows I recommend starting with WSL, otherwise stick with your local OS. For AI-Edge-Torch specifically, I recommend you implement, test, and evaluate some basic neural networks on your own with PyTorch and Python: https://pytorch.org/tutorials/beginner/basics/intro.html. I recommend understanding the history before transformers as well.

For graph optimization, understanding the loss function, back propagation, and gradient descent will help you see why backprop optimizes the network towards minimizing the loss function. (This requires some multivariable calculus [backprop is the derivative of the function that the NN computes] and basic linear algebra [mainly to understand the notation].) Currently all training optimization is just that: gradient descent on the function that the NN computes (a function with as many variables as the network has parameters). It doesn't seem like this will change soon, but other techniques to optimize this function may be found in the future.
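To make that last point concrete, here is the update rule that gradient descent applies at every training step, written in standard notation (nothing specific to AI-Edge-Torch): for parameters θ, loss L, and learning rate η,

```latex
% Gradient descent step: move the parameters against the gradient of the loss.
% Backpropagation is simply the chain rule used to compute \nabla_\theta L(\theta_t).
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)
```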

So for AI-Edge-Torch, you also have to understand a little bit of MLIR and compilers: how do you convert a graph representation into a binary that runs on multiple, sometimes heterogeneous, hardware targets, and how do you do this "optimally"? That is the problem that AI-Edge-Torch is, in some sense, solving.

github-actions[bot] commented 1 week ago

Marking this issue as stale since it has been open for 7 days with no activity. This issue will be closed if no further activity occurs.

nigelzzz commented 1 week ago

Hi @pkgoogle, thanks for your suggestion, it's useful to me. I will try to study multivariable calculus (https://ocw.mit.edu/courses/18-02sc-multivariable-calculus-fall-2010/) and try to understand MLIR.

Does ai-edge-torch have any performance report, e.g., how many tokens are decoded per second when using the C++ inference in the example code?

pkgoogle commented 1 week ago

Hi @nigelzzz, at the end of the day AI-Edge-Torch is a PyTorch -> TFLite converter. Most of the time people don't care how long the conversion takes, but they do care how fast the produced model is. Is that what you are looking for? In that case we have the benchmark tool in TFLite: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark.

For decoding tokens per second, I think you will have to time it in C++. If you don't need the highest precision (i.e., you don't actually need to know the exact number of CPU cycles spent in that part of the code), you can just time it yourself with the <chrono> standard library. Otherwise you'll need a profiling tool, like Valgrind. I would research the various tools and their tradeoffs and figure out which works best for your situation.
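For example, a minimal sketch of that kind of wall-clock measurement; decode_one_token() here is a hypothetical placeholder, not the actual text_generator_main.cc API:

```cpp
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for one decode step; replace with the real
// per-token decode call from your inference loop.
void decode_one_token() {
  std::this_thread::sleep_for(std::chrono::milliseconds(10));  // placeholder work
}

int main() {
  const int kNumTokens = 128;  // tokens decoded in this timed run
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kNumTokens; ++i) {
    decode_one_token();
  }
  const auto end = std::chrono::steady_clock::now();

  // Convert the elapsed time to seconds and report tokens per second.
  const double seconds = std::chrono::duration<double>(end - start).count();
  std::cout << "decoded " << kNumTokens << " tokens in " << seconds << " s ("
            << kNumTokens / seconds << " tokens/s)" << std::endl;
  return 0;
}
```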

For multivariable calculus, understanding what a gradient and partial derivatives are is the most important thing today. (Once you understand the backprop proof, or have computed the gradient of a simple NN manually once, that's mainly what's needed for ML, but of course feel free to go further.)
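As a concrete version of that exercise, here is the gradient of a single sigmoid neuron with squared-error loss, worked out with the chain rule (a standard textbook setup, not anything from this repo):

```latex
% One-neuron "network": \hat{y} = \sigma(z), with z = w x + b,
% and squared-error loss L = \tfrac{1}{2}(\hat{y} - y)^2.
\frac{\partial L}{\partial w} = (\hat{y} - y)\,\sigma'(z)\,x,
\qquad
\frac{\partial L}{\partial b} = (\hat{y} - y)\,\sigma'(z),
\qquad \text{where } \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr).
```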

github-actions[bot] commented 2 days ago

Marking this issue as stale since it has been open for 7 days with no activity. This issue will be closed if no further activity occurs.