lrw04 / llama2.c-to-ncnn

A converter for llama2.c legacy models to ncnn models.
MIT License

How does the performance compare on CPU? #1

Closed caofx0418 closed 11 months ago

caofx0418 commented 1 year ago

Hi, how does the performance compare on CPU?

Thank you!

lrw04 commented 1 year ago

I ran it on an Apple M2 with the 13B foundational model, and the generation time for the first 100 tokens averaged 699 ms/token. This is slower than llama.cpp, partly because it does not implement flash attention yet. If there is demand, I might tinker with it a bit more and potentially add that to the converter.

lrw04 commented 1 year ago

Another note: the LinearInt8 layer in the output model only has ARM NEON acceleration for now, so everything ran on the CPU.

caofx0418 commented 1 year ago

> Another note: the LinearInt8 layer in the output model only has ARM NEON acceleration for now, so everything ran on the CPU.

Thank you for your reply. Do you have any plans to write GPU intrinsics for the LinearInt8 layer?

lrw04 commented 1 year ago

I don't plan to write GPU code before I finish my current project.