caofx0418 closed this issue 11 months ago
I ran it on an Apple M2 with the 13B foundational model, and the generation time over the first 100 tokens averaged out to 699 ms/token. This is slower than llama.cpp, in part because it doesn't have flash attention for now. If there is demand, I might tinker with it a bit more and potentially add that to the converter.
Another note: the LinearInt8 layer in the output model has only ARM NEON acceleration for now, so everything ran on the CPU.
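For context, here is a minimal sketch of the kind of NEON int8 dot-product kernel such a LinearInt8 layer typically relies on. This is not the repository's actual implementation; the helper name, signature, and per-tensor scaling scheme are assumptions for illustration.

```c
#include <arm_neon.h>
#include <stdint.h>

// Hypothetical helper: dot product of two int8 vectors of length n
// (assumed to be a multiple of 16), accumulated in int32 and then
// dequantized with per-tensor scales.
static float dot_int8_neon(const int8_t *a, const int8_t *b, int n,
                           float scale_a, float scale_b) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        // Widen to int16 products, then pairwise-accumulate into int32 lanes.
        int16x8_t prod_lo = vmull_s8(vget_low_s8(va), vget_low_s8(vb));
        int16x8_t prod_hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
        acc = vpadalq_s16(acc, prod_lo);
        acc = vpadalq_s16(acc, prod_hi);
    }
    // Horizontal sum of the four int32 lanes, then apply the scales.
    return (float)vaddvq_s32(acc) * scale_a * scale_b;
}
```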
Thank you for your reply. Do you have any plans to write GPU intrinsics for the LinearInt8 layer?
I don't plan to write GPU code before I finish my current project.
Hi, how does the performance compare with the CPU?
Thank you!