iangitonga / tinyllama.cpp

A C++ implementation of tinyllama inference on CPU.
MIT License

Hello, can this code run on a MacBook with an ARM chip? #5

Open zss205 opened 5 months ago

iangitonga commented 5 months ago

Hi, the code should run fine on any ARM chip. I once compiled it on Android and it ran OK. Let me check whether I need to remove any GCC-specific constructs from the codebase. I'll get back to you when I complete the check.
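
For a sense of what "GCC-specific constructs" means here, a guard roughly like the sketch below (hypothetical macro names, not code from this repo) keeps GCC/Clang builtins behind a portable fallback so that MSVC can still compile:

```cpp
// Minimal sketch: wrap a compiler-specific builtin so the same code builds
// with GCC, Clang and MSVC. TL_LIKELY/TL_UNLIKELY are made-up names.
#include <cstdio>

#if defined(__GNUC__) || defined(__clang__)
#define TL_LIKELY(x)   __builtin_expect(!!(x), 1)
#define TL_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else  // e.g. MSVC: no __builtin_expect, fall back to the plain expression
#define TL_LIKELY(x)   (x)
#define TL_UNLIKELY(x) (x)
#endif

int main() {
    int n_tokens = 42;
    if (TL_LIKELY(n_tokens > 0)) {
        std::printf("predicting %d tokens\n", n_tokens);
    }
    return 0;
}
```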

zss205 commented 5 months ago

OK, thank you, looking forward to your reply. There is another question: I tested it and the speed is slower than ggml. Can it be optimized further? Any suggestions?

iangitonga commented 5 months ago

The reason it is slower is that ggml uses CPU-specific vectorization for Intel (x86) and Apple silicon chips. When I was developing this codebase, my priority was getting it to work rather than achieving state-of-the-art CPU performance. As such, I only managed to implement AVX utilities for x86 chips and no optimization at all for Apple silicon. By comparison, ggml implements the newer AVX-512 utilities, which perform far better than AVX since some of those instructions were designed specifically for neural-network workloads.

If you wish, feel free to implement AVX-512 or Apple silicon utilities; if you need any help, I can assist. I am also cleaning up the code to make it more readable and to let it compile with GCC, Clang, and MSVC without any issues.
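
As a rough illustration of what an Apple silicon utility could look like, here is a minimal NEON dot-product sketch (the function name is illustrative, not the repo's API); the vectorized path handles four floats per iteration, with a scalar fallback for other targets:

```cpp
// Minimal sketch of a NEON-accelerated dot product for AArch64 (Apple
// silicon), with a portable scalar fallback. Name is illustrative only.
#include <cstddef>
#include <cstdio>

#if defined(__aarch64__)
#include <arm_neon.h>

float dot_product(const float* a, const float* b, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // acc += a[i..i+3] * b[i..i+3] using a fused multiply-add
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    float sum = vaddvq_f32(acc);            // horizontal add of the 4 lanes
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
    return sum;
}

#else

// Scalar fallback for non-AArch64 targets.
float dot_product(const float* a, const float* b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
#endif

int main() {
    const float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    const float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    std::printf("dot = %f\n", dot_product(a, b, 8));  // expected 120
}
```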

zss205 commented 4 months ago

Awesome, looking forward to your results. I'm going to integrate ggml into this framework and would love to discuss it with you. Now there is another question: when I compile on some machines, such as Linux or an Intel-based MacBook, the output repeats itself and the generation does not stop where it should. Do you have any suggestions?

iangitonga commented 4 months ago

I am not sure why that would happen; it usually runs OK on my Linux Intel system. One thing that can cause it is using the 4-bit model. The fp16 model is the most stable since it is the exact model from TinyLlama. The 8-bit and 4-bit models lose a good amount of accuracy (model perplexity) due to quantization. Quantization works well for models much larger than TinyLlama (1.1B), for instance models with 3B, 7B params, etc. If the issue happens with the fp16 model, send me a screenshot of the input and output.
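
To make the accuracy point concrete, here is a toy round trip through symmetric 4-bit quantization (a sketch only, not the scheme this repo actually uses); the rounding error it prints is the kind of loss that a 1.1B model tolerates far less gracefully than a 7B one:

```cpp
// Toy illustration (not tinyllama.cpp's actual scheme): symmetric 4-bit
// quantization of a block of weights, followed by dequantization.
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    const float weights[8] = {0.12f, -0.95f, 0.33f, 0.70f,
                              -0.08f, 0.51f, -0.27f, 0.99f};

    // The scale maps the largest magnitude onto the signed 4-bit range [-7, 7].
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::fmax(max_abs, std::fabs(w));
    const float scale = max_abs / 7.0f;

    for (float w : weights) {
        int8_t q  = static_cast<int8_t>(std::round(w / scale));  // quantize
        float deq = q * scale;                                    // dequantize
        std::printf("w=%+.3f  q=%+d  back=%+.3f  err=%.4f\n",
                    w, q, deq, std::fabs(w - deq));
    }
    return 0;
}
```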

Also, note that the maximum number of tokens we predict is capped at 768, while the model's maximum is 2048. Basically, after the model predicts 768 tokens it stops, so the output may seem truncated. I will increase the cap to at least 1024 tokens, which should be sufficient for most prompts, and if the issue persists I will raise it to the maximum of 2048, although memory usage will shoot up.
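
For a rough sense of that memory trade-off, the attention key/value cache grows linearly with the context length; the sketch below uses assumed dimensions (illustrative figures, not read from the code) just to show the scaling:

```cpp
// Back-of-the-envelope for why raising the token cap raises memory:
// the K/V cache is proportional to n_ctx. All figures are assumptions
// for illustration, not values taken from tinyllama.cpp.
#include <cstdio>

int main() {
    const int n_layers  = 22;   // assumed layer count for a ~1.1B model
    const int kv_width  = 2048; // assumed per-token K (or V) width in floats
    const int bytes_elt = 2;    // fp16 cache entries

    const int ctxs[] = {768, 1024, 2048};
    for (int n_ctx : ctxs) {
        // K and V caches: n_layers * n_ctx * kv_width elements each
        const long long bytes = 2LL * n_layers * n_ctx * kv_width * bytes_elt;
        std::printf("n_ctx=%4d -> KV cache ~ %.1f MiB\n",
                    n_ctx, bytes / (1024.0 * 1024.0));
    }
    return 0;
}
```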