marella / ctransformers

Python bindings for Transformer models implemented in C/C++ using the GGML library.
MIT License

How to increase inference speed on CPU? #157

Open khanjandharaiya opened 9 months ago

khanjandharaiya commented 9 months ago

Hello,

This is my first time posting to any repository on GitHub, so if I have made any mistake, apologies in advance. 🙏

I am using the "wizardcoder-python-7b-v1.0.Q4_K_M.gguf" model to generate SQL from a given data schema and natural-language question (approx. 720 tokens per prompt). It takes around 10-15 seconds to produce the output on CPU. How do I improve inference speed further on CPU, as I don't have access to a GPU?

Below are the specifications of the system I am using:

- CPU: AMD Ryzen 5 5500U with Radeon Graphics
- Base speed: 2.10 GHz
- Sockets: 1
- Cores: 6
- Logical processors: 12
- Virtualization: Enabled
- L1 cache: 384 KB
- L2 cache: 3.0 MB
- L3 cache: 8.0 MB
- RAM: 16 GB

Any help will be appreciated! Thank you! 🙏 🙏
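On a CPU-only machine, the usual first lever with GGML-based runtimes is matching the thread count to physical cores rather than logical processors, since hyper-threading rarely helps this workload. A minimal sketch of that idea (the helper name `suggested_threads` is my own, not part of ctransformers; the commented-out usage assumes the model file from the question is available locally):

```python
import os


def suggested_threads() -> int:
    # os.cpu_count() reports logical processors (12 on a Ryzen 5 5500U);
    # GGML inference generally scales with physical cores, so halve it.
    logical = os.cpu_count() or 1
    return max(1, logical // 2)


# Hypothetical usage with ctransformers (requires a local GGUF file):
# from ctransformers import AutoModelForCausalLM
# llm = AutoModelForCausalLM.from_pretrained(
#     "wizardcoder-python-7b-v1.0.Q4_K_M.gguf",
#     model_type="llama",
#     threads=suggested_threads(),
# )
```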

AayushSameerShah commented 9 months ago

I wonder if BLAS can help with this... I am not sure; I am looking into the same thing. Any suggestion @marella ?

phoenixthinker commented 9 months ago

For your reference, I am running the code below on my i5 PC without a GPU, and it is fast enough :)

```python
import os
from ctransformers import AutoModelForCausalLM

modelInUse = "codellama-13b-instruct.ggmlv3.Q4_1.bin"

config = {
    'max_new_tokens': 1024,
    'repetition_penalty': 1.1,
    'temperature': 0.1,
    'top_k': 50,
    'top_p': 0.9,
    'stream': True,
    'threads': int(os.cpu_count() / 2),  # adjust for your CPU
}

llm = AutoModelForCausalLM.from_pretrained(
    modelInUse,
    model_type='llama',
    lib='avx2',  # for CPU use
    **config,
)
```
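With `'stream': True` in the config, generation yields tokens one at a time instead of returning the full string, which makes the response feel faster even when total latency is unchanged. A small consumer sketch (the helper name `stream_to_string` is my own; the commented-out call assumes the `llm` object built above):

```python
from typing import Iterable


def stream_to_string(tokens: Iterable[str], echo: bool = False) -> str:
    # Consume a token iterator, such as the generator ctransformers
    # returns when streaming, optionally printing each token as it
    # arrives, and return the concatenated text at the end.
    pieces = []
    for tok in tokens:
        if echo:
            print(tok, end="", flush=True)
        pieces.append(tok)
    return "".join(pieces)


# Hypothetical usage with the llm object from the comment above:
# answer = stream_to_string(llm("Write a SQL query..."), echo=True)
```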