Open · khanjandharaiya opened 9 months ago
I wonder if BLAS can help with this... I am not sure I am looking for the same thing. Any suggestions @marella?
For your reference, I am running the code below on my i5 PC without a GPU, and it is fast enough :)
```python
import os

# AutoModelForCausalLM here is the ctransformers class, not the transformers one
from ctransformers import AutoModelForCausalLM

modelInUse = "codellama-13b-instruct.ggmlv3.Q4_1.bin"

config = {
    'max_new_tokens': 1024,
    'repetition_penalty': 1.1,
    'temperature': 0.1,
    'top_k': 50,
    'top_p': 0.9,
    'stream': True,
    'threads': int(os.cpu_count() / 2),  # adjust for your CPU
}

llm = AutoModelForCausalLM.from_pretrained(
    modelInUse,
    model_type='llama',
    lib='avx2',  # for CPU use
    **config,
)
```
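A note on the `threads` value above: `os.cpu_count()` reports *logical* processors, so halving it approximates the number of physical cores, which is usually a good default for CPU inference. A small helper (my own sketch, not part of the ctransformers API) makes that intent explicit and guards against `os.cpu_count()` returning `None`:

```python
import os
from typing import Optional

def default_threads(logical: Optional[int] = None) -> int:
    """Pick a thread count of roughly one per physical core.

    Assumes SMT with 2 logical processors per core; os.cpu_count()
    reports logical processors, so halve it, with a floor of 1.
    """
    if logical is None:
        logical = os.cpu_count() or 1
    return max(1, logical // 2)

# On a 6-core/12-thread CPU this picks 6 threads.
print(default_threads(12))  # → 6
```

You would then pass `'threads': default_threads()` in the config dict instead of the inline expression.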
Hello,
This is my first time posting in any GitHub repository, so if I have made any mistakes, apologies in advance. 🙏
I am using the "wizardcoder-python-7b-v1.0.Q4_K_M.gguf" model to generate SQL from a given data schema and a natural-language question (the prompt is approximately 720 tokens). It takes around 10-15 seconds to produce output on the CPU. How do I further improve inference speed, given that I don't have access to a GPU?
Below are the specs of the system I am using:

- CPU: AMD Ryzen 5 5500U with Radeon Graphics
- Base speed: 2.10 GHz
- Sockets: 1
- Cores: 6
- Logical processors: 12
- Virtualization: Enabled
- L1 cache: 384 KB
- L2 cache: 3.0 MB
- L3 cache: 8.0 MB
- RAM: 16 GB
Any help will be appreciated! Thank you! 🙏 🙏