Open superchargez opened 1 year ago
I also tried the following: I downloaded the model from TheBloke (Hugging Face) and used it in the following code:
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained('/path/to/wizardcoder.bin', model_type='starcoder')
This works in google colab, though only if you enable GPU.
Is there a way to run it locally, WITHOUT a GPU, please?
Hi, it looks like a memory issue. How much RAM do you have?
I have 16GB RAM and I am using Debian 12, so RAM should not be the issue here. I got it working in free Google Colab, which provides 12GB RAM. Also, the same model ran fine in Kobold (on Windows) on the same machine.
Which file did you download from here? Did you use the same file in Google Colab as well? Are you running it in WSL on your machine? Have you tried running on Windows?
File I used is (smallest one there): https://huggingface.co/TheBloke/WizardCoder-Guanaco-15B-V1.0-GGML/resolve/main/wizardcoder-guanaco-15b-v1.0.ggmlv1.q4_0.bin
How much RAM should it use? (I think it can't run in Colab, even though RAM consumption does not appear to reach the 12GB limit.) When I tried the same file with Kobold on Windows it worked; however, I got the error when I tried it on Linux (with ctransformers).
I will test this again in a few days and report my findings.
Tried to run it with Kobold (on Linux) and got the following error:

System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
gpt2_model_load: loading model from '/home/jawad/Downloads/models/wizardcoder-guanaco-15b-v1.0.ggmlv1.q4_0.bin'
gpt2_model_load: n_vocab = 49153
gpt2_model_load: n_ctx   = 8192 (2048)
gpt2_model_load: n_embd  = 6144
gpt2_model_load: n_head  = 48
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype   = 2002
gpt2_model_load: qntvr   = 2
gpt2_model_load: ggml ctx size = 17928.72 MB
ggml_aligned_malloc: insufficient memory (attempted to allocate 17928.72 MB)
GGML_ASSERT: ggml.c:4399: ctx->mem_buffer != NULL
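The ~17.9 GB context size in the log above can be roughly sanity-checked from the parameters the loader prints. This is only a back-of-the-envelope sketch: the parameter count, the bits-per-weight figure for q4_0, and the f16 KV-cache assumption are all approximations, not values from the log; only n_layer, n_ctx and n_embd come from the output above.

```python
# Rough breakdown of the ~17.9 GB ggml context size reported in the log.
# Only n_layer, n_ctx and n_embd are taken from the log; everything else
# (parameter count, bits per weight, KV-cache precision) is an assumption.
n_layer, n_ctx, n_embd = 40, 8192, 6144
n_params = 15.5e9  # assumed ~15B-parameter model

weights_gb = n_params * 4.5 / 8 / 1e9                   # q4_0 ~ 4.5 bits/weight
kv_cache_gb = 2 * n_layer * n_ctx * n_embd * 2 / 1e9    # K and V, f16 assumed

total_gb = weights_gb + kv_cache_gb
print(f"weights ~ {weights_gb:.1f} GB, KV cache ~ {kv_cache_gb:.1f} GB, "
      f"total ~ {total_gb:.1f} GB")
```

Under these assumptions the weights come to roughly 8.7 GB and the KV cache (sized for the full 8192-token context) to roughly 8 GB, which lands close to the reported 17.9 GB; scratch buffers would account for the remainder. It also suggests why the model can be hard to fit in 16 GB of RAM regardless of the loader.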
You were right that more memory was required than the system currently has (it was trying to allocate almost 18GB); however, this did not happen on Windows with the same model.
Anyway, is there a way to lower memory consumption? How does Windows allow the model to run?
If I give it a smaller context window then it may just work. How do I give it a smaller context?
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained('starcoder.bin', context_length=2000, model_type='starcoder')
print(llm('What is weather like in NY today?'))
Can you please try running ctransformers on Windows and see if it works.
Are you running Linux on WSL? WSL has less memory allocated compared to Windows. I'm guessing on Windows also it requires more memory but using swap when it runs out of memory and on Linux the swap might not be enough. You can try adding more swap on Linux and see if it works.
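As a rough guide to how much swap to add, the shortfall can be estimated from the allocation the log reported and the machine's RAM. The OS-headroom figure below is an assumption, not a measurement; only the 17928.72 MB value comes from the error message.

```python
# Estimate the extra swap needed, assuming the ~17.9 GB allocation from the
# ggml error message and 16 GB of physical RAM. The 2 GB OS headroom is an
# assumed figure for the OS and other running processes.
model_alloc_mb = 17928.72    # from "attempted to allocate 17928.72 MB"
ram_mb = 16 * 1024           # physical RAM on the Debian machine
os_overhead_mb = 2 * 1024    # assumed headroom for OS and other processes

shortfall_mb = model_alloc_mb - (ram_mb - os_overhead_mb)
print(f"Suggested minimum extra swap: {shortfall_mb / 1024:.1f} GiB")
```

Under these assumptions a swap file of roughly 4 GiB or more should cover the shortfall, at the cost of much slower inference whenever the model spills into swap.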
Currently, changing context_length is not supported for starcoder models. You can try reducing batch_size:
llm = AutoModelForCausalLM.from_pretrained(..., batch_size=1)
I have only one system with 16GB RAM, which currently has Debian 12 on it. I'm not running on WSL, because I think it would require even more RAM. So my only current option is to try your code above, with a batch size of 1. I will try this today and come back with results.
llm = AutoModelForCausalLM.from_pretrained(model_path=model, batch_size=1)
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
@superchargez How can you get this information from ctransformers? I can't see it while loading or running the model. Is there a verbose flag?
It was run in Kobold. I wanted to show that the memory requirement was exceeded, which is why it was not working with ctransformers either.
Hi, I am trying to use "TheBloke/WizardCoder-Guanaco-15B-V1.0-GGML", however, I am getting the following error:
I get the same error with abacaj's replit inference code, even though I replaced the model type and model on lines 48 and 49, and even changed the context length to 4444.