marella / ctransformers

Python bindings for Transformer models implemented in C/C++ using the GGML library.

How to use wizard coder #55

Open superchargez opened 1 year ago

superchargez commented 1 year ago

Hi, I am trying to use "TheBloke/WizardCoder-Guanaco-15B-V1.0-GGML", but I am getting the following error:

GGML_ASSERT: /home/runner/work/ctransformers/ctransformers/models/ggml/ggml.c:4103: ctx->mem_buffer != NULL
Aborted

I get the same error with abacaj's Replit inference code, even though I replaced the model type and model on lines 48 and 49 and changed the context length to 4444.

superchargez commented 1 year ago

I also tried the following: I downloaded the model from TheBloke (Hugging Face) and used it with this code:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained('/path/to/wizardcoder.bin', model_type='starcoder')
superchargez commented 1 year ago

This works in Google Colab, though only if you enable the GPU.

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained('/path/to/wizardcoder.bin', model_type='starcoder')

Is there a way to run it locally, WITHOUT a GPU, please?
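
For reference, a minimal CPU-only sketch (the path is a placeholder for the downloaded file, and it assumes the threads and gpu_layers config options that from_pretrained accepts also apply to starcoder-type models):

from ctransformers import AutoModelForCausalLM

# Placeholder path to the locally downloaded GGML file.
model_path = '/path/to/wizardcoder-guanaco-15b-v1.0.ggmlv1.q4_0.bin'

# gpu_layers=0 keeps every layer on the CPU; threads caps the CPU threads used.
llm = AutoModelForCausalLM.from_pretrained(
    model_path,
    model_type='starcoder',
    gpu_layers=0,
    threads=8,
)

print(llm('def fibonacci(n):', max_new_tokens=64))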

marella commented 1 year ago

Hi, it looks like a memory issue. How much RAM do you have?

superchargez commented 1 year ago

I have 16 GB of RAM and I am using Debian 12, so RAM should not be the issue here. I got it working in the free Google Colab, which provides 12 GB of RAM. Also, the same model was running fine in Kobold (on Windows) on the same machine.

marella commented 1 year ago

Which file did you download from here? Did you use the same file in Google Colab as well? Are you running it in WSL on your machine? Have you tried running on Windows?

superchargez commented 11 months ago

The file I used is the smallest one there: https://huggingface.co/TheBloke/WizardCoder-Guanaco-15B-V1.0-GGML/resolve/main/wizardcoder-guanaco-15b-v1.0.ggmlv1.q4_0.bin

How much RAM should it use? (I think it can't run in Colab, even though it appears that RAM consumption does not reach the 12 GB limit.) When I tried the same file with Kobold on Windows it worked; however, I got the error when I tried it on Linux (with ctransformers).

I will test this again in a few days and report my findings.
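
As a rough pre-flight check, here is a minimal sketch (the path is a placeholder): the whole weight file has to fit in memory, and the KV cache and scratch buffers come on top of that, so the file size is only a lower bound on the RAM needed.

import os

# Placeholder path to the downloaded q4_0 file.
model_path = '/path/to/wizardcoder-guanaco-15b-v1.0.ggmlv1.q4_0.bin'

file_size_gib = os.path.getsize(model_path) / 1024**3
# The process needs at least this much RAM for the weights alone;
# the KV cache and scratch buffers (which grow with context length) add more.
print(f'model file: {file_size_gib:.1f} GiB; expect the process to need noticeably more')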

superchargez commented 11 months ago

Tried to run it with Kobold (on Linux) and got the following error:

System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
gpt2_model_load: loading model from '/home/jawad/Downloads/models/wizardcoder-guanaco-15b-v1.0.ggmlv1.q4_0.bin'
gpt2_model_load: n_vocab = 49153
gpt2_model_load: n_ctx   = 8192 (2048)
gpt2_model_load: n_embd  = 6144
gpt2_model_load: n_head  = 48
gpt2_model_load: n_layer = 40
gpt2_model_load: ftype   = 2002
gpt2_model_load: qntvr   = 2
gpt2_model_load: ggml ctx size = 17928.72 MB
ggml_aligned_malloc: insufficient memory (attempted to allocate 17928.72 MB)
GGML_ASSERT: ggml.c:4399: ctx->mem_buffer != NULL

You were right that more memory was required than I currently have on the system (it was trying to allocate almost 18 GB); however, this did not happen on Windows with the same model.

Anyway, is there a way to lower memory consumption? How does Windows allow the model to run?

superchargez commented 11 months ago

If I give it a smaller context window then it may just work. How do I give it a smaller context?

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained('starcoder.bin', context_length=2000, model_type='starcoder')

print(llm('What is weather like in NY today?'))
marella commented 11 months ago

Can you please try running ctransformers on Windows and see if it works?

Are you running Linux in WSL? WSL has less memory allocated compared to Windows. I'm guessing that on Windows it also needs more memory but falls back to swap when it runs out, and on Linux the swap might not be enough. You can try adding more swap on Linux and see if it works.

Currently changing context_length is not supported for starcoder models. You can try reducing batch_size:

llm = AutoModelForCausalLM.from_pretrained(..., batch_size=1)
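
A complete call along those lines might look like this sketch (paths are placeholders, and the optional psutil check only reports how much RAM and swap are available before loading; batch_size=1 is the reduction suggested above, trading slower prompt processing for a smaller peak during prompt evaluation):

import psutil  # optional third-party dependency, only used for the memory report below
from ctransformers import AutoModelForCausalLM

# Report available RAM and swap so an allocation failure is less of a surprise.
mem, swap = psutil.virtual_memory(), psutil.swap_memory()
print(f'available RAM: {mem.available / 1024**3:.1f} GiB, free swap: {swap.free / 1024**3:.1f} GiB')

# Placeholder path to the locally downloaded GGML file.
model_path = '/path/to/wizardcoder-guanaco-15b-v1.0.ggmlv1.q4_0.bin'

llm = AutoModelForCausalLM.from_pretrained(
    model_path,
    model_type='starcoder',
    batch_size=1,  # process the prompt one token at a time to lower peak memory
)

print(llm('Write a Python function that reverses a string.', max_new_tokens=128))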
superchargez commented 11 months ago

I have only one system with 16 GB of RAM, which currently has Debian 12 on it. I'm not running WSL, because I think it would require even more RAM. So my only current option is to try your code above with a batch size of 1. I will try this today and come back with results.

llm = AutoModelForCausalLM.from_pretrained(model_path=model, batch_size=1)
AayushSameerShah commented 9 months ago

System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

@superchargez How can you get this information from ctransformers? I can't see this information while loading or running the model. Is there a verbose flag?

superchargez commented 9 months ago

It was run in Kobold. I wanted to show that the memory requirement was exceeded, which is why it was not working with ctransformers either.