Closed s-kostyaev closed 1 year ago
I see code is updated, so this is output of commands.
Thanks @s-kostyaev, I was actually asking bgonzalezfractal to run it so that I can check and compare the output on their system as well :)
Since you already built it, can you also run ./build/lib/main on a starcoder model? Yesterday it was giving an empty response.
Sure.
% ./build/lib/main starcoder ../LocalAI/models/starchat-alpha-ggml-q4_0.bin
model type : 'starcoder'
model path : '../LocalAI/models/starchat-alpha-ggml-q4_0.bin'
prompt : 'Hi'
load ... ✔
tokenize ... ✔
> [ 12575 ]
eval ... ✔
sample ... ✔
> 399
detokenize ... ✔
> ' A'
delete ... ✔
Thanks. So the C++ code works fine natively and doesn't have any issue. I will have to debug why it is failing from Python.
@s-kostyaev I found another issue https://github.com/LibRaw/LibRaw/issues/437#issue-1065648301 which looks similar to the error you posted previously https://github.com/marella/ctransformers/issues/8#issuecomment-1557635980
They mention it to be a stack size limit issue which gets worse with multiple threads.
So can you please try using threads=1 after building from source (I added some print statements):
git clone --recurse-submodules https://github.com/marella/ctransformers
cd ctransformers
git checkout debug
./scripts/build.sh
llm = AutoModelForCausalLM.from_pretrained(..., lib='/path/to/ctransformers/build/lib/libctransformers.dylib')
print(llm('Hi', max_new_tokens=1, threads=1))
Also please run with threads=4
and share both the outputs.
In the above thread, they also suggested increasing the stack size limit, but I'm not sure what an ideal limit would be.
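If the stack size limit does turn out to be the culprit, one way to experiment (a sketch of my own, not something prescribed in this thread) is to raise the soft stack limit to the hard limit before loading the model, using Python's standard resource module on Unix:

```python
import resource

# Each native thread gets its own stack, so a small soft limit can be
# exhausted sooner as the thread count grows. Raising the soft limit
# up to the existing hard limit requires no extra privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print(f"stack limit before: soft={soft}, hard={hard}")
resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
```

Note that some C libraries read this limit only once at process start, so running `ulimit -s unlimited` (or a concrete value) in the shell before launching Python may be the more reliable way to test this.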
Sure. Will test it.
With single thread:
% python3 test.py
ggml_graph_compute: n_threads = 0
ggml_graph_compute: create thread pool
ggml_graph_compute: initialize tasks + work buffer
ggml_graph_compute: allocating work buffer for graph (26048 bytes)
ggml_graph_compute: compute nodes
And it got stuck.
Are you using threads=1? Because it is printing n_threads = 0! Can you also please check with threads=4.
Sure.
% python3 test.py
ggml_graph_compute: n_threads = 0
ggml_graph_compute: create thread pool
ggml_graph_compute: initialize tasks + work buffer
ggml_graph_compute: allocating work buffer for graph (26048 bytes)
ggml_graph_compute: compute nodes
This is with 4 threads, and it is even set in two places: the config and the llm eval call.
Thanks. I think I found the issue. I will make a new release and let you know shortly.
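For context, the symptom above (n_threads = 0 reaching the compute loop) is the kind of bug a defensive default can guard against. A hypothetical sketch of normalizing the thread count on the Python side before handing it to native code (this is my own illustration, not the actual ctransformers fix):

```python
import os

def resolve_thread_count(threads=None):
    # Treat None, 0 and negative values as "auto": fall back to the
    # CPU count so the native compute loop never sees n_threads = 0.
    if threads is None or threads < 1:
        return os.cpu_count() or 1
    return int(threads)
```

With this guard, `resolve_thread_count(0)` and `resolve_thread_count(None)` both return the machine's CPU count instead of passing 0 through to ggml.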
@marella sorry I've been working like crazy, I see @s-kostyaev executed the necessary commands, if you need anything else from my hardware just let me know, glad you guys found it.
No worries @bgonzalezfractal
@s-kostyaev I released a fix in the latest version 0.2.1. Please update:
pip install --upgrade ctransformers
and let me know if it works. Please don't set the lib=... option.
Also please try running with different threads (1, 4, 8) and let me know if you see any change in performance.
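One way to compare thread counts is a small timing helper; the helper name and structure below are my own sketch, not part of the ctransformers API:

```python
import time

def benchmark_threads(generate, thread_counts=(1, 4, 8)):
    """Time generate(threads=n) for each n; returns {n: seconds}.

    `generate` is any callable accepting a `threads` keyword, e.g.
    lambda threads: llm('Hi', max_new_tokens=8, threads=threads).
    """
    timings = {}
    for n in thread_counts:
        start = time.perf_counter()
        generate(threads=n)
        timings[n] = time.perf_counter() - start
    return timings
```

Comparing the resulting timings across 1, 4 and 8 threads shows whether the threads parameter is actually reaching the compute loop.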
Finally it works. The threads parameter works. It even works with conda now. Thank you!
Thanks a lot @s-kostyaev for helping in debugging the issue.
Trying a simple example on an M1 Mac:
leads to a segmentation fault. The model works fine with the ggml example code.
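When a Python process dies with a segmentation fault inside a native extension, enabling the standard-library faulthandler before loading the model at least shows which Python frame triggered the crash. This is a general debugging tip, not something this thread prescribes:

```python
import faulthandler

# Dump a Python traceback to stderr on SIGSEGV/SIGFPE/SIGABRT/SIGBUS,
# which helps locate the call into the native library that crashed.
faulthandler.enable()
```

Running the failing script with `python3 -X faulthandler test.py` achieves the same without modifying the code.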