Closed: Naugustogi closed this issue 1 year ago
@Naugustogi, `mlock` should be supported. Do you get any errors?
Btw, loading the model does not take much time; it is almost instant now. Why is it that annoying?
I'm using pyllamacpp v1.0.6. It crashes when I use use_mlock=True; I'm also using f16_kv=1.
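Roughly how I'm invoking it (a sketch from memory; the constructor keyword names and the generate signature may not match v1.0.6 exactly, and the model path is just illustrative):

```python
from pyllamacpp.model import Model

# Illustrative path; constructor kwargs other than use_mlock/f16_kv are assumed
model = Model(
    ggml_model='./models/gpt4-x-alpaca-13b-ggml-q4_0.bin',
    n_ctx=512,
    use_mlock=True,  # this is the flag that triggers the crash for me
    f16_kv=1,
)

# Assuming generate() yields tokens one by one
for token in model.generate("Hello, ", n_predict=32):
    print(token, end='', flush=True)
```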
Also:

```
llama_print_timings:        load time = 69042.31 ms
llama_print_timings:      sample time =    14.22 ms /  33 runs   (  0.43 ms per run)
llama_print_timings: prompt eval time = 60306.69 ms / 108 tokens (558.40 ms per token)
llama_print_timings:        eval time = 18046.62 ms /  32 runs   (563.96 ms per run)
llama_print_timings:       total time = 87104.24 ms
```
Which is way too slow, I think. The model is GPT4-x-Alpaca 13B; I'm using 16 GB RAM and an Intel Core i5-7400.
It definitely works faster if I use the base llama.cpp; there I'm getting about 4 tokens/s.
@Naugustogi I think that error is coming from the ggml library.
Everything is working normally on my side.
Could you please try to build it from source?
I am unable to rebuild and have to rely on other people's uploads. You can close this issue if you want. For now I have to wait for speed improvements. Model loading and staying in RAM is fine; it just takes a bit of time in my case. The initial problem wasn't mlock; I simply mistook the loading time for the generation time.
@Naugustogi why can't you rebuild? If you succeeded in running llama.cpp, then the process is straightforward: you only need cmake, then run pip install from the GitHub repo!
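Something like this should work (assuming the upstream repo URL and that the llama.cpp sources are pulled in as a git submodule; cmake must already be on your PATH):

```bash
# clone with submodules so the bundled llama.cpp sources are present
git clone --recursive https://github.com/abdeladim-s/pyllamacpp.git
cd pyllamacpp
# pip drives the native cmake build during installation
pip install .
```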
I'm using Windows. On the other hand, llama.cpp works fine while keeping the model in RAM; having to load the model each time I want to use it is the annoying part.