go-skynet / go-llama.cpp

llama.cpp golang bindings
MIT License

How to resolve this error? Trying to run cuBLAS #218

Closed: hiqsociety closed this issue 11 months ago

hiqsociety commented 11 months ago

@MathiasGS, can you help with this, please?

root@ubuntu:/usr/local/src/go-llama.cpp# CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "../llama.cpp/models/speechless-llama2-13b.Q4_K_M.gguf" -t 14
# github.com/go-skynet/go-llama.cpp
binding.cpp: In function ‘int llama_predict(void*, void*, char*, bool)’:
binding.cpp:332:53: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 2 has type ‘int’ [-Wformat=]
  332 |                 printf("<<input too long: skipped %zu token%s>>", skipped_tokens, skipped_tokens != 1 ? "s" : "");
      |                                                   ~~^             ~~~~~~~~~~~~~~
      |                                                     |             |
      |                                                     |             int
      |                                                     long unsigned int
      |                                                   %u
binding.cpp: In function ‘void llama_binding_free_model(void*)’:
binding.cpp:797:5: warning: possible problem detected in invocation of ‘operator delete’ [-Wdelete-incomplete]
  797 |     delete ctx->model;
      |     ^~~~~~~~~~~~~~~~~
binding.cpp:797:17: warning: invalid use of incomplete type ‘struct llama_model’
  797 |     delete ctx->model;
      |            ~~~~~^~~~~
In file included from ./llama.cpp/common/common.h:5,
                 from binding.cpp:1:
./llama.cpp/llama.h:60:12: note: forward declaration of ‘struct llama_model’
   60 |     struct llama_model;
      |            ^~~~~~~~~~~
binding.cpp:797:5: note: neither the destructor nor the class-specific ‘operator delete’ will be called, even if they are declared when the class is defined
  797 |     delete ctx->model;
      |     ^~~~~~~~~~~~~~~~~
create_gpt_params: loading model ../llama.cpp/models/speechless-llama2-13b.Q4_K_M.gguf
SIGSEGV: segmentation violation
PC=0x7f4c8d74bfbd m=0 sigcode=1
signal arrived during cgo execution

goroutine 1 [syscall]:
runtime.cgocall(0x49f6e0, 0xc00005ca90)
    /usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc00005ca68 sp=0xc00005ca30 pc=0x41522b
github.com/go-skynet/go-llama%2ecpp._Cfunc_load_model(0x2a1e100, 0x80, 0x0, 0x1, 0x0, 0x1, 0x1, 0x0, 0x0, 0x200, ...)
    _cgo_gotypes.go:267 +0x4f fp=0xc00005ca90 sp=0xc00005ca68 pc=0x49c04f
github.com/go-skynet/go-llama%2ecpp.New({0x7ffccaeca64a, 0x35}, {0xc00005ce20, 0x4, 0x1?})
    /usr/local/src/go-llama.cpp/llama.go:39 +0x385 fp=0xc00005cca0 sp=0xc00005ca90 pc=0x49c7a5
main.main()
    /usr/local/src/go-llama.cpp/examples/main.go:37 +0x3bd fp=0xc00005cf40 sp=0xc00005cca0 pc=0x49e93d
runtime.main()
    /usr/local/go/src/runtime/proc.go:267 +0x2bb fp=0xc00005cfe0 sp=0xc00005cf40 pc=0x445c9b
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00005cfe8 sp=0xc00005cfe0 pc=0x46fd21

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004cfa8 sp=0xc00004cf88 pc=0x4460ee
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:404
runtime.forcegchelper()
    /usr/local/go/src/runtime/proc.go:322 +0xb3 fp=0xc00004cfe0 sp=0xc00004cfa8 pc=0x445f73
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004cfe8 sp=0xc00004cfe0 pc=0x46fd21
created by runtime.init.6 in goroutine 1
    /usr/local/go/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004d778 sp=0xc00004d758 pc=0x4460ee
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:404
runtime.bgsweep(0x0?)
    /usr/local/go/src/runtime/mgcsweep.go:280 +0x94 fp=0xc00004d7c8 sp=0xc00004d778 pc=0x432a14
runtime.gcenable.func1()
    /usr/local/go/src/runtime/mgc.go:200 +0x25 fp=0xc00004d7e0 sp=0xc00004d7c8 pc=0x427da5
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004d7e8 sp=0xc00004d7e0 pc=0x46fd21
created by runtime.gcenable in goroutine 1
    /usr/local/go/src/runtime/mgc.go:200 +0x66

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc000076000?, 0x59f718?, 0x1?, 0x0?, 0xc0000071e0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004df70 sp=0xc00004df50 pc=0x4460ee
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:404
runtime.(*scavengerState).park(0xa26d60)
    /usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc00004dfa0 sp=0xc00004df70 pc=0x4302a9
runtime.bgscavenge(0x0?)
    /usr/local/go/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc00004dfc8 sp=0xc00004dfa0 pc=0x43083c
runtime.gcenable.func2()
    /usr/local/go/src/runtime/mgc.go:201 +0x25 fp=0xc00004dfe0 sp=0xc00004dfc8 pc=0x427d45
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004dfe8 sp=0xc00004dfe0 pc=0x46fd21
created by runtime.gcenable in goroutine 1
    /usr/local/go/src/runtime/mgc.go:201 +0xa5

goroutine 18 [finalizer wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000048628 sp=0xc000048608 pc=0x4460ee
runtime.runfinq()
    /usr/local/go/src/runtime/mfinal.go:193 +0x107 fp=0xc0000487e0 sp=0xc000048628 pc=0x426e27
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000487e8 sp=0xc0000487e0 pc=0x46fd21
created by runtime.createfing in goroutine 1
    /usr/local/go/src/runtime/mfinal.go:163 +0x3d

rax    0x0
rbx    0x2a1e9a8
rcx    0x7f4c83419c80
rdx    0x0
rdi    0x2a1e9a8
rsi    0x7ffccaec8ae0
rbp    0x7ffccaec8ae0
rsp    0x7ffccaec87c0
r8     0x57
r9     0x2a1ebe0
r10    0x7f4c8d60d258
r11    0x7f4c83419ce0
r12    0x0
r13    0x0
r14    0x2a1e9b8
r15    0x7ffccaec8ac0
rip    0x7f4c8d74bfbd
rflags 0x10246
cs     0x33
fs     0x0
gs     0x0
exit status 2
root@ubuntu:/usr/local/src/go-llama.cpp# 

maxjust commented 11 months ago

Same problem here.

hiqsociety commented 11 months ago

@maxjust if you find the solution, please post it here. Thanks!

gsiehien commented 11 months ago

Hi!

The issue arises in the common.cpp file after applying the 1902-cuda.patch, which introduces the functions create_gpt_params() and load_binding_model(). For some reason, merely adding an fprintf() somewhere within the create_gpt_params() function causes everything to work as intended.

gpt_params* create_gpt_params(const std::string& fname,const std::string& lora,const std::string& lora_base) {
    gpt_params* lparams = new gpt_params;
    fprintf(stderr, "%s: loading model %s\n", __func__, fname.c_str());

    // Initialize the 'model' member with the 'fname' parameter
    lparams->model = fname;
    lparams->lora_base = lora_base;
    lparams->lora_adapter = lora;
    if (lparams->lora_adapter.empty()) {
        fprintf(stderr, "no lora, disable mmap"); // <---- ?
        lparams->use_mmap = false;
    }

    return lparams;
}

And yes - on the surface, this doesn't make any sense. However, it's likely masking the root cause of the problem, at least on my machine (and as such should not be treated as a solution). It might influence compiler optimizations or the memory state at a given moment.

Something is probably happening around the setting of the LORA stuff in gpt_params and its transfer from function to function. I'm still trying to figure it out.

mudler commented 11 months ago

I have already had such issues in the past - that's the whole point of having the patch (which I would have avoided entirely, if possible). I opened a PR upstream trying to fix this in the correct way, but it was rejected due to code style: https://github.com/ggerganov/llama.cpp/pull/1902. The copy-by-value all over the code seems to trigger misalignment of structures on different combinations of toolchains, triggering this.

It looks like a combination of nvcc version + gcc + Go is needed to trigger this. I used valgrind as well to debug this in the past, carefully trying to find the culprit, but nothing actually seems to indicate what's behind the real issue code-wise, so we are back to hacks all the way down.

I'll try to reproduce this on a GPU; however, it really takes time and patience to play with valgrind and the like.

hiqsociety commented 11 months ago

@gsiehien it's not working. @mudler can you provide a fix?

edited common.cpp

gpt_params* create_gpt_params(const std::string& fname,const std::string& lora,const std::string& lora_base) {
    gpt_params* lparams = new gpt_params;
    fprintf(stderr, "%s: loading model %s\n", __func__, fname.c_str());

    // Initialize the 'model' member with the 'fname' parameter
    lparams->model = fname;
    lparams->lora_base = lora_base;
    lparams->lora_adapter = lora;
    if (lparams->lora_adapter.empty()) {
        fprintf(stderr, "no lora, disable mmap"); // <---- ?
        lparams->use_mmap = false;
    }

    return lparams;
}

I even edited the patch. Nothing works.

diff --git a/common/common.cpp b/common/common.cpp
index 3138213..af93a32 100644
--- a/common/common.cpp
+++ b/common/common.cpp
@@ -1257,3 +1257,83 @@ void dump_non_result_info_yaml(FILE * stream, const gpt_params & params, const l
     fprintf(stream, "typical_p: %f # default: 1.0\n", params.typical_p);
     fprintf(stream, "verbose_prompt: %s # default: false\n", params.verbose_prompt ? "true" : "false");
 }
+
+gpt_params* create_gpt_params(const std::string& fname,const std::string& lora,const std::string& lora_base) {
+   gpt_params* lparams = new gpt_params;
+    fprintf(stderr, "%s: loading model %s\n", __func__, fname.c_str());
+
+    // Initialize the 'model' member with the 'fname' parameter
+    lparams->model = fname;
+    lparams->lora_base = lora_base;
+    lparams->lora_adapter = lora;
+    if (lparams->lora_adapter.empty()) {
+        fprintf(stderr, "no lora, disable mmap"); // <---- ?
+        lparams->use_mmap = false;
+    }
+
+    return lparams;
+}
+

error


root@ubuntu:/usr/local/src/go-llama.cpp# BUILD_TYPE=cublas make libbinding.a
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I./llama.cpp -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -march=native -mtune=native
I CXXFLAGS: -I./llama.cpp -I. -I./llama.cpp/common -I./common -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I CGO_LDFLAGS:  
I LDFLAGS:  
I BUILD_TYPE:  cublas
I CMAKE_ARGS:  -DLLAMA_CUBLAS=ON
I EXTRA_TARGETS:  llama.cpp/ggml-cuda.o
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

make: 'libbinding.a' is up to date.
root@ubuntu:/usr/local/src/go-llama.cpp# CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "../llama.cpp/models/speechless-llama2-13b.Q3_K_S.gguf" -t 14 -ngl 1
# github.com/go-skynet/go-llama.cpp
binding.cpp: In function ‘int llama_predict(void*, void*, char*, bool)’:
binding.cpp:332:53: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 2 has type ‘int’ [-Wformat=]
  332 |                 printf("<<input too long: skipped %zu token%s>>", skipped_tokens, skipped_tokens != 1 ? "s" : "");
      |                                                   ~~^             ~~~~~~~~~~~~~~
      |                                                     |             |
      |                                                     |             int
      |                                                     long unsigned int
      |                                                   %u
binding.cpp: In function ‘void llama_binding_free_model(void*)’:
binding.cpp:797:5: warning: possible problem detected in invocation of ‘operator delete’ [-Wdelete-incomplete]
  797 |     delete ctx->model;
      |     ^~~~~~~~~~~~~~~~~
binding.cpp:797:17: warning: invalid use of incomplete type ‘struct llama_model’
  797 |     delete ctx->model;
      |            ~~~~~^~~~~
In file included from ./llama.cpp/common/common.h:5,
                 from binding.cpp:1:
./llama.cpp/llama.h:60:12: note: forward declaration of ‘struct llama_model’
   60 |     struct llama_model;
      |            ^~~~~~~~~~~
binding.cpp:797:5: note: neither the destructor nor the class-specific ‘operator delete’ will be called, even if they are declared when the class is defined
  797 |     delete ctx->model;
      |     ^~~~~~~~~~~~~~~~~
create_gpt_params: loading model ../llama.cpp/models/speechless-llama2-13b.Q3_K_S.gguf
SIGSEGV: segmentation violation
PC=0x7fa9b374bfbd m=0 sigcode=1
signal arrived during cgo execution

goroutine 1 [syscall]:
runtime.cgocall(0x49f6e0, 0xc000061a90)
    /usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc000061a68 sp=0xc000061a30 pc=0x41522b
github.com/go-skynet/go-llama%2ecpp._Cfunc_load_model(0x2926100, 0x80, 0x0, 0x1, 0x0, 0x1, 0x1, 0x0, 0x1, 0x200, ...)
    _cgo_gotypes.go:267 +0x4f fp=0xc000061a90 sp=0xc000061a68 pc=0x49c04f
github.com/go-skynet/go-llama%2ecpp.New({0x7fff3c91d642, 0x35}, {0xc000061e20, 0x4, 0x1?})
    /usr/local/src/go-llama.cpp/llama.go:39 +0x385 fp=0xc000061ca0 sp=0xc000061a90 pc=0x49c7a5
main.main()
    /usr/local/src/go-llama.cpp/examples/main.go:37 +0x3bd fp=0xc000061f40 sp=0xc000061ca0 pc=0x49e93d
runtime.main()
    /usr/local/go/src/runtime/proc.go:267 +0x2bb fp=0xc000061fe0 sp=0xc000061f40 pc=0x445c9b
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000061fe8 sp=0xc000061fe0 pc=0x46fd21

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004cfa8 sp=0xc00004cf88 pc=0x4460ee
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:404
runtime.forcegchelper()
    /usr/local/go/src/runtime/proc.go:322 +0xb3 fp=0xc00004cfe0 sp=0xc00004cfa8 pc=0x445f73
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004cfe8 sp=0xc00004cfe0 pc=0x46fd21
created by runtime.init.6 in goroutine 1
    /usr/local/go/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004d778 sp=0xc00004d758 pc=0x4460ee
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:404
runtime.bgsweep(0x0?)
    /usr/local/go/src/runtime/mgcsweep.go:280 +0x94 fp=0xc00004d7c8 sp=0xc00004d778 pc=0x432a14
runtime.gcenable.func1()
    /usr/local/go/src/runtime/mgc.go:200 +0x25 fp=0xc00004d7e0 sp=0xc00004d7c8 pc=0x427da5
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004d7e8 sp=0xc00004d7e0 pc=0x46fd21
created by runtime.gcenable in goroutine 1
    /usr/local/go/src/runtime/mgc.go:200 +0x66

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc000076000?, 0x59f718?, 0x1?, 0x0?, 0xc0000071e0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004df70 sp=0xc00004df50 pc=0x4460ee
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:404
runtime.(*scavengerState).park(0xa26d60)
    /usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc00004dfa0 sp=0xc00004df70 pc=0x4302a9
runtime.bgscavenge(0x0?)
    /usr/local/go/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc00004dfc8 sp=0xc00004dfa0 pc=0x43083c
runtime.gcenable.func2()
    /usr/local/go/src/runtime/mgc.go:201 +0x25 fp=0xc00004dfe0 sp=0xc00004dfc8 pc=0x427d45
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004dfe8 sp=0xc00004dfe0 pc=0x46fd21
created by runtime.gcenable in goroutine 1
    /usr/local/go/src/runtime/mgc.go:201 +0xa5

goroutine 5 [finalizer wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004e628 sp=0xc00004e608 pc=0x4460ee
runtime.runfinq()
    /usr/local/go/src/runtime/mfinal.go:193 +0x107 fp=0xc00004e7e0 sp=0xc00004e628 pc=0x426e27
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004e7e8 sp=0xc00004e7e0 pc=0x46fd21
created by runtime.createfing in goroutine 1
    /usr/local/go/src/runtime/mfinal.go:163 +0x3d

rax    0x0
rbx    0x29269a8
rcx    0x7fa9a9419c80
rdx    0x0
rdi    0x29269a8
rsi    0x7fff3c91c5a0
rbp    0x7fff3c91c5a0
rsp    0x7fff3c91c280
r8     0x57
r9     0x2926be0
r10    0x7fa9b360d258
r11    0x7fa9a9419ce0
r12    0x0
r13    0x0
r14    0x29269b8
r15    0x7fff3c91c580
rip    0x7fa9b374bfbd
rflags 0x10246
cs     0x33
fs     0x0
gs     0x0

mudler commented 11 months ago

@gsiehien it's not working. @mudler can you provide a fix?

I will have a look at it as soon as I have some free cycles. As I mentioned already, it takes a bit of time using valgrind and the like. If someone else meanwhile wants to take a stab at it, go ahead.

hiqsociety commented 11 months ago

@mudler OK, thanks. Can I buy you some coffee so you can speed up your cycles? Please help fix this and I will buy you coffees for it. I've been waiting for a long time; other ways of working with llama.cpp (e.g. gpt-llama.cpp) don't work.

Please help. Thanks!

mudler commented 11 months ago

@hiqsociety can you try https://github.com/go-skynet/go-llama.cpp/pull/224 ?

Clone the repo again, using the specific branch:

git clone --recurse-submodules https://github.com/go-skynet/go-llama.cpp --branch workaround_cuda_bugs

mudler commented 11 months ago

The PR was merged; please try master.

gsiehien commented 11 months ago

@mudler - I can confirm that it works. It's nice to have the workaround in the master branch, as it simplifies the builds. Thanks!

mudler commented 11 months ago

@mudler - I can confirm that it works. It's nice to have the workaround in the master branch, as it simplifies the builds. Thanks!

Awesome! Thanks for confirming 👍

hiqsociety commented 11 months ago

@mudler you are a lifesaver! Where can I buy you a coffee? I really appreciate the work on this.

hiqsociety commented 11 months ago

@mudler everything compiles and "runs", but it crashes with the errors below. How do I remedy this?

Please check, thanks.

llm_load_tensors: offloading k cache to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 5429 MB
...................................................................................................
llama_new_context_with_model: kv self size  =  100.00 MB
llama_new_context_with_model: compute buffer total size =   22.47 MB
llama_new_context_with_model: VRAM scratch buffer: 21.00 MB
Model loaded successfully.
>>> write an article on elon musk

Sending write an article on elon musk

 write an article on elon musk
SIGSEGV: segmentation violation
PC=0x51eed0 m=0 sigcode=1
signal arrived during cgo execution

goroutine 1 [syscall]:
runtime.cgocall(0x49f540, 0xc000061968)
    /usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc000061940 sp=0xc000061908 pc=0x41510b
github.com/go-skynet/go-llama%2ecpp._Cfunc_llama_predict(0x4e5c1590, 0x252c7a0, 0xc00009c000, 0x1)
    _cgo_gotypes.go:236 +0x4b fp=0xc000061968 sp=0xc000061940 pc=0x49be2b
github.com/go-skynet/go-llama%2ecpp.(*LLama).Predict.func2(0x57dfa0?, 0xc000061b70?, {0xc00009c000, 0x1?, 0xc000061a28?}, 0xc00001e080?)
    /usr/local/src/go-llama.cpp/llama.go:312 +0x98 fp=0xc0000619b8 sp=0xc000061968 pc=0x49d898
github.com/go-skynet/go-llama%2ecpp.(*LLama).Predict(0xc000012018, {0xc00001e080, 0x1e}, {0xc000061e40, 0x8, 0x0?})
    /usr/local/src/go-llama.cpp/llama.go:312 +0x28f fp=0xc000061ca0 sp=0xc0000619b8 pc=0x49d54f
main.main()
    /usr/local/src/go-llama.cpp/examples/main.go:49 +0x81a fp=0xc000061f40 sp=0xc000061ca0 pc=0x49ec7a
runtime.main()
    /usr/local/go/src/runtime/proc.go:267 +0x2bb fp=0xc000061fe0 sp=0xc000061f40 pc=0x445b7b
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000061fe8 sp=0xc000061fe0 pc=0x46fc01

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004cfa8 sp=0xc00004cf88 pc=0x445fce
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:404
runtime.forcegchelper()
    /usr/local/go/src/runtime/proc.go:322 +0xb3 fp=0xc00004cfe0 sp=0xc00004cfa8 pc=0x445e53
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004cfe8 sp=0xc00004cfe0 pc=0x46fc01
created by runtime.init.6 in goroutine 1
    /usr/local/go/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004d778 sp=0xc00004d758 pc=0x445fce
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:404
runtime.bgsweep(0x0?)
    /usr/local/go/src/runtime/mgcsweep.go:280 +0x94 fp=0xc00004d7c8 sp=0xc00004d778 pc=0x4328f4
runtime.gcenable.func1()
    /usr/local/go/src/runtime/mgc.go:200 +0x25 fp=0xc00004d7e0 sp=0xc00004d7c8 pc=0x427c85
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004d7e8 sp=0xc00004d7e0 pc=0x46fc01
created by runtime.gcenable in goroutine 1
    /usr/local/go/src/runtime/mgc.go:200 +0x66

goroutine 4 [GC scavenge wait]:
runtime.gopark(0xc000076000?, 0x59e718?, 0x1?, 0x0?, 0xc0000071e0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004df70 sp=0xc00004df50 pc=0x445fce
runtime.goparkunlock(...)
    /usr/local/go/src/runtime/proc.go:404
runtime.(*scavengerState).park(0xa25d60)
    /usr/local/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc00004dfa0 sp=0xc00004df70 pc=0x430189
runtime.bgscavenge(0x0?)
    /usr/local/go/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc00004dfc8 sp=0xc00004dfa0 pc=0x43071c
runtime.gcenable.func2()
    /usr/local/go/src/runtime/mgc.go:201 +0x25 fp=0xc00004dfe0 sp=0xc00004dfc8 pc=0x427c25
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004dfe8 sp=0xc00004dfe0 pc=0x46fc01
created by runtime.gcenable in goroutine 1
    /usr/local/go/src/runtime/mgc.go:201 +0xa5

goroutine 5 [finalizer wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
    /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00004e628 sp=0xc00004e608 pc=0x445fce
runtime.runfinq()
    /usr/local/go/src/runtime/mfinal.go:193 +0x107 fp=0xc00004e7e0 sp=0xc00004e628 pc=0x426d07
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00004e7e8 sp=0xc00004e7e0 pc=0x46fc01
created by runtime.createfing in goroutine 1
    /usr/local/go/src/runtime/mfinal.go:163 +0x3d

rax    0x6
rbx    0x7fff644530e0
rcx    0x0
rdx    0x7d00
rdi    0x0
rsi    0x0
rbp    0x7fff64452f40
rsp    0x7fff64452e60
r8     0x4e12f840
r9     0x7fff64453180
r10    0x0
r11    0x0
r12    0x7fff64453440
r13    0xa
r14    0x4e12f840
r15    0x4d93ce10
rip    0x51eed0
rflags 0x10206
cs     0x33
fs     0x0
gs     0x0
exit status 2
root@ubuntu:/usr/local/src/go-llama.cpp# CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "../llama.cpp/models/speechless-llama2-13b.Q3_K_S.gguf" -t 10

hiqsociety commented 11 months ago

@mudler if I don't use the llama.cpp that is pulled in automatically, but swap in my own "working" llama.cpp, I get this instead. I hope to use the one pulled automatically, if possible.

root@ubuntu:/usr/local/src/go-llama.cpp# CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "../llama.cpp/models/speechless-llama2-13b.Q3_K_S.gguf" -t 2 -ngl 43
# github.com/go-skynet/go-llama.cpp
binding.cpp: In function ‘int llama_predict(void*, void*, char*, bool)’:
binding.cpp:332:53: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 2 has type ‘int’ [-Wformat=]
  332 |                 printf("<<input too long: skipped %zu token%s>>", skipped_tokens, skipped_tokens != 1 ? "s" : "");
      |                                                   ~~^             ~~~~~~~~~~~~~~
      |                                                     |             |
      |                                                     |             int
      |                                                     long unsigned int
      |                                                   %u
binding.cpp: In function ‘void llama_binding_free_model(void*)’:
binding.cpp:797:5: warning: possible problem detected in invocation of ‘operator delete’ [-Wdelete-incomplete]
  797 |     delete ctx->model;
      |     ^~~~~~~~~~~~~~~~~
binding.cpp:797:17: warning: invalid use of incomplete type ‘struct llama_model’
  797 |     delete ctx->model;
      |            ~~~~~^~~~~
In file included from ./llama.cpp/common/common.h:5,
                 from binding.cpp:1:
./llama.cpp/llama.h:60:12: note: forward declaration of ‘struct llama_model’
   60 |     struct llama_model;
      |            ^~~~~~~~~~~
binding.cpp:797:5: note: neither the destructor nor the class-specific ‘operator delete’ will be called, even if they are declared when the class is defined
  797 |     delete ctx->model;
      |     ^~~~~~~~~~~~~~~~~
# github.com/go-skynet/go-llama.cpp/examples
/usr/local/go/pkg/tool/linux_amd64/link: running g++ failed: exit status 1
/usr/bin/ld: /tmp/go-link-439490360/000002.o: in function `load_model':
/usr/local/src/go-llama.cpp/binding.cpp:946: undefined reference to `load_binding_model(char const*, int, int, bool, bool, bool, bool, bool, int, int, char const*, char const*, bool, float, float, bool, char const*, char const*, bool)'
collect2: error: ld returned 1 exit status

mudler commented 11 months ago

@mudler you are a lifesaver! Where can I buy you a coffee? I really appreciate the work on this.

either https://github.com/sponsors/mudler or https://www.buymeacoffee.com/mudler works

@mudler everything compiles and "runs", but it crashes with the errors below. How do I remedy this?

Please check, thanks.

that should be fixed now in https://github.com/go-skynet/go-llama.cpp/pull/228

hiqsociety commented 11 months ago

It runs, but I can't seem to generate with the same context size as I can without Go. Why?

With an RTX 4060 I can do 1920 max tokens using pure llama.cpp with 100% CUDA offload.

With go-llama I can only do a ctx size of around 650 without OOM.

@mudler do you know why? How do I fix this?

mudler commented 11 months ago

Probably the batch size: have a look at the llama parameters and check what you are setting when using go-llama.

hiqsociety commented 11 months ago

@mudler you are right, it is! Now I have another question: the output of llama.cpp and go-llama.cpp is not the same. I don't mean the random seed; I mean it isn't presented the same way. mirostat = 2 and temp 0.3 are already set, so I'm not sure what else needs to match.

Do you have any idea which settings I have to set to get the "exact" same output as with this llama.cpp invocation?

/main -m models/speechless-llama2-13b.Q3_K_S.gguf -ngl 43 -n 3400 -c 1920 -p "write an article on elon musk." --temp 0.3 --mirostat 2 -t 1