go-skynet / go-llama.cpp

LLama.cpp golang bindings
MIT License

CuBLAS gives the same output every time #265

Open · runarheggset opened this issue 11 months ago

runarheggset commented 11 months ago

Running the following code with CuBLAS returns the same output every time it's run; running without CuBLAS returns a different generation each time, as expected.

package main

import (
    "log"

    "github.com/go-skynet/go-llama.cpp"
)

func main() {
    // Path to a local GGUF model file.
    model := "../models/airoboros-l2-13b-3.1.Q4_K_M.gguf"

    // Load the model with default model options.
    l, err := llama.New(model)
    if err != nil {
        panic(err)
    }
    defer l.Free()

    // Sampling options for the generation.
    opts := []llama.PredictOption{
        llama.SetTokens(500),
        llama.SetThreads(20),
        llama.SetTopK(20),
        llama.SetTopP(0.9),
        llama.SetTemperature(0.7),
        llama.SetPenalty(1.15),
    }

    prompt := "Hello"

    text, err := l.Predict(prompt, opts...)
    if err != nil {
        panic(err)
    }

    log.Print(text)
}

Output with CuBLAS: , I'm interested in 10000 W 127th St, Palos Park, IL 60465. Please send me more information about this property.
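
For what it's worth, a minimal variation of the repro that asks for a fresh seed on every run is sketched below. It assumes the binding exposes llama.SetSeed as a predict option and that a negative seed follows the upstream llama.cpp convention of picking a random one, which may not hold in this version. If CuBLAS runs still produce identical text with this set, the sampling options are probably not reaching the C++ side on that path.

package main

import (
    "log"

    "github.com/go-skynet/go-llama.cpp"
)

func main() {
    // Same model file as in the repro above.
    l, err := llama.New("../models/airoboros-l2-13b-3.1.Q4_K_M.gguf")
    if err != nil {
        panic(err)
    }
    defer l.Free()

    // Assumption: llama.SetSeed exists as a PredictOption and -1 requests a
    // random seed (the upstream llama.cpp convention).
    text, err := l.Predict("Hello",
        llama.SetTokens(500),
        llama.SetTemperature(0.7),
        llama.SetSeed(-1),
    )
    if err != nil {
        panic(err)
    }
    log.Print(text)
}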

retme7 commented 10 months ago

I have the same issue. llama.cpp works fine, but go-llama.cpp with CuBLAS does not.

deep-pipeline commented 9 months ago

Getting the same generation output every time (presumably even with different prompts; you aren't entirely clear on that) suggests two things: the prompt passed via go-llama is being dropped and replaced with some placeholder, and the temperature for the run is probably being set to zero.

On the positive side, the fact that any response comes out at all suggests the execution path is hooked up.

You just need to find where (and why) the cuBLAS-specific execution path ends up with a static prompt and temperature zero; that static prompt will correspond to the repeated output you keep seeing.

Sorry, I'm not involved in maintaining the project; I was just reading through the issue backlog to get a feel for where it's at and thought you might find the observation helpful. I did notice that Metal execution had an issue (since addressed) that involved pulling over a ggml-metal file. If I were you, I'd first make sure all the code I have locally matches the latest go-llama code base, then have a quick look for the cuBLAS equivalent in the current code base and check for anything with temp=0 or a hard-coded prompt. After that, I'd work out where the go-llama execution path forks depending on whether cuBLAS is used, and follow the cuBLAS path to the point where things are handed over to llama.cpp code; the problem will be somewhere in there. Good luck, and remember that ChatGPT or Claude are your code-explorer friends.
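
One quick way to test that "options are being dropped" hypothesis from the Go side, before digging into the C++ path, is sketched below. It only uses calls that already appear in the report above; if the two generations come back byte-identical under CuBLAS (and repeated runs keep matching), the sampling parameters are most likely not being forwarded on that path.

package main

import (
    "fmt"

    "github.com/go-skynet/go-llama.cpp"
)

func main() {
    l, err := llama.New("../models/airoboros-l2-13b-3.1.Q4_K_M.gguf")
    if err != nil {
        panic(err)
    }
    defer l.Free()

    // Run the same prompt with deliberately extreme temperatures. With
    // sampling working, the two generations should normally differ.
    low, err := l.Predict("Hello", llama.SetTokens(100), llama.SetTemperature(0.1))
    if err != nil {
        panic(err)
    }
    high, err := l.Predict("Hello", llama.SetTokens(100), llama.SetTemperature(1.5))
    if err != nil {
        panic(err)
    }

    fmt.Printf("temp=0.1: %q\n", low)
    fmt.Printf("temp=1.5: %q\n", high)
    // Identical output here (and across repeated runs) points at the options
    // not reaching llama.cpp rather than at the sampler itself.
    fmt.Println("identical:", low == high)
}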