tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License
2.09k stars · 140 forks

Segmentation fault error when it is built as binary #13

Closed · 3DAlgoLab closed this issue 12 months ago

3DAlgoLab commented 1 year ago

Thanks for your fantastic project. Out of curiosity, I tried to build it as a binary. It seemed to build at first, but it didn't work: it showed a message telling me to set the Python path. After I set that environment variable, a segmentation fault occurred. I think it came from the Mojo builder, maybe. My environment is WSL on Windows 11.
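For reference, building and running the binary looks roughly like this (the model name and flags are just an example, and the build step is from the Mojo CLI of that era):

mojo build llama2.mojo
./llama2 stories15M.bin -s 99 -t 0.9 -n 256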

antmikinka commented 1 year ago

I am running into the same issue as well. I saw that the llama2.c export.py and tokenizer.py files are how the .bin files were generated for this llama2.mojo project. I created my own .bin files (tokenizer and model) from Hugging Face (openlm-research/open_llama_3b_v2). Ubuntu terminal:

antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads:  12
SIMD vector width:  16
checkpoint size:  13706713628
771
774
[199253:199253:20230913,213623.555832:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[199253:199253:20230913,213623.556029:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.      Program arguments: mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i hello
#0 0x00005595d8269717 (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb717)
#1 0x00005595d82672ee (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b92ee)
#2 0x00005595d8269def (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bbdef)
#3 0x00007f68165df520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#4 0x00007f678400758a
Segmentation fault

Below is my main function in Mojo:

fn main() raises:
    print("num hardware threads: ", num_cores())
    print("SIMD vector width: ", nelts)
    var tokenizer = StringRef("tokenizer_open_llama_3b_v2.bin")
    var checkpoint = StringRef("open_llama3bv2.bin")
    var temperature = 0.9
    var steps = 256
    var prompt = String("")
    var rng_seed: Int = time.now()

    @parameter
    fn argparse() raises -> Int:
        let args = argv()
        if len(args) < 2:
            return 0
        checkpoint = args[1]
        for i in range(2, len(args), 2):
            if args[i] == "-p":
                print("Option not supported: ", args[i])
            if args[i] == "-n":
                steps = atol(args[i + 1])
            if args[i] == "-s":
                rng_seed = atol(args[i + 1])
            if args[i] == "-i":
                prompt = args[i + 1]
            if args[i] == "-t":
                let val = args[i + 1]
                temperature = 0.0
                # hacky parse float, keep only 1 digit
                for c in range(0, len(val)):
                    if val[c] == ".":
                        temperature += atol(val[c + 1]) * 0.1
                        break
                    else:
                        temperature = atol(val[c])
                if temperature < -1e9 or temperature > (1 + 1e9):
                    print("Wrong temperature value", temperature)
                    return 0
        return 1

    let res = argparse()
    if res == 0:
        print_usage()
        return

    random.seed(rng_seed)
    var fbuf: FileBuf = FileBuf()
    var tbuf: FileBuf = FileBuf()
    var config: Config = Config()

    read_file(checkpoint, fbuf)
    print("checkpoint size: ", fbuf.size)
    config_init(config, fbuf)

    # negative vocab size is hacky way of signaling unshared weights. bit yikes.
    let shared_weights = 1 if config.vocab_size > 0 else 0
    config.vocab_size = (
        -config.vocab_size if config.vocab_size < 0 else config.vocab_size
    )

    let weights: TransformerWeights = TransformerWeights(config, shared_weights, fbuf)

    var tok: Tokenizer = Tokenizer(config.vocab_size)

    if steps <= 0 or steps > config.seq_len:
        steps = config.seq_len

    print("771")

    # Read in the tokenizer.bin file
    read_file(tokenizer, tbuf)
    print("774")
    tokenizer_init(tok, tbuf)
    print("776")
    # Create and initialize the application RunState
    var state = RunState(config)

    # Process the prompt, if any
    var prompt_tokens = DynamicVector[Int]()

    if prompt:
        bpe_encode(prompt_tokens, prompt, tok)

    # Start the main loop
    var start = 0  # Used to time our code, only initialized after the first iteration
    var next_token = 0  # Will store the next token in the sequence
    # Initialize with token 1 (=BOS), as done in Llama-2 sentencepiece tokenizer
    var token = 1

    # Position in the sequence
    var pos = 0
    while pos < steps:
        # Forward the transformer to get logits for the next token
        transformer(token, pos, config, state, weights)

        if pos < len(prompt_tokens):
            next_token = prompt_tokens[pos]
        else:
            # Sample the next token
            if temperature == 0.0:
                # Greedy argmax sampling: take the token with the highest probability
                next_token = argmax(state.logits)
            else:
                # Apply the temperature to the logits
                for q in range(config.vocab_size):
                    state.logits[q] = state.logits[q] / temperature
                # Apply softmax to the logits to get the probabilities for the next token
                softmax(state.logits.data, config.vocab_size)
                # Sample from this distribution to get the next token
                next_token = sample(state.logits)

        var token_str: PointerString = tok.vocab[next_token]
        if token == 1 and token_str[0] == ord(" "):
            token_str = token_str.offset(1)

        print_str(token_str)

        # Advance forward
        token = next_token
        pos += 1

        if start == 0:
            start = time_in_ms()

    let end = time_in_ms()
    print("\nachieved tok/s: ", (steps - 1) / (end - start) * 1000)

I was able to narrow the problem down to the tokenizer_init function, but I am unsure whether the tok or tbuf may also be causing an issue. I tried the native tokenizer and tried mine; neither worked with the model.bin I made.

Below is the tokenizer I made from open_llama_3b_v2, used with the stories15M.bin model:

antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo stories15M.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads:  12
SIMD vector width:  16
checkpoint size:  60816028
771
774
776
ndpl   ~接 whatsoever special manan Ar \\‚ ch Lawreen con provided suchルたined asere -ories<0x0A>点 // f C Beyondreen Brerendumstiew Spanish‚ aroundondular().stqu trad接oplete  yl on something ge==ionan{\‚ con shopping化ownvequir accomplan increase Creen接 conוeddingized siature F接ve //ndacon获 care接 C nyah‚能<0x0A>rib   oegel asym Cou reportily Justethelessopleought meas接pe reducingan on something ge== C ivle flow‚ve //nd Cont doesnionREportiff ten favoriteoard‚ve //nd C asymormersOffset For or freezer oran on something ge== Cacon获 metJs poweranplements‚ chplaceholderily Johnny接ie likedportte care接 lease Donst PerfectronanStore leastah‚rafteliew poemsiff="_iewns Can describesportiff president‚
<s>
ur Football C have gameilyionandfrac de_{unchmundefined‚ur Footballiff class Wrong C vis week asstAT
achieved tok/s:  233.08957952468006

I am on Windows 10, Ubuntu 22.04 (WSL). I may have incorrectly configured tokenizer.py or export.py, although I am not sure. If those scripts are not fully automatic and need hardcoded values, then my files would be messed up.
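For reference, my understanding of the llama2.c export flow (the flag names are from the export.py/tokenizer.py of that time and may differ in other revisions):

python export.py open_llama3bv2.bin --hf openlm-research/open_llama_3b_v2
python tokenizer.py --tokenizer-model=path/to/tokenizer.model

The first command defaults to the version 0 (legacy) format; the second converts a sentencepiece tokenizer.model into the tokenizer .bin.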

antmikinka commented 1 year ago

While I was messing around with swapping tokenizers and models, I also noticed this error before I installed llvm (for llvm-symbolizer) on my Ubuntu environment, via sudo apt-get install llvm. Below is the output from before I installed it; still a segmentation fault.

antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads:  12
SIMD vector width:  16
checkpoint size:  13706713628
771
774
776
[198120:198120:20230913,211303.365100:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[198120:198120:20230913,211303.365169:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.      Program arguments: mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i hello
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  mojo      0x000055d62938d717
1  mojo      0x000055d62938b2ee
2  mojo      0x000055d62938ddef
3  libc.so.6 0x00007fba03a90520
4  libc.so.6 0x00007fb96c00758a
Segmentation fault

antmikinka commented 1 year ago

Just finished trying another model from HF (teknium/OpenHermes-13B). This time I only created the model file using export.py. I noticed that the first time I created one, I did not pass any version, meaning version 0 was used: the legacy format with no header.
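As far as I can tell from reading export.py (my reading, not confirmed): a version 0 (legacy) file starts directly with seven int32 config fields followed by the fp32 weights, while version 1/2 files start with a magic uint32 (0x616b3432, ASCII "ak42") and an int32 version number. A legacy-format reader like llama2.mojo would therefore misinterpret a versioned header as config values. A quick Python check along those lines:

import struct

# Inspect the first 28 bytes of a checkpoint, assuming llama2.c conventions.
with open("open_llama3bv2.bin", "rb") as f:
    head = f.read(28)
if struct.unpack("<I", head[:4])[0] == 0x616B3432:
    print("versioned export, version", struct.unpack("<i", head[4:8])[0])
else:
    # legacy layout: dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, seq_len
    print(struct.unpack("<7i", head))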

I tried a version 1 model, which created a 50 GB file; llama2.mojo instantly did not like that. The terminal threw this error:

antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo openhermes-13b-fp32v1.bin -s 99 -t 1.0
num hardware threads:  12
SIMD vector width:  16
[1268344:1268344:20230914,000242.889217:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[1268344:1268344:20230914,000242.889286:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.      Program arguments: mojo llama2.mojo openhermes-13b-fp32v1.bin -s 99 -t 1.0
#0 0x000055bd43817717 (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb717)
#1 0x000055bd438152ee (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b92ee)
#2 0x000055bd43817def (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bbdef)
#3 0x00007fbae3b84520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#4 0x00007fbae3ce2af7 (/lib/x86_64-linux-gnu/libc.so.6+0x1a0af7)
#5 0x00007fba4c003849
Segmentation fault

I then tried version 2, where the weights are quantized. That took the 26.6 GB .bin model file from HF down to roughly 13.3 GB, and llama2.mojo seemed to like that better. I made no other modifications to my llama2.mojo file (I did not change how any layers are read, and I left in those print statements). Below is my terminal output.

antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo openhermes-13b.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads:  12
SIMD vector width:  16

checkpoint size:  13830574336
771
774
776
Killed

I added further print statements to narrow down where this error happens. It is at var state = RunState(config), or at least that is where it starts: the run was still killed right after printing 776 and printed nothing further.
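If RunState allocates llama2.c-style fp32 buffers sized from the config (an assumption on my part; the names below are llama2.c's, and llama2.mojo may differ), a Killed at that point is just the kernel's OOM killer, since the buffer sizes come straight from whatever the header parse produced:

def runstate_bytes(dim, hidden_dim, n_layers, n_heads, seq_len, vocab_size):
    # x, xb, xb2, q, k, v: dim floats each; hb, hb2: hidden_dim floats each;
    # att: n_heads * seq_len; logits: vocab_size;
    # key_cache and value_cache: n_layers * seq_len * dim each; all float32.
    floats = (6 * dim + 2 * hidden_dim + n_heads * seq_len + vocab_size
              + 2 * n_layers * seq_len * dim)
    return 4 * floats

# A sane 13B-class config needs ~3.1 GiB of RunState on top of the weights:
print(runstate_bytes(5120, 13824, 40, 40, 2048, 32000) / 2**30)
# With garbage config values from a misread header, the caches explode instead.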

Here is the modified fn main:

fn main() raises:
    print("num hardware threads: ", num_cores())
    print("SIMD vector width: ", nelts)
    var tokenizer = StringRef("tokenizer.bin")
    var checkpoint = StringRef("openhermes-13b.bin")
    var temperature = 0.9
    var steps = 256
    var prompt = String("")
    var rng_seed: Int = time.now()

    @parameter
    fn argparse() raises -> Int:
        let args = argv()
        if len(args) < 2:
            return 0
        checkpoint = args[1]
        for i in range(2, len(args), 2):
            if args[i] == "-p":
                print("Option not supported: ", args[i])
            if args[i] == "-n":
                steps = atol(args[i + 1])
            if args[i] == "-s":
                rng_seed = atol(args[i + 1])
            if args[i] == "-i":
                prompt = args[i + 1]
            if args[i] == "-t":
                let val = args[i + 1]
                temperature = 0.0
                # hacky parse float, keep only 1 digit
                for c in range(0, len(val)):
                    if val[c] == ".":
                        temperature += atol(val[c + 1]) * 0.1
                        break
                    else:
                        temperature = atol(val[c])
                if temperature < -1e9 or temperature > (1 + 1e9):
                    print("Wrong temperature value", temperature)
                    return 0
        return 1

    let res = argparse()
    if res == 0:
        print_usage()
        return

    random.seed(rng_seed)
    var fbuf: FileBuf = FileBuf()
    var tbuf: FileBuf = FileBuf()
    var config: Config = Config()

    read_file(checkpoint, fbuf)
    print("checkpoint size: ", fbuf.size)
    config_init(config, fbuf)

    # negative vocab size is hacky way of signaling unshared weights. bit yikes.
    let shared_weights = 1 if config.vocab_size > 0 else 0
    config.vocab_size = (
        -config.vocab_size if config.vocab_size < 0 else config.vocab_size
    )

    let weights: TransformerWeights = TransformerWeights(config, shared_weights, fbuf)

    var tok: Tokenizer = Tokenizer(config.vocab_size)

    if steps <= 0 or steps > config.seq_len:
        steps = config.seq_len

    print("771")

    # Read in the tokenizer.bin file
    read_file(tokenizer, tbuf)
    print("774")
    tokenizer_init(tok, tbuf)
    print("776")
    # Create and initialize the application RunState
    var state = RunState(config)
    print("779")
    # Process the prompt, if any
    var prompt_tokens = DynamicVector[Int]()
    print("782")
    if prompt:
        bpe_encode(prompt_tokens, prompt, tok)
    print("785")
    # Start the main loop
    var start = 0  # Used to time our code, only initialized after the first iteration
    var next_token = 0  # Will store the next token in the sequence
    # Initialize with token 1 (=BOS), as done in Llama-2 sentencepiece tokenizer
    var token = 1

    # Position in the sequence
    var pos = 0
    while pos < steps:
        # Forward the transformer to get logits for the next token
        transformer(token, pos, config, state, weights)
        print("797")
        if pos < len(prompt_tokens):
            next_token = prompt_tokens[pos]
            print("800")
        else:
            # Sample the next token
            if temperature == 0.0:
                # Greedy argmax sampling: take the token with the highest probability
                next_token = argmax(state.logits)
                print("806")
            else:
                # Apply the temperature to the logits
                for q in range(config.vocab_size):
                    state.logits[q] = state.logits[q] / temperature
                    print("811")
                # Apply softmax to the logits to get the probabilities for the next token

                softmax(state.logits.data, config.vocab_size)
                # Sample from this distribution to get the next token
                next_token = sample(state.logits)

        var token_str: PointerString = tok.vocab[next_token]
        if token == 1 and token_str[0] == ord(" "):
            token_str = token_str.offset(1)

        print_str(token_str)

        # Advance forward
        token = next_token
        pos += 1

        if start == 0:
            start = time_in_ms()

    let end = time_in_ms()
    print("\nachieved tok/s: ", (steps - 1) / (end - start) * 1000)

tairov commented 1 year ago

A 13 GB model.. sounds a bit crazy for a tiny-LLM loader 😄 I haven't tried anything bigger than 110M.. At the moment llama2.mojo is trying to load the full model into memory. Could you make sure you have enough memory on your WSL in Windows?

Also, I'm pretty sure the performance will be awful for a 3B model.. Try with smaller models first, like up to 1 GB, then move on to bigger ones.
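For scale: an fp32 export stores about 4 bytes per parameter, so the checkpoint size reported above is consistent with a ~3.4B-parameter model that has to fit into RAM in full before any runtime buffers:

print(13706713628 / 4 / 1e9)   # ≈ 3.43 billion fp32 parameters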

tairov commented 1 year ago

@antmikinka add this after config_init

print(config.dim, config.seq_len, config.vocab_size)

I think you're trying to load too much data into memory.

antmikinka commented 1 year ago

@tairov Added it; the output is below.

Would there be any way to extend the amount of RAM? I'm running a GTX 1650 with a Ryzen 5 3600 and figured the Mojo approach could help me interact with large LLMs. I'm also looking into the paper below; it would be amazing on other platforms/models/languages.

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (edit: realized this is the bitsandbytes package, which is already integrated with HF transformers lol)

Memory stats below (I actually have 47.9 GB installed on the machine): sudo lshw -c memory

antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ sudo lshw -c memory
  *-memory
       description: System memory
       physical id: 0
       size: 24GiB

free -m

antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ free -m
               total        used        free      shared  buff/cache   available
Mem:           23999         457       12034           2       11506       23210
Swap:           6144          41        6102

open_llama3bv2.bin (version 0, legacy)

antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads:  12
SIMD vector width:  16
checkpoint size:  13706713628
3200 2048 4294935296
771
774
[621:621:20230914,094711.851638:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[621:621:20230914,094711.851702:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.      Program arguments: mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i hello
#0 0x00005611e3675717 (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb717)
#1 0x00005611e36732ee (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b92ee)
#2 0x00005611e3675def (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bbdef)
#3 0x00007f2a61f69520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#4 0x00007f29d4003783
Segmentation fault
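A side note on the config line in this output: 3200 and 2048 are plausible dim/seq_len values for open_llama_3b, but 4294935296 is exactly 2**32 - 32000, i.e. what a vocab_size of -32000 looks like when read without sign extension. That is only a guess, but it would explain the crash: export.py writes a negative vocab_size to signal unshared weights, the negative-vocab check in main then never fires, and tokenizer_init walks a ~4-billion-entry vocab far past the end of tbuf.

import struct
raw = struct.pack("<i", -32000)        # vocab_size as export.py would write it
print(struct.unpack("<I", raw)[0])     # 4294935296, the value printed above
print(2**32 - 32000)                   # same number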

openhermes-13b.bin (it's version 2, so quantized)

antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo openhermes-13b.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads:  12
SIMD vector width:  16
checkpoint size:  13830574336
1634415666 40 40
771
774
776
Killed
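And on this output: 1634415666 is 0x616b3432, whose ASCII bytes spell "ak42", which as far as I can tell is the magic number export.py puts at the start of version 1/2 files. So the legacy reader is parsing the versioned header as config, which matches the garbage dim above.

print(hex(1634415666))                      # 0x616b3432
print(bytes.fromhex("616b3432").decode())   # 'ak42'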

openhermes-13b-fp32v1.bin (version 1, full fp32)

antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo openhermes-13b-fp32v1.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads:  12
SIMD vector width:  16
[759:759:20230914,095323.773219:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[759:759:20230914,095323.773275:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.      Program arguments: mojo llama2.mojo openhermes-13b-fp32v1.bin -s 99 -t 1.0 -n 256 -i hello
#0 0x00005625316fc717 (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb717)
#1 0x00005625316fa2ee (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b92ee)
#2 0x00005625316fcdef (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bbdef)
#3 0x00007f4170d64520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#4 0x00007f4170ec2af7 (/lib/x86_64-linux-gnu/libc.so.6+0x1a0af7)
#5 0x00007f40e4008009
Segmentation fault

VMois commented 1 year ago

Binaries do not work by default in Mojo. A seg fault can occur because of an incorrect Python path: the path must point to the Python shared library, which is different from the executable. Check this issue for possible solutions: https://github.com/modularml/mojo/issues/551.

For GitHub Codespaces, here is the answer: https://github.com/modularml/mojo/issues/551#issuecomment-1719839374
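For example (the exact .so path is system-specific; locate yours with find /usr/lib -name 'libpython*.so*'):

export MOJO_PYTHON_LIBRARY=/usr/lib/x86_64-linux-gnu/libpython3.10.so
./llama2 stories15M.bin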

tairov commented 12 months ago

@antmikinka HF models are not supported, sorry :)

See my comment regarding other types of LLMs: https://github.com/tairov/llama2.mojo/issues/22#issuecomment-1722418487