Closed: 3DAlgoLab closed this issue 12 months ago
I am running into the same issue as well. I have seen that the llama2.c code and its export.py and tokenizer.py files are how the .bin files were generated for this llama2.mojo project. I created my own .bin files (tokenizer and model) from Hugging Face (openlm-research/open_llama_3b_v2). Ubuntu terminal:
```
antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 13706713628
771
774
[199253:199253:20230913,213623.555832:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[199253:199253:20230913,213623.556029:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.	Program arguments: mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i hello
 #0 0x00005595d8269717 (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb717)
 #1 0x00005595d82672ee (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b92ee)
 #2 0x00005595d8269def (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bbdef)
 #3 0x00007f68165df520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007f678400758a
Segmentation fault
```
Below is my `main` function in Mojo:
```mojo
fn main() raises:
    print("num hardware threads: ", num_cores())
    print("SIMD vector width: ", nelts)
    var tokenizer = StringRef("tokenizer_open_llama_3b_v2.bin")
    var checkpoint = StringRef("open_llama3bv2.bin")
    var temperature = 0.9
    var steps = 256
    var prompt = String("")
    var rng_seed: Int = time.now()

    @parameter
    fn argparse() raises -> Int:
        let args = argv()
        if len(args) < 2:
            return 0
        checkpoint = args[1]
        for i in range(2, len(args), 2):
            if args[i] == "-p":
                print("Option not supported: ", args[i])
            if args[i] == "-n":
                steps = atol(args[i + 1])
            if args[i] == "-s":
                rng_seed = atol(args[i + 1])
            if args[i] == "-i":
                prompt = args[i + 1]
            if args[i] == "-t":
                let val = args[i + 1]
                temperature = 0.0
                # hacky parse float, keep only 1 digit
                for c in range(0, len(val)):
                    if val[c] == ".":
                        temperature += atol(val[c + 1]) * 0.1
                        break
                    else:
                        temperature = atol(val[c])
                if temperature < -1e9 or temperature > (1 + 1e9):
                    print("Wrong temperature value", temperature)
                    return 0
        return 1

    let res = argparse()
    if res == 0:
        print_usage()
        return

    random.seed(rng_seed)
    var fbuf: FileBuf = FileBuf()
    var tbuf: FileBuf = FileBuf()
    var config: Config = Config()

    read_file(checkpoint, fbuf)
    print("checkpoint size: ", fbuf.size)
    config_init(config, fbuf)

    # negative vocab size is hacky way of signaling unshared weights. bit yikes.
    let shared_weights = 1 if config.vocab_size > 0 else 0
    config.vocab_size = (
        -config.vocab_size if config.vocab_size < 0 else config.vocab_size
    )

    let weights: TransformerWeights = TransformerWeights(config, shared_weights, fbuf)

    var tok: Tokenizer = Tokenizer(config.vocab_size)

    if steps <= 0 or steps > config.seq_len:
        steps = config.seq_len
    print("771")
    # Read in the tokenizer.bin file
    read_file(tokenizer, tbuf)
    print("774")
    tokenizer_init(tok, tbuf)
    print("776")

    # Create and initialize the application RunState
    var state = RunState(config)

    # Process the prompt, if any
    var prompt_tokens = DynamicVector[Int]()
    if prompt:
        bpe_encode(prompt_tokens, prompt, tok)

    # Start the main loop
    var start = 0  # Used to time our code, only initialized after the first iteration
    var next_token = 0  # Will store the next token in the sequence
    # Initialize with token 1 (=BOS), as done in Llama-2 sentencepiece tokenizer
    var token = 1

    # Position in the sequence
    var pos = 0
    while pos < steps:
        # Forward the transformer to get logits for the next token
        transformer(token, pos, config, state, weights)

        if pos < len(prompt_tokens):
            next_token = prompt_tokens[pos]
        else:
            # Sample the next token
            if temperature == 0.0:
                # Greedy argmax sampling: take the token with the highest probability
                next_token = argmax(state.logits)
            else:
                # Apply the temperature to the logits
                for q in range(config.vocab_size):
                    state.logits[q] = state.logits[q] / temperature
                # Apply softmax to the logits to get the probabilities for the next token
                softmax(state.logits.data, config.vocab_size)
                # Sample from this distribution to get the next token
                next_token = sample(state.logits)

        var token_str: PointerString = tok.vocab[next_token]
        if token == 1 and token_str[0] == ord(" "):
            token_str = token_str.offset(1)
        print_str(token_str)

        # Advance forward
        token = next_token
        pos += 1
        if start == 0:
            start = time_in_ms()

    let end = time_in_ms()
    print("\nachieved tok/s: ", (steps - 1) / (end - start) * 1000)
```
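For reference, the legacy (version-0) llama2.c checkpoint begins with seven int32 config fields, which is presumably what `config_init` parses here. Below is a Python sketch based on llama2.c's `run.c`; the field order and types are assumptions taken from that project, not from this repo, and `read_legacy_config` is a hypothetical helper:

```python
import struct

def read_legacy_config(path):
    """Read the 7-int32 header that llama2.c's legacy (version-0) export
    writes at the start of the checkpoint file."""
    with open(path, "rb") as f:
        dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, seq_len = \
            struct.unpack("<7i", f.read(28))
    # A negative vocab_size signals unshared classifier weights,
    # mirroring the shared_weights logic in main() above.
    shared_weights = vocab_size > 0
    return dict(dim=dim, hidden_dim=hidden_dim, n_layers=n_layers,
                n_heads=n_heads, n_kv_heads=n_kv_heads,
                vocab_size=abs(vocab_size), seq_len=seq_len,
                shared_weights=shared_weights)
```

Running this against a freshly exported .bin is a quick way to confirm the header landed where the loader expects it.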
I was able to narrow the problem down to the `tokenizer_init` function, but I am unsure whether `tok` or `tbuf` may also be causing an issue. I tried the native tokenizer and I tried mine; neither worked with the model.bin I made.
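One way to sanity-check a generated tokenizer.bin before handing it to `tokenizer_init` is to walk its records in Python. This sketch assumes the llama2.c tokenizer layout (an int32 max token length, then per token a float32 score, an int32 length, and that many bytes); `count_tokenizer_entries` is a hypothetical helper, not part of the repo:

```python
import struct

def count_tokenizer_entries(path):
    """Count the token records in a llama2.c-style tokenizer.bin.
    If this count does not match config.vocab_size, tokenizer_init
    would read past the end of the buffer."""
    count = 0
    with open(path, "rb") as f:
        f.read(4)  # skip max_token_length (int32)
        while True:
            rec = f.read(8)  # float32 score + int32 byte length
            if len(rec) < 8:
                break
            _, length = struct.unpack("<fi", rec)
            if length < 0 or len(f.read(length)) != length:
                break  # truncated or malformed record
            count += 1
    return count
```

Comparing this count against the vocab size in the checkpoint header would show quickly whether tokenizer and model agree.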
Below is the output from the tokenizer I made from open_llama_3b_v2, used with stories15M.bin:
```
antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo stories15M.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 60816028
771
774
776
ndpl ~接 whatsoever special manan Ar \\‚ ch Lawreen con provided suchルたined asere -ories<0x0A>点 // f C Beyondreen Brerendumstiew Spanish‚ aroundondular().stqu trad接oplete yl on something ge==ionan{\‚ con shopping化ownvequir accomplan increase Creen接 conוeddingized siature F接ve //ndacon获 care接 C nyah‚能<0x0A>rib oegel asym Cou reportily Justethelessopleought meas接pe reducingan on something ge== C ivle flow‚ve //nd Cont doesnionREportiff ten favoriteoard‚ve //nd C asymormersOffset For or freezer oran on something ge== Cacon获 metJs poweranplements‚ chplaceholderily Johnny接ie likedportte care接 lease Donst PerfectronanStore leastah‚rafteliew poemsiff="_iewns Can describesportiff president‚
<s>
ur Football C have gameilyionandfrac de_{unchmundefined‚ur Footballiff class Wrong C vis week asstAT
achieved tok/s:  233.08957952468006
```
I am on Windows 10 running Ubuntu 22.04. I may have incorrectly configured tokenizer.py or export.py, although I am not sure. If those scripts are not auto-configured and need hardcoded values, then my files would be messed up.
While I was messing around with swapping tokenizers and models, I also noticed this error before I installed llvm (for llvm-symbolizer) in my Ubuntu env. I installed it via `sudo apt-get install llvm`. Below is the output from before I installed it; still a segmentation fault.
```
antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 13706713628
771
774
776
[198120:198120:20230913,211303.365100:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[198120:198120:20230913,211303.365169:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.	Program arguments: mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i hello
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  mojo      0x000055d62938d717
1  mojo      0x000055d62938b2ee
2  mojo      0x000055d62938ddef
3  libc.so.6 0x00007fba03a90520
4  libc.so.6 0x00007fb96c00758a
Segmentation fault
```
Just finished trying another model from HF (teknium/OpenHermes-13B), created only with export.py. I noticed that the first time I created one I did not specify any version, meaning version 0 was used: the legacy format with no header.
I then tried a version 1 model; it created a 50 GB file, and the program instantly did not like that. The terminal threw this error:
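Since the versioned export formats carry a header that the legacy path does not expect, a quick check of which format a .bin actually uses could look like this. The assumption here (taken from llama2.c's export.py, not from this repo) is that newer exports begin with the magic uint32 0x616b3432, ASCII "ak42", followed by an int32 version; `detect_export_version` is a hypothetical helper:

```python
import struct

def detect_export_version(path):
    """Guess which llama2.c export format a .bin file uses."""
    with open(path, "rb") as f:
        magic, version = struct.unpack("<Ii", f.read(8))
    if magic == 0x616B3432:  # "ak42"
        return version       # 1 = fp32 with header, 2 = int8-quantized
    return 0                 # no magic: legacy, header-less layout
```

Notably, 0x616b3432 is decimal 1634415666, which matches the first config value printed later in this thread for the version-2 file, which would suggest the versioned header was being parsed as if it were the legacy config.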
```
antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo openhermes-13b-fp32v1.bin -s 99 -t 1.0
num hardware threads: 12
SIMD vector width: 16
[1268344:1268344:20230914,000242.889217:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[1268344:1268344:20230914,000242.889286:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.	Program arguments: mojo llama2.mojo openhermes-13b-fp32v1.bin -s 99 -t 1.0
 #0 0x000055bd43817717 (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb717)
 #1 0x000055bd438152ee (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b92ee)
 #2 0x000055bd43817def (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bbdef)
 #3 0x00007fbae3b84520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007fbae3ce2af7 (/lib/x86_64-linux-gnu/libc.so.6+0x1a0af7)
 #5 0x00007fba4c003849
Segmentation fault
```
I then tried version 2, where the weights are quantized. This took the 26.6 GB .bin model file from HF down to roughly 13.3 GB. llama2.mojo seemed to like that, with no other modifications to my llama2.mojo file (such as changing how certain layers are read) and with those print statements left in. Below is my terminal output:
```
antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo openhermes-13b.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 13830574336
771
774
776
Killed
```
I added further print statements to narrow down where this error occurs. The error is at `var state = RunState(config)`, or at least that is where it starts. The run was still killed right after printing 776 and printed nothing further. Here is the modified `fn main`:
```mojo
fn main() raises:
    print("num hardware threads: ", num_cores())
    print("SIMD vector width: ", nelts)
    var tokenizer = StringRef("tokenizer.bin")
    var checkpoint = StringRef("openhermes-13b.bin")
    var temperature = 0.9
    var steps = 256
    var prompt = String("")
    var rng_seed: Int = time.now()

    @parameter
    fn argparse() raises -> Int:
        let args = argv()
        if len(args) < 2:
            return 0
        checkpoint = args[1]
        for i in range(2, len(args), 2):
            if args[i] == "-p":
                print("Option not supported: ", args[i])
            if args[i] == "-n":
                steps = atol(args[i + 1])
            if args[i] == "-s":
                rng_seed = atol(args[i + 1])
            if args[i] == "-i":
                prompt = args[i + 1]
            if args[i] == "-t":
                let val = args[i + 1]
                temperature = 0.0
                # hacky parse float, keep only 1 digit
                for c in range(0, len(val)):
                    if val[c] == ".":
                        temperature += atol(val[c + 1]) * 0.1
                        break
                    else:
                        temperature = atol(val[c])
                if temperature < -1e9 or temperature > (1 + 1e9):
                    print("Wrong temperature value", temperature)
                    return 0
        return 1

    let res = argparse()
    if res == 0:
        print_usage()
        return

    random.seed(rng_seed)
    var fbuf: FileBuf = FileBuf()
    var tbuf: FileBuf = FileBuf()
    var config: Config = Config()

    read_file(checkpoint, fbuf)
    print("checkpoint size: ", fbuf.size)
    config_init(config, fbuf)

    # negative vocab size is hacky way of signaling unshared weights. bit yikes.
    let shared_weights = 1 if config.vocab_size > 0 else 0
    config.vocab_size = (
        -config.vocab_size if config.vocab_size < 0 else config.vocab_size
    )

    let weights: TransformerWeights = TransformerWeights(config, shared_weights, fbuf)

    var tok: Tokenizer = Tokenizer(config.vocab_size)

    if steps <= 0 or steps > config.seq_len:
        steps = config.seq_len
    print("771")
    # Read in the tokenizer.bin file
    read_file(tokenizer, tbuf)
    print("774")
    tokenizer_init(tok, tbuf)
    print("776")

    # Create and initialize the application RunState
    var state = RunState(config)
    print("779")

    # Process the prompt, if any
    var prompt_tokens = DynamicVector[Int]()
    print("782")
    if prompt:
        bpe_encode(prompt_tokens, prompt, tok)
    print("785")

    # Start the main loop
    var start = 0  # Used to time our code, only initialized after the first iteration
    var next_token = 0  # Will store the next token in the sequence
    # Initialize with token 1 (=BOS), as done in Llama-2 sentencepiece tokenizer
    var token = 1

    # Position in the sequence
    var pos = 0
    while pos < steps:
        # Forward the transformer to get logits for the next token
        transformer(token, pos, config, state, weights)
        print("797")

        if pos < len(prompt_tokens):
            next_token = prompt_tokens[pos]
            print("800")
        else:
            # Sample the next token
            if temperature == 0.0:
                # Greedy argmax sampling: take the token with the highest probability
                next_token = argmax(state.logits)
                print("806")
            else:
                # Apply the temperature to the logits
                for q in range(config.vocab_size):
                    state.logits[q] = state.logits[q] / temperature
                print("811")
                # Apply softmax to the logits to get the probabilities for the next token
                softmax(state.logits.data, config.vocab_size)
                # Sample from this distribution to get the next token
                next_token = sample(state.logits)

        var token_str: PointerString = tok.vocab[next_token]
        if token == 1 and token_str[0] == ord(" "):
            token_str = token_str.offset(1)
        print_str(token_str)

        # Advance forward
        token = next_token
        pos += 1
        if start == 0:
            start = time_in_ms()

    let end = time_in_ms()
    print("\nachieved tok/s: ", (steps - 1) / (end - start) * 1000)
```
13 GB model.. sounds a bit crazy for a tiny-LLM loader 😄
I haven't tried bigger than 110M.
At the moment llama2.mojo is trying to load the full model into memory. Could you make sure you have enough memory in your WSL on Windows?
Also, I'm pretty sure the performance will be awful for a 3B model. Try smaller models first, like up to 1 GB, then move on to bigger ones.
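A back-of-the-envelope check of the memory point above (plain arithmetic; this deliberately ignores the KV cache and activation buffers, so real usage is higher):

```python
def fp32_model_mem_gib(n_params_billion, bytes_per_param=4):
    """Rough lower bound on RAM needed just for the weights when
    the whole model is loaded into memory as fp32."""
    return n_params_billion * 1e9 * bytes_per_param / 2**30

# 3B fp32  -> ~11 GiB of weights
# 13B fp32 -> ~48 GiB of weights, well above the ~24 GiB the WSL VM
# reports, so the kernel OOM killer's "Killed" is the expected outcome
```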
@antmikinka add this after `config_init`:
`print(config.dim, config.seq_len, config.vocab_size)`
I think you're trying to load too much data into memory.
@tairov Added it; the output is below.
Would there be any way to extend the amount of RAM? I'm running a GTX 1650 and a Ryzen 5 3600, and figured the Mojo approach could help me interact with large LLMs. I'm also looking into LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, which would be amazing on other platforms/models/languages (edit: realized that this is the bitsandbytes package, which is already implemented with HF transformers lol).
Memory stats below (I actually have 47.9 GB installed on the machine).
sudo lshw -c memory

```
antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ sudo lshw -c memory
  *-memory
       description: System memory
       physical id: 0
       size: 24GiB
```

free -m

```
antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ free -m
               total        used        free      shared  buff/cache   available
Mem:           23999         457       12034           2       11506       23210
Swap:           6144          41        6102
```
open_llama3bv2.bin (version 0, legacy)
```
antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 13706713628
3200 2048 4294935296
771
774
[621:621:20230914,094711.851638:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[621:621:20230914,094711.851702:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.	Program arguments: mojo llama2.mojo open_llama3bv2.bin -s 99 -t 1.0 -n 256 -i hello
 #0 0x00005611e3675717 (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb717)
 #1 0x00005611e36732ee (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b92ee)
 #2 0x00005611e3675def (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bbdef)
 #3 0x00007f2a61f69520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007f29d4003783
Segmentation fault
```
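One possible reading of the `4294935296` printed above: it equals 2**32 - 32000, i.e. -32000 (a negated vocab size, which is exactly the unshared-weights signal the loader's comment describes) reinterpreted as an unsigned 32-bit value. If the header field is read without sign handling, the negative-vocab-size branch never fires and the code tries to build a ~4-billion-entry vocab. A tiny sketch of the two's-complement reinterpretation (the `as_signed_i32` helper is hypothetical, for illustration only):

```python
def as_signed_i32(u):
    """Reinterpret an unsigned 32-bit value as a signed int32
    (two's complement)."""
    return u - 2**32 if u >= 2**31 else u

# 4294935296 reinterprets to -32000, matching OpenLLaMA's 32000-token vocab
```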
openhermes-13b.bin (its version 2, so quantized)
```
antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo openhermes-13b.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads: 12
SIMD vector width: 16
checkpoint size: 13830574336
1634415666 40 40
771
774
776
Killed
```
openhermes-13b-fp32v1.bin (version 1, full fp32)
```
antmikinka@Antwon-XinFin-Node:~/llama2.mojo$ mojo llama2.mojo openhermes-13b-fp32v1.bin -s 99 -t 1.0 -n 256 -i "hello"
num hardware threads: 12
SIMD vector width: 16
[759:759:20230914,095323.773219:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[759:759:20230914,095323.773275:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.	Program arguments: mojo llama2.mojo openhermes-13b-fp32v1.bin -s 99 -t 1.0 -n 256 -i hello
 #0 0x00005625316fc717 (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb717)
 #1 0x00005625316fa2ee (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b92ee)
 #2 0x00005625316fcdef (/home/antmikinka/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bbdef)
 #3 0x00007f4170d64520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #4 0x00007f4170ec2af7 (/lib/x86_64-linux-gnu/libc.so.6+0x1a0af7)
 #5 0x00007f40e4008009
Segmentation fault
```
Compiled binaries do not work by default in Mojo. A seg fault can occur because of an incorrect Python path: the path must point to the Python library, which is different from the Python executable. Check this issue for possible solutions: https://github.com/modularml/mojo/issues/551
For GitHub Codespaces, here is the answer: https://github.com/modularml/mojo/issues/551#issuecomment-1719839374
@antmikinka HF models are not supported, sorry :)
See my comment regarding other types of LLMs: https://github.com/tairov/llama2.mojo/issues/22#issuecomment-1722418487
Thanks for your fantastic project. Out of curiosity, I tried to build it as a binary. It seemed to build at first, but it didn't work: it showed a message telling me to set the Python path. But after I set that environment variable, a segmentation fault occurred. I think it came from the Mojo builder, maybe. My environment is WSL on Windows 11.