tairov closed this issue 11 months ago.
There are a couple of typos that lead out of bounds when reading the pointers. `Matrix.size()` is `TensorSpec.num_elements()`. Moving the offset by `TensorSpec.bytecount()` instead means you need to amend `FileBuf.bitcast_offset_float32()` to move the offset by `size`, not `size * sizeof[DType.float32]()`.

`weights.wcls` is being read from the checkpoint file but is not there. In master it is copied from `token_embedding_table`, so the code should be `TensorF32(self.token_embedding_table.data(), tspec)`.

Even with these changes the error above still occurs.
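To make the offset bug concrete, here is a minimal Python sketch (the real code is Mojo; `pack`, `off`, and the two-tensor buffer are illustrative, not from the repo). It shows why advancing the byte offset by `bytecount()` while the reader also multiplies by `sizeof(float32)` scales the step twice and runs past the end of the buffer:

```python
import struct

SIZEOF_F32 = 4  # sizeof[DType.float32]()

def pack(floats):
    return struct.pack(f"<{len(floats)}f", *floats)

# A checkpoint-like buffer: two 2-element float32 tensors back to back (16 bytes).
data = pack([1.0, 2.0]) + pack([3.0, 4.0])

# Correct: advance the byte offset by num_elements * sizeof(float32), exactly once.
off = 0
first = struct.unpack_from("<2f", data, off)
off += 2 * SIZEOF_F32                         # 8 bytes
second = struct.unpack_from("<2f", data, off)
print(second)                                 # (3.0, 4.0)

# Buggy: the caller already passes bytecount() (num_elements * 4) and the
# reader multiplies by sizeof(float32) again, scaling the offset twice.
bad_off = (2 * SIZEOF_F32) * SIZEOF_F32       # 32 bytes instead of 8
print(bad_off > len(data))                    # True: next read is out of bounds
```

Either convention works on its own; the crash comes from mixing them, which is why fixing one side requires amending the other.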
@mikowals thanks, I changed `bytecount` -> `num_elements` (also removed those 2 lines from the previous PR on master). Yeah, the issue is still there.
```mojo
print(736)
var tmpw = TensorF32(
    weights.rms_att_weight.data(), get_tspec_f32(config.dim)
)
foo(state.xb, state.xb, state.xb)
print(741)
# foo(state.xb, state.x, tmpw)   # error on fn exit
foo(state.xb, state.x, state.x)  # causes error on line 740 !!!
print(744)
```
The `foo` invocation causes crashes, even before the main loop. I just got really strange behaviour: this line causes an error on line 740:

```mojo
foo(state.xb, state.x, state.x)
```
```
num hardware threads: 6
SIMD vector width: 16
checkpoint size: 60816028
736
[356002:356002:20230921,234948.386818:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[356002:356002:20230921,234948.386867:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0. Program arguments: mojo llama2.mojo stories15M.bin -t 0.0 -n 256
#0 0x000055ebcb4cd797 (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bc797)
#1 0x000055ebcb4cb36e (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5ba36e)
#2 0x000055ebcb4cde6f (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bce6f)
#3 0x00007f12aaf40420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#4 0x00007f12aaa20722 free (/lib/x86_64-linux-gnu/libc.so.6+0x9a722)
#5 0x00007f122c00629d
[1] 356000 segmentation fault mojo llama2.mojo stories15M.bin -t 0.0 -n 256
```
But when I comment that line out and uncomment the previous line, I get this:

```mojo
print(741)
foo(state.xb, state.x, tmpw)      # error on fn exit
# foo(state.xb, state.x, state.x) # causes error on line 740 !!!
```
```
num hardware threads: 6
SIMD vector width: 16
checkpoint size: 60816028
736
foo
741
foo
[356231:356231:20230921,235348.587596:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[356231:356231:20230921,235348.587653:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0. Program arguments: mojo llama2.mojo stories15M.bin -t 0.0 -n 256
#0 0x0000564c1c736797 (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bc797)
#1 0x0000564c1c73436e (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5ba36e)
#2 0x0000564c1c736e6f (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bce6f)
#3 0x00007ffa73e5d420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#4 0x00007ffa7393d722 free (/lib/x86_64-linux-gnu/libc.so.6+0x9a722)
#5 0x00007ffa0000632f
[1] 356229 segmentation fault mojo llama2.mojo stories15M.bin -t 0.0 -n 256
```
I think I need to isolate the issue and file a bug for this.
The resolution of the issue I raised is that this is our bug.

We are trying to work around the lack of support for read-only tensor slices (or for keeping a list of type `Tensor` so we don't need to slice at all). In the `Matrix` structure we work around this by using the `allocated` flag to block `self.data.free()` when memory is shared.

If we wanted to work around this, creating a list to hold `Tensor`s is probably easiest. Otherwise we should probably add some comments in `Matrix` to make it clearer exactly how `allocated` works.
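For illustration, here is a Python sketch of the `allocated`-flag pattern described above (this is not the actual Mojo `Matrix`; the class, `shared_data` parameter, and `free` method are simplified stand-ins). The idea is that only the instance that allocated the buffer is allowed to free it:

```python
import ctypes

class Matrix:
    """Sketch of an ownership flag: views over shared memory never free it."""

    def __init__(self, n, shared_data=None):
        if shared_data is None:
            self.data = (ctypes.c_float * n)()  # owned allocation
            self.allocated = True
        else:
            self.data = shared_data             # view into someone else's buffer
            self.allocated = False              # must never free shared memory
        self.n = n

    def free(self):
        # Guard: without this check, both the owner and every view would try
        # to free the same pointer, causing a double free.
        if self.allocated:
            self.data = None
            self.allocated = False

owner = Matrix(4)
view = Matrix(4, owner.data)   # shares the owner's buffer
view.free()                    # no-op: the view does not own the memory
print(owner.data is not None)  # True
owner.free()                   # only the owner actually releases it
print(owner.data is None)      # True
```

This is the same discipline the `allocated` flag enforces in `Matrix`: correctness depends on the flag being set right at construction time, which is why documenting it (or avoiding slicing altogether with a list of `Tensor`s) was suggested.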
Tensors migration done within #39
Trying to convert `Matrix` -> `Tensor`, I got errors like this: I just invoked a `foo` fn that does nothing (see the example above). Probably Mojo is trying to free some of the tensors. Maybe it's somehow related to lifecycles?
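A minimal Python sketch of the suspected lifecycle problem (hypothetical: `fake_free`, `destroy`, and the pointer values are illustrative stand-ins for Mojo value destruction and libc `free`). If two values end up sharing one pointer and each releases it when destroyed, the second release is a double free, which matches the `free` frame in the backtraces above:

```python
freed = set()

def fake_free(ptr):
    # Stand-in for libc free(): a real double free can segfault the process.
    if ptr in freed:
        raise RuntimeError("double free of 0x%x" % ptr)
    freed.add(ptr)

class Tensor:
    """Sketch: a value that releases its pointer on destruction. Two values
    sharing one pointer (a view created without taking ownership) will both
    try to free it."""
    def __init__(self, ptr):
        self.ptr = ptr
    def destroy(self):
        fake_free(self.ptr)

a = Tensor(0x1000)
b = Tensor(a.ptr)   # aliases the same underlying pointer
a.destroy()         # first free: fine
try:
    b.destroy()     # second free of the same pointer
except RuntimeError as e:
    print(e)        # double free of 0x1000
```

Under this reading, passing `state.x` as both an input and output of `foo` could produce exactly this aliasing, and the `allocated` flag in `Matrix` was the guard that prevented it before the `Tensor` migration.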