tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License

[wip] Tensor refactoring attempt #20

Closed: tairov closed this 11 months ago

tairov commented 12 months ago

While trying to convert Matrix -> Tensor, I got errors like this:

mojo llama2.mojo stories15M.bin -n 256 -t 0.0
num hardware threads:  6
SIMD vector width:  16
checkpoint size:  60816028
387
391
402 run foo fn
foo
free(): invalid pointer
[296420:296420:20230916,141735.287926:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[296420:296420:20230916,141735.288032:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.  Program arguments: mojo llama2.mojo stories15M.bin -n 256 -t 0.0
 #0 0x00005625f3647957 (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb957)
 #1 0x00005625f364552e (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b952e)
 #2 0x00005625f364802f (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bc02f)
 #3 0x00007f12ac501420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
 #4 0x00007f12abf8a00b raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300b)
 #5 0x00007f12abf69859 abort (/lib/x86_64-linux-gnu/libc.so.6+0x22859)
 #6 0x00007f12abfd426e (/lib/x86_64-linux-gnu/libc.so.6+0x8d26e)
 #7 0x00007f12abfdc2fc (/lib/x86_64-linux-gnu/libc.so.6+0x952fc)
 #8 0x00007f12abfddb2c (/lib/x86_64-linux-gnu/libc.so.6+0x96b2c)
 #9 0x00007f12340071fc
[1]    296418 abort (core dumped)  mojo llama2.mojo stories15M.bin -n 256 -t 0.0

I just invoked the foo fn, which does nothing:

        print('402 run foo fn')
        foo(state.xb, x, tmpw)
        print(403)

example
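For context, foo itself isn't quoted above; judging by the lone "foo" line in the log, it's presumably a no-op along these lines (the TensorF32 argument types are assumed from the call site, and the real signature in the PR may differ):

    # Hypothetical no-op matching the "foo" print in the log above;
    # the actual signature in the PR may differ.
    fn foo(a: TensorF32, b: TensorF32, c: TensorF32):
        print("foo")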

Probably Mojo is trying to free some of the tensors. Maybe it's somehow related to value lifecycles?
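If that's the case, a minimal sketch of the suspected failure mode would be two tensors built over the same pointer, each taking ownership of the buffer, so it gets freed twice on destruction. (This assumes the 2023-era Mojo stdlib; the import paths and the TensorSpec constructor are from memory and may differ by version.)

    from tensor import Tensor, TensorSpec
    from memory.unsafe import DTypePointer

    alias TensorF32 = Tensor[DType.float32]

    fn main():
        var spec = TensorSpec(DType.float32, 4)
        var ptr = DTypePointer[DType.float32].alloc(4)
        # Both tensors wrap the same allocation; each assumes ownership,
        # so the buffer is freed twice when they go out of scope.
        var a = TensorF32(ptr, spec)
        var b = TensorF32(ptr, spec)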

mikowals commented 11 months ago

There are a couple of typos that lead to out-of-bounds reads of the pointers.

Even with these changes the error above still occurs.

tairov commented 11 months ago

@mikowals thanks, I changed bytecount -> num_elements (and also removed those 2 lines from the previous PR on master).

Yeah, the issue is still there.

    print(736)
    var tmpw = TensorF32(
        weights.rms_att_weight.data(), get_tspec_f32(config.dim)
    )
    foo(state.xb, state.xb, state.xb)
    print(741)
    #foo(state.xb, state.x, tmpw) # error on fn exit
    foo(state.xb, state.x, state.x) # causes error on line 740 !!!
    print(744)

The foo invocation causes crashes, even before the main loop.

I just got a really strange behaviour: this line causes an error on line 740

    foo(state.xb, state.x, state.x)
num hardware threads:  6
SIMD vector width:  16
checkpoint size:  60816028
736
[356002:356002:20230921,234948.386818:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[356002:356002:20230921,234948.386867:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.  Program arguments: mojo llama2.mojo stories15M.bin -t 0.0 -n 256
#0 0x000055ebcb4cd797 (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bc797)
#1 0x000055ebcb4cb36e (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5ba36e)
#2 0x000055ebcb4cde6f (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bce6f)
#3 0x00007f12aaf40420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#4 0x00007f12aaa20722 free (/lib/x86_64-linux-gnu/libc.so.6+0x9a722)
#5 0x00007f122c00629d
[1]    356000 segmentation fault  mojo llama2.mojo stories15M.bin -t 0.0 -n 256

but when I comment out that line and uncomment the previous line, I get this

    print(741)
    foo(state.xb, state.x, tmpw) # error on fn exit
    #foo(state.xb, state.x, state.x) # causes error on line 740 !!!
num hardware threads:  6
SIMD vector width:  16
checkpoint size:  60816028
736
foo
741
foo
[356231:356231:20230921,235348.587596:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[356231:356231:20230921,235348.587653:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.  Program arguments: mojo llama2.mojo stories15M.bin -t 0.0 -n 256
#0 0x0000564c1c736797 (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bc797)
#1 0x0000564c1c73436e (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5ba36e)
#2 0x0000564c1c736e6f (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bce6f)
#3 0x00007ffa73e5d420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#4 0x00007ffa7393d722 free (/lib/x86_64-linux-gnu/libc.so.6+0x9a722)
#5 0x00007ffa0000632f
[1]    356229 segmentation fault  mojo llama2.mojo stories15M.bin -t 0.0 -n 256

I think I need to isolate the issue and file a bug for this.

mikowals commented 11 months ago

I created a minimal reproduction and opened this issue.

mikowals commented 11 months ago

The resolution of the issue I raised is that this is our bug.

We are trying to work around the lack of support for read-only tensor slices (or for keeping a list of Tensor values, so that we wouldn't need to slice at all). In the Matrix struct we work around this with the 'allocated' flag, which blocks 'self.data.free' when the memory is shared.

If we want to work around this, creating a list to hold the Tensors is probably easiest. Otherwise we should add some comments in Matrix to make it clearer exactly how 'allocated' works, roughly as in the sketch below.
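For reference, the 'allocated' mechanism is roughly the following pattern (a simplified sketch, not the exact llama2.mojo code; field names besides 'data' and 'allocated' are illustrative):

    from memory.unsafe import DTypePointer

    struct Matrix:
        var data: DTypePointer[DType.float32]
        var rows: Int
        var cols: Int
        var allocated: Int

        # Owning constructor: Matrix allocates the buffer and must free it.
        fn __init__(inout self, rows: Int, cols: Int):
            self.rows = rows
            self.cols = cols
            self.data = DTypePointer[DType.float32].alloc(rows * cols)
            self.allocated = 1

        # Non-owning view over memory owned elsewhere (e.g. the checkpoint
        # weights); allocated = 0 blocks the free in __del__.
        fn __init__(inout self, rows: Int, cols: Int, data: DTypePointer[DType.float32]):
            self.rows = rows
            self.cols = cols
            self.data = data
            self.allocated = 0

        fn __del__(owned self):
            if self.allocated == 1:
                self.data.free()

Tensor's pointer constructor takes ownership of the buffer, so there is no equivalent escape hatch when several tensors are built over the same weights pointer, which appears to be why the shared buffers end up freed more than once.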

tairov commented 11 months ago

The Tensor migration was done in #39.