DKormann opened this issue 2 weeks ago
I'm sure it can be more minimal than this. Isolate the exact source of the slowness, not a whole LSTM.
After some more digging I found that the CLANG compiler seems to produce n programs of size n.
from tinygrad import Tensor, Device
from tinygrad.engine.realize import method_cache
from tinygrad.helpers import DEBUG

T = 80
Device.DEFAULT = "CLANG"
DEBUG.value = 3       # print the generated kernels
method_cache.clear()  # start from a cold method cache

x = Tensor.rand(2)
for _ in range(T): x = x.sum() + x  # chain of T dependent reduces
x.realize()
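If the diagnosis is right, the schedule should contain roughly one kernel per loop step, and DEBUG=3 shows each successive kernel's source growing. A minimal sketch to check this, assuming create_schedule from tinygrad.engine.schedule accepts the output LazyBuffer (same layout as the repro above):

from tinygrad import Tensor, Device
from tinygrad.engine.schedule import create_schedule

Device.DEFAULT = "CLANG"
for T in (10, 20, 40):
    x = Tensor.rand(2)
    for _ in range(T): x = x.sum() + x
    # expect the kernel count to track T, one per chained reduce
    print(T, len(create_schedule([x.lazydata])))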
Simpler repro found. It's something about chained reduces?
create_schedule creates ASTs that don't reuse already-scheduled results, so as I understand it the same work is recomputed over and over:
from tinygrad import Tensor, Device
from tinygrad.ops import ReduceOps, BinaryOps
from tinygrad.shape.shapetracker import ShapeTracker
from tinygrad.shape.view import View
from tinygrad.engine.schedule import create_schedule
from tinygrad.engine.realize import get_linearizer, CompiledRunner
from tinygrad.engine.graph import print_tree  # location may differ across tinygrad versions

DEVICE = "CLANG"

x = Tensor([1, 2.]).lazydata
del x.srcs  # so the scheduler treats the buffer as an input, not something to compute

# broadcast the scalar reduce result back to shape (2,)
st = ShapeTracker((View.create((2,), strides=(0,)),))
for i in range(4):
    s = x.r(ReduceOps.SUM, [0])._view(st)
    x = x.e(BinaryOps.ADD, s)

sched = create_schedule([x])
for si in sched:
    print_tree(si.ast[0])
    lin = get_linearizer(Device[DEVICE].renderer, (si.ast[0],)).linearize()
    fxn = CompiledRunner(lin.to_program())
    print(fxn.p.src)
Minimal repro of the LSTM being slow on the first forward pass. It can be sped up by using .realize, but as I understand it that shouldn't be necessary. The slowdown seems to correlate with T^2.
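A sketch of the .realize workaround with a rough timing harness (the run helper and the T values are mine, not from the issue): if the diagnosis holds, total time should grow roughly quadratically without the per-step realize and roughly linearly with it.

import time
from tinygrad import Tensor, Device
from tinygrad.engine.realize import method_cache

Device.DEFAULT = "CLANG"

def run(T, realize_each_step):
    method_cache.clear()  # avoid serving compiled kernels from the cache
    start = time.perf_counter()
    x = Tensor.rand(2)
    for _ in range(T):
        x = x.sum() + x
        if realize_each_step: x.realize()  # workaround: cut the lazy chain every step
    x.realize()
    return time.perf_counter() - start

for T in (20, 40, 80):
    print(T, run(T, False), run(T, True))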