performance of gate_recurrent.py

This is about YOCO/yoco/models/decoder/kernel/gate_recurrent.py

I assume this code is meant to accelerate the computation using Triton.

After running

python3 gate_recurrent.py

I got the following output:

naive time: 0.04773402214050293
triton time: 0.5681734085083008
False
tensor(0.0078, device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>) tensor(0.0001, device='cuda:0', dtype=torch.float16, grad_fn=<MeanBackward0>)
False
tensor(0.0078, device='cuda:0', dtype=torch.float16) tensor(0.0001, device='cuda:0', dtype=torch.float16)
False
tensor(0.0078, device='cuda:0', dtype=torch.float16) tensor(0.0002, device='cuda:0', dtype=torch.float16)
False

It seems that the Triton version takes about 12 times longer than the naive one (0.568 vs. 0.048). Is that expected?
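One possible explanation, which I have not verified against this script: Triton kernels are JIT-compiled on their first call, and CUDA launches are asynchronous, so a single unsynchronized timing can include compile time or miss kernel time. Below is a minimal sketch of a fairer timing loop. The `benchmark` helper and the names `naive_fn` / `triton_fn` are hypothetical and not part of gate_recurrent.py.

```python
import time


def benchmark(fn, *, warmup=3, iters=10, sync=None):
    """Return mean wall-clock seconds per call of fn, excluding warm-up calls.

    sync is an optional callable run before starting and after stopping the
    timer; on a GPU it would typically be torch.cuda.synchronize.
    """
    sync = sync or (lambda: None)

    # Warm-up calls absorb one-time costs (e.g. JIT compilation, autotuning).
    for _ in range(warmup):
        fn()
    sync()  # make sure warm-up work has finished before timing starts

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for all queued work before reading the clock
    return (time.perf_counter() - start) / iters


# Hypothetical usage on a CUDA machine (naive_fn / triton_fn are placeholders):
#   import torch
#   t_naive = benchmark(naive_fn, sync=torch.cuda.synchronize)
#   t_triton = benchmark(triton_fn, sync=torch.cuda.synchronize)
```

If the Triton version is still slower after warm-up and synchronization, the gap is probably not just compile overhead.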

microsoft / unilm

performance of gate_recurrent.py #1555