Open nkkbr opened 6 months ago
This is for YOCO/yoco/models/decoder/kernel/gate_recurrent.py
I assume this code is meant to accelerate some computation using Triton.
After running
python3 gate_recurrent.py
I got the following output:
naive time: 0.04773402214050293
triton time: 0.5681734085083008
False
tensor(0.0078, device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>) tensor(0.0001, device='cuda:0', dtype=torch.float16, grad_fn=<MeanBackward0>)
False
tensor(0.0078, device='cuda:0', dtype=torch.float16) tensor(0.0001, device='cuda:0', dtype=torch.float16)
False
tensor(0.0078, device='cuda:0', dtype=torch.float16) tensor(0.0002, device='cuda:0', dtype=torch.float16)
False
It seems that the Triton version takes more time than the naive one (about 0.57 s vs. 0.048 s)?
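One thing that may be worth checking: Triton kernels are JIT-compiled on their first call, and CUDA work is launched asynchronously, so a single timed call can include compilation time and can miss work that is still running. Whether gate_recurrent.py already accounts for this is not known here. The sketch below is a generic timing helper, not code from the repository; `benchmark` and its `warmup`, `iters`, and `sync` parameters are hypothetical names chosen for illustration.

```python
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the mean wall-clock seconds per call of fn().

    warmup: untimed calls made first (absorbs one-time costs such as JIT compilation).
    sync:   optional callable that blocks until asynchronous work is finished,
            e.g. torch.cuda.synchronize when timing GPU code (assumption: PyTorch is used).
    """
    # Untimed warm-up calls.
    for _ in range(warmup):
        fn()
    # Make sure warm-up work has finished before starting the clock.
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    # Wait for all timed work to finish before stopping the clock.
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    # CPU-only demonstration; with PyTorch on a GPU one would pass
    # sync=torch.cuda.synchronize and wrap the naive / Triton calls in lambdas.
    mean_s = benchmark(lambda: sum(range(10_000)))
    print(f"mean time per call: {mean_s:.6f} s")
```

If the Triton time drops sharply once warm-up calls are excluded, the gap in the output above would mostly reflect one-time compilation rather than steady-state kernel speed.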