microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

performance of gate_recurrent.py #1555

Open nkkbr opened 1 month ago

nkkbr commented 1 month ago

This is for YOCO/yoco/models/decoder/kernel/gate_recurrent.py

I assume this code is meant to speed up the computation with Triton.

After running

python3 gate_recurrent.py

I got the following output:

naive time: 0.04773402214050293
triton time: 0.5681734085083008
False
tensor(0.0078, device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>) tensor(0.0001, device='cuda:0', dtype=torch.float16, grad_fn=<MeanBackward0>)
False
tensor(0.0078, device='cuda:0', dtype=torch.float16) tensor(0.0001, device='cuda:0', dtype=torch.float16)
False
tensor(0.0078, device='cuda:0', dtype=torch.float16) tensor(0.0002, device='cuda:0', dtype=torch.float16)
False

It seems the Triton version takes more time than the naive one (0.568 vs. 0.048). Is that expected?
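
A possible factor, not confirmed against the script itself: Triton kernels are JIT-compiled on their first call, and CUDA kernel launches are asynchronous, so a single timed call can mostly measure compilation or launch overhead rather than steady-state speed. Below is a minimal sketch of a timing helper that runs warmup calls first and synchronizes before reading the clock. The names `benchmark`, `work`, `warmup`, `iters`, and `sync` are hypothetical and do not come from gate_recurrent.py.

```python
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the mean wall-clock seconds per call of fn, excluding warmup.

    sync is an optional zero-argument callable that blocks until pending
    device work has finished (hypothetical parameter, see the note below).
    """
    # Warmup calls absorb one-time costs such as JIT compilation.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    # Wait for queued work to finish before stopping the clock.
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


# CPU-only stand-in workload so the sketch runs without a GPU.
calls = []


def work():
    calls.append(1)
    sum(range(1000))


mean_s = benchmark(work, warmup=2, iters=5)
print(f"mean time per call: {mean_s:.6f}")
```

For GPU code, one would pass `sync=torch.cuda.synchronize` and wrap the naive and Triton calls in `fn`. If the Triton path is still slower after warmup, the gap is probably not compilation alone.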