Add layernorm and fix bug for embedding bwd

FindHao commented 3 weeks ago

Add a layer norm operator from Liger Kernel and fix a bug in the embedding backward pass. Disable Liger kernels in internal CI.
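For reference, the computation that a layer norm operator performs on one row can be sketched in plain Python. This is only an illustrative reference, not the Liger or Triton kernel from this PR; the function name `layer_norm`, the default `eps`, and the sample row are assumptions made for the example.

```python
import math


def layer_norm(x, weight, bias, eps=1e-5):
    """Normalize one row to zero mean and unit variance, then scale and shift.

    Hypothetical reference helper for illustration; not code from this PR.
    """
    n = len(x)
    mean = sum(x) / n
    # Layer norm uses the biased (population) variance over the row.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w + b for v, w, b in zip(x, weight, bias)]


# Sample row with identity scale and zero shift (assumed values).
row = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(row, weight=[1.0] * 4, bias=[0.0] * 4)
print(out)
```

With identity weight and zero bias, the output row has a mean of approximately zero and a variance of approximately one, which is a quick sanity check for any faster implementation.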

Test Plan:

python run.py --op layer_norm --num-inputs 4 --metrics latency  
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:15<00:00,  3.81s/it]
  x_val    torch_layer_norm-latency    triton_layer_norm-latency    torch_compile_layer_norm-latency    liger_layer_norm-latency
-------  --------------------------  ---------------------------  ----------------------------------  --------------------------
   1024                    0.028896                     0.024512                            0.02448                     0.023808
   1536                    0.038688                     0.034144                            0.05584                     0.033536
   2048                    0.048704                     0.043424                            0.059424                    0.043104
   2560                    0.058112                     0.05472                             0.083712                    0.054176
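To make the table easier to compare, the relative speedup of each backend over `torch_layer_norm` can be computed from the reported latencies. The snippet below is a small sketch that only copies the numbers from the table above; the dictionary layout and the helper name `speedup_over_torch` are assumptions for illustration and are not part of tritonbench.

```python
# Latencies copied from the table above (same units as reported there).
latency = {
    1024: {"torch": 0.028896, "triton": 0.024512, "torch_compile": 0.02448, "liger": 0.023808},
    1536: {"torch": 0.038688, "triton": 0.034144, "torch_compile": 0.05584, "liger": 0.033536},
    2048: {"torch": 0.048704, "triton": 0.043424, "torch_compile": 0.059424, "liger": 0.043104},
    2560: {"torch": 0.058112, "triton": 0.05472, "torch_compile": 0.083712, "liger": 0.054176},
}


def speedup_over_torch(row):
    """Return baseline latency divided by each backend's latency (hypothetical helper)."""
    base = row["torch"]
    return {name: base / t for name, t in row.items() if name != "torch"}


speedups = {x_val: speedup_over_torch(row) for x_val, row in latency.items()}

for x_val, per_backend in speedups.items():
    # A value above 1.0 means the backend was faster than the torch baseline.
    print(x_val, {name: round(v, 3) for name, v in per_backend.items()})
```

In these four measurements the Liger and Triton kernels are faster than the torch baseline at every size, while `torch_compile_layer_norm` is faster only at `x_val` 1024.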

facebook-github-bot commented 3 weeks ago

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot commented 3 weeks ago

@FindHao merged this pull request in pytorch-labs/tritonbench@66a7cc96eff83ea98e027cda7683e08b0cb7c437.

Add layernorm and fix bug for embedding bwd #32