Add custom ops jsd and fused_linear_jsd

Migrated from https://github.com/pytorch/benchmark/pull/2518
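
For reviewers unfamiliar with the metric, here is a minimal pure-Python sketch of the quantity the `jsd` op is assumed to benchmark, the Jensen-Shannon divergence between two discrete distributions. The helper names `kl_divergence` and `jsd_reference` are hypothetical and are not part of this PR; the benchmarked kernels operate on tensors and may use a different weighting or reduction. `fused_linear_jsd` is assumed to apply a linear projection (the `lm_head` in the table headers) before the same loss.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats. Terms with p_i == 0 contribute 0 by convention.

    Assumes q_i > 0 wherever p_i > 0 (true for the mixture used below).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd_reference(p, q):
    """Jensen-Shannon divergence with equal weights (hypothetical reference)."""
    # Mixture distribution M = (P + Q) / 2
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

The result is symmetric in its arguments, zero for identical distributions, and bounded above by `math.log(2)`.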

Test Plan:

% python run.py --op jsd,fused_linear_jsd --num-inputs 1
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.94s/it]
  x_val    torch_jsd-latency    liger_jsd-latency    inductor_jsd-latency
-------  -------------------  -------------------  ----------------------
      0               2.1768             0.461984                0.154944
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.69s/it]
  x_val    torch_lm_head_jsd-latency    liger_lm_head_jsd-latency    inductor_lm_head_jsd-latency
-------  ---------------------------  ---------------------------  ------------------------------
      0                      73.6553                      362.348                         66.4232