@tianyu-l I opened a PR into the meta-llama/llama3 repo with this change.

Personally, I think this kind of change is valid for us to make in torchtitan: it does not affect the model structure, it is not intrusive (the llama3 repo could adopt it easily if they wanted to), and it is numerics-preserving.

Stack from ghstack (oldest at bottom):
**Overview**

This PR replaces `x[:, :, :, None, :]` with `torch.unsqueeze(x, dim=3)` to avoid unnecessary device copies and fill kernels in the backward pass (namely, 4 fills and 4 copies with shape `(bs, seq_len, n_kv_heads, head_dim)`).
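For illustration, a minimal sketch (hypothetical shapes, not from the PR) showing that the two forms produce identical values:

```python
import torch

bs, seq_len, n_kv_heads, head_dim, n_rep = 2, 8, 4, 16, 2
x = torch.randn(bs, seq_len, n_kv_heads, head_dim, requires_grad=True)

# Before: slicing plus a None-index to add the repeat dim.
y_old = x[:, :, :, None, :].expand(bs, seq_len, n_kv_heads, n_rep, head_dim)

# After: a single unsqueeze adds the same dim without any slice ops.
y_new = torch.unsqueeze(x, dim=3).expand(bs, seq_len, n_kv_heads, n_rep, head_dim)

assert torch.equal(y_old, y_new)  # numerics preserving
```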
**Traces**

Existing forward (CPU):
Existing backward (CPU):
Existing backward (GPU):
Each `aten::slice` in the forward leads to a `SliceBackward0`, which is an `aten::zeros` -> `aten::copy_`.
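The difference in recorded autograd nodes can be checked directly; a minimal sketch (hypothetical shapes; exact node names may vary by PyTorch version):

```python
import torch

x = torch.randn(2, 8, 4, 16, requires_grad=True)

old = x[:, :, :, None, :]
new = torch.unsqueeze(x, dim=3)

# In the PyTorch version profiled above, the indexing form records a
# SliceBackward0 per sliced dim (each backing an aten::zeros ->
# aten::copy_ in backward), while unsqueeze records only
# UnsqueezeBackward0.
print(old.grad_fn)  # e.g. <SliceBackward0 object at ...>
print(new.grad_fn)  # e.g. <UnsqueezeBackward0 object at ...>
```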
New forward (CPU):
New backward (CPU):
New backward (GPU): no more fill kernels or device copies
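Traces like the ones above can be collected with `torch.profiler`; a sketch under stated assumptions (the `repeat_kv_old` helper is hypothetical, modeled on the reference-style implementation, and shapes are illustrative):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical repeat_kv-style helper using the old indexing, for tracing only.
def repeat_kv_old(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    bs, slen, n_kv_heads, head_dim = x.shape
    return (
        x[:, :, :, None, :]
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
    )

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 8, 4, 16, device=device, requires_grad=True)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    repeat_kv_old(x, n_rep=2).sum().backward()

# Look for aten::zeros / aten::copy_ under SliceBackward0 in the table,
# or export a Chrome trace for a timeline view.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
prof.export_chrome_trace("trace.json")
```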
**Test Plan**

- Added `torch.manual_seed(0)` and `torch.use_deterministic_algorithms(True, warn_only=False)` at the beginning of `main` in `train.py` (sketched below).
- Ran `CUBLAS_WORKSPACE_CONFIG=:4096:8 CONFIG_FILE=train_configs/llama3_8b.toml ./run_llama_train.sh` with DP=8, batch size 1, and no activation checkpointing (AC).
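A minimal sketch of that determinism setup (the exact placement within `main` in `train.py` may differ):

```python
import torch

def main():
    # Seed and force deterministic kernels so the old and new code paths
    # can be compared bit-wise; warn_only=False raises on any
    # nondeterministic op instead of silently proceeding.
    torch.manual_seed(0)
    torch.use_deterministic_algorithms(True, warn_only=False)
    ...  # rest of training setup / loop
```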