Feature Request - Beam Search Decoder

Hi MLX team, I would like to request a reference implementation of a beam search decoder for one of the text generation examples. The current examples only cover greedy and top-p sampling. I have written a naive beam search myself, but it runs on the CPU and is slow because it relies on many nested Python for loops. It would be helpful if someone from your team could provide a reference implementation that uses MLX kernels and makes efficient use of the GPU or vectorized CPU kernels.
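To make the request concrete, here is a minimal sketch of the kind of vectorized expansion step that could replace the per-beam loops. It is written in NumPy rather than MLX so that it runs anywhere; porting it to MLX arrays is untested here. The names `beam_search_step`, `beam_search`, and `next_log_probs` are hypothetical and only serve as illustration, and the sketch leaves out end-of-sequence handling and length normalization.

```python
import numpy as np


def beam_search_step(beam_scores, log_probs, beam_width):
    """Expand all live beams at once and keep the best `beam_width` candidates.

    beam_scores: shape (B,), cumulative log-probability of each live beam.
    log_probs:   shape (B, V), next-token log-probabilities for each beam.
    Returns (scores, parent_beam, token), each of shape (k,), sorted best first,
    where k = min(beam_width, B * V).
    """
    num_beams, vocab_size = log_probs.shape
    # Score every (beam, token) continuation in one broadcasted add.
    candidates = (beam_scores[:, None] + log_probs).reshape(-1)
    k = min(beam_width, candidates.size)
    # argpartition selects the k best candidates without a full sort.
    top = np.argpartition(-candidates, k - 1)[:k]
    # Sort only those k candidates, best first.
    top = top[np.argsort(-candidates[top])]
    # Recover which beam each winner came from and which token it appended.
    parent_beam, token = np.divmod(top, vocab_size)
    return candidates[top], parent_beam, token


def beam_search(next_log_probs, start_token, beam_width, max_len):
    """Run `max_len` expansion steps from a single start token.

    `next_log_probs` is a hypothetical callable standing in for the model:
    it maps an int array of shape (B, T) to log-probabilities of shape (B, V).
    """
    sequences = np.array([[start_token]])
    scores = np.zeros(1)
    for _ in range(max_len):
        log_probs = next_log_probs(sequences)
        scores, parent, token = beam_search_step(scores, log_probs, beam_width)
        # Reorder the surviving prefixes and append the chosen tokens.
        sequences = np.concatenate([sequences[parent], token[:, None]], axis=1)
    return sequences, scores


# Toy "model": the next-token distribution depends only on the last token.
TABLE = np.log(np.array([[0.5, 0.4, 0.1],
                         [0.1, 0.1, 0.8],
                         [0.3, 0.3, 0.4]]))
sequences, scores = beam_search(lambda s: TABLE[s[:, -1]], 0, beam_width=2, max_len=2)
print(sequences[0], np.exp(scores[0]))
```

In this toy example a greedy decoder would follow 0 → 0 → 0 (probability 0.25), while a beam width of 2 recovers 0 → 1 → 2 (probability 0.32). The only Python loop left is the one over time steps; everything inside a step is an array operation, which is the part that I would hope maps onto GPU kernels.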

I am happy to collaborate on this if I can get some guidance from your team.

ml-explore / mlx-examples

Feature Request - Beam Search Decoder #846