I loved seeing the blog post with a simple, standalone implementation of many of the techniques used in production to speed up LLMs. I would love to see this extended to MoE models like Mixtral, which at the moment seem fairly annoying to use and hack on. I'm curious how torch.compile can help with these, and what issues might arise, such as graph breaks due to gating.